Noisy Numbers data and Numbers speech recognizer

This web page describes a noisy version of the OGI Numbers corpus (version 1.3) and provides tools and data that you can use to reproduce the noisy version yourself. We also provide the recognition scripts for a neural net (multi-layer-perceptron) HMM speech recognizer for the clean and noisy versions of Numbers 1.3. Note that some people refer to all versions of the Numbers corpus as Numbers95 or Numbers 95, as in citation [1] in the bibliography below, while others will use the term Numbers95 to refer to an earlier version with much less data than version 1.3.

For a similar web page for the ISOLET corpus, see here.

The noisy Numbers corpus and speech recognizer provided here were used in PhD thesis work on ensemble feature selection by David Gelbart at ICSI under advisor Nelson Morgan. Also, an earlier version of the noisy Numbers corpus provided here, which was based on version 1.0 of the Numbers corpus (Numbers95) rather than on version 1.3, was used in the INTERSPEECH 2008 paper "Multi-Stream Spectro-Temporal Features for Robust Speech Recognition" by Sherry Zhao and Nelson Morgan. If you publish results using the noisy corpus or speech recognizer provided here, please let David Gelbart know so that he can add your work to this list.

Numbers

The Numbers corpus was produced by the Oregon Graduate Institute's Center for Spoken Language Understanding. The corpus consists of strings of spoken numbers collected over telephone connections. Numbers version 1.3 contains much more data than the original 1.0 release.

For creation of a noisy version of Numbers, we divided the corpus into train, validation and test sets exactly as in IDIAP's benchmark [1]. These respectively contain about 6.2, 2.2 and 2.3 hours of audio. It can be useful to view the validation set as development test data and the test set as evaluation test data.

The design of the noisy version of the Numbers corpus was heavily influenced by Aurora 2. However, we believe that the Numbers corpus makes it easier than Aurora 2 to make a distinction between development data used to make design choices or adjust system parameters and evaluation data used to report final results.

The three sets do not contain every utterance in the Numbers corpus because IDIAP kept only "the sentences containing only the 30 most frequent words" and removed "the sentences containing truncated words". In the IDIAP file lists, there are 10441 train sentences, 3582 validation sentences, and 3621 test sentences. Our copy of the sentence 616/NU-61634.street turned out to be empty, leaving 3620 test sentences. We used the remaining sentences from the IDIAP list to create our noisy corpus.

When creating our hybrid HMM / multi-layer perceptron recognition testbed (described below), we removed additional sentences from the IDIAP lists of training, validation and test files due to transcription errors or other problems discovered by us or the authors of [6]. But when creating this noisy corpus we have stuck to the original IDIAP sentence lists (except for the removal of 616/NU-61634.street) so that the noisy corpus can be used by people who are using the IDIAP sentence lists.

For information about IDIAP's recognizer configuration for the clean Numbers version 1.3 corpus, and the corresponding word accuracy results, see [1].

The creation of the noisy Numbers data

When creating the noisy Numbers data, we followed these principles from the popular Aurora 2 task [2]:

Multiple noise types
Multiple SNRs
Both clean and noisy training conditions

We chose to use the RSG-10 [3] collection as a source of noises. This is the same collection that provided the noises for the NOISEX-92 corpus (apparently some of the ASR literature refers to these noises under the name NOISEX-92 rather than RSG-10). We chose this because it provides a variety of noises, and Herman Steeneken gave us permission to redistribute the RSG-10 noises that we used to create the noisy Numbers data. We selected ten of the RSG-10 noises for use, and downsampled them to 8 kHz to match the sampling rate of Numbers.

We added one of the ten noise types to each utterance at one of six signal-to-noise ratio (SNR) conditions: clean (no noise added), 20 dB, 15 dB, 10 dB, 5 dB, and 0 dB.

Noises used with the training set:

Speech babble
Factory floor noise 1
Car interior noise
F-16 cockpit noise
Factory floor noise 2
Buccaneer cockpit noise - 190 knots

Noises used with the validation set:

Speech babble
Factory floor noise 1
Car interior noise
F-16 cockpit noise
M109 tank noise
Buccaneer cockpit noise - 450 knots

Noises used with the test set:

Speech babble
Factory floor noise 1
Car interior noise
F-16 cockpit noise
Destroyer operations room noise
Leopard military vehicle noise
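
As a quick consistency check on the three lists above, the following Python sketch records them as a dictionary and separates the noises shared by all three sets from the ones specific to each set. The variable and function names are our own illustrative choices and are not part of the downloadable scripts.

```python
# Noise lists copied from the three lists above.
NOISES = {
    "train": [
        "Speech babble",
        "Factory floor noise 1",
        "Car interior noise",
        "F-16 cockpit noise",
        "Factory floor noise 2",
        "Buccaneer cockpit noise - 190 knots",
    ],
    "validation": [
        "Speech babble",
        "Factory floor noise 1",
        "Car interior noise",
        "F-16 cockpit noise",
        "M109 tank noise",
        "Buccaneer cockpit noise - 450 knots",
    ],
    "test": [
        "Speech babble",
        "Factory floor noise 1",
        "Car interior noise",
        "F-16 cockpit noise",
        "Destroyer operations room noise",
        "Leopard military vehicle noise",
    ],
}

# SNR conditions from the text; None stands for the clean condition.
SNRS_DB = [None, 20, 15, 10, 5, 0]


def shared_noises(noises):
    """Return the noises that appear in every set, in training-list order."""
    common = set.intersection(*(set(lst) for lst in noises.values()))
    return [n for n in noises["train"] if n in common]


def set_specific_noises(noises):
    """Return, for each set, the noises that are not shared by all sets."""
    common = set(shared_noises(noises))
    return {name: [n for n in lst if n not in common]
            for name, lst in noises.items()}


def all_noises(noises):
    """Return the sorted list of distinct noises across all sets."""
    return sorted({n for lst in noises.values() for n in lst})
```

Four noises are shared by all three sets and each set has two noises of its own, which accounts for the ten noises selected from RSG-10.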

Because the Numbers data was collected over telephone channels and the noises were not, we applied the MIRS filter from the ITU Software Tools Library [4][5] to the noises before adding them to the Numbers data. Our intent was to make the noises resemble recordings collected over the same sort of telephone channels as the Numbers data. We did not examine how well the MIRS filter matches those channels. The same MIRS filter was used in the creation of some of the Aurora 2 data [2]. We used Guenter Hirsch's FaNT tool to apply the MIRS filter.

To add the filtered noises to Numbers utterances at the chosen SNRs, we again used the FaNT tool. We added the noise with a non-frequency-weighted SNR calculation, via FaNT's '-d -m snr_4khz' options. (We did not use frequency weighting in the SNR calculation because the audio was already filtered to telephony bandwidth.)
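
To illustrate what adding noise at a target SNR without frequency weighting means, here is a minimal Python sketch. It is not FaNT's implementation: it assumes the signal levels are measured as plain mean power over the whole utterance, and FaNT's own level measurement may differ. All function names are hypothetical.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain for the noise so that 10*log10(P_speech / P_scaled_noise) == snr_db.

    Assumption: power is the plain mean over all samples, with no
    frequency weighting and no speech activity detection.
    """
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    return math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))


def add_noise(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the target SNR."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as the speech")
    segment = noise[:len(speech)]
    gain = noise_gain_for_snr(speech, segment, snr_db)
    return [s + gain * n for s, n in zip(speech, segment)]


def measured_snr_db(speech, noisy):
    """SNR in dB recovered from the clean speech and the noisy mixture."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))
```

Calling add_noise(speech, noise, 10) and then measured_snr_db on the result should give back a value very close to 10 dB, which is a convenient sanity check when re-creating noisy data.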

The noises were added to Numbers utterances starting at random points within the noise recordings (which are much longer than the individual Numbers utterances), resulting in more variety. The random offsets were read from files so that other sites can use the same offsets when they re-create the noisy Numbers corpus.
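
The idea of storing random offsets so that they can be reused can be sketched in Python as follows. The one-line-per-utterance text layout ("utterance-id offset") is a hypothetical format chosen for illustration; the offset files in the download may be laid out differently.

```python
import random


def choose_offset(noise_len, utt_len, rng):
    """Pick a random start sample so the utterance fits inside the noise."""
    if utt_len > noise_len:
        raise ValueError("noise recording is shorter than the utterance")
    return rng.randint(0, noise_len - utt_len)


def format_offsets(offsets):
    """Serialize {utterance_id: offset} as one 'id offset' pair per line."""
    return "".join(f"{utt_id} {offset}\n" for utt_id, offset in offsets.items())


def parse_offsets(lines):
    """Parse lines written by format_offsets (any iterable of lines, e.g. a file)."""
    offsets = {}
    for line in lines:
        if not line.strip():
            continue
        utt_id, offset = line.rsplit(None, 1)
        offsets[utt_id] = int(offset)
    return offsets


def noise_segment(noise, offset, utt_len):
    """Cut the stretch of noise that will be added to one utterance."""
    return noise[offset:offset + utt_len]
```

Writing the output of format_offsets to a file once, and reading it back with parse_offsets on later runs, keeps the noise segments identical from run to run and from site to site.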

Scripts, configuration files, and noise recordings that can be used to duplicate our noisy Numbers corpus can be downloaded here.

Past users of Numbers version 1.0 at ICSI often padded the data with 100 milliseconds of zeros at both the beginning and end of each utterance. This padding was intended, for example, to allow time for "filter warmup and warmdown" in feature extraction. For simplicity, we did not do this.

Hybrid HMM / multi-layer perceptron recognizer

Scripts and configuration files for performing speech recognition on OGI's Numbers 1.3 corpus using the Quicknet multi-layer perceptron (MLP) package for acoustic modeling and the noway hidden Markov model (HMM) decoder can be downloaded here. A tutorial is included which is meant to make it easy for people new to Quicknet to use the scripts. The scripts support recognition on both the original Numbers 1.3 corpus and the noisy version described above. Support for multi-stream speech recognition using a different feature vector for each stream is built in.

In November 2008 we added new duration modeling options that can be used to improve speech recognition accuracy (see the README_DURATION file that we have included). In September 2008 we added more error checking. In July 2008 we added significance testing tools for the matched pairs sign test and changed the README's discussion of significance testing to cover that test rather than MAPSSWE. In June 2008 we added an important clarification to the README about how to do multi-threaded MLP training properly, without choosing an mlp3_bunch_size value that is too high. In May 2008 we added the mergePosteriors tool, which is used to merge streams in a multi-stream system. Previously, the tool to do this was available only as a Linux binary, but the new mergePosteriors tool is a regular script like the other scripts.
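
For readers unfamiliar with the matched pairs sign test, here is a minimal Python sketch of a two-sided version. It assumes that the pairs are utterances and that each system is scored by its number of errors per utterance; the tools included with the recognizer may pair and score differently, so treat this as an illustration rather than a description of those tools.

```python
from math import comb


def sign_test_p_value(errors_a, errors_b):
    """Two-sided matched-pairs sign test on paired error counts.

    errors_a[i] and errors_b[i] are the error counts of systems A and B
    on the same utterance i. Ties are dropped, as is usual for a sign test.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both systems must be scored on the same utterances")
    a_better = sum(1 for a, b in zip(errors_a, errors_b) if a < b)
    b_better = sum(1 for a, b in zip(errors_a, errors_b) if a > b)
    n = a_better + b_better
    if n == 0:
        return 1.0  # the systems never differ
    k = min(a_better, b_better)
    # Probability of k or fewer wins out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value suggests that one system wins on more utterances than chance alone would explain.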

Baseline performance

Both the MLP-based recognizer described above and the GMM system described in [1] have about 5% word error rate (WER) on the Numbers 1.3 validation set, without added noise.
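
For reference, word error rate is conventionally the number of word substitutions, deletions and insertions divided by the number of reference words. The Python sketch below computes it with a standard edit distance. It is only an illustration: the scoring tools and text normalization behind the figures quoted here are not described on this page, and the function names are our own.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the reference word list into the hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(pairs):
    """WER over (reference, hypothesis) transcript pairs, as a fraction."""
    errors = 0
    ref_words = 0
    for ref, hyp in pairs:
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        ref_words += len(ref_tokens)
    if ref_words == 0:
        raise ValueError("no reference words to score against")
    return errors / ref_words
```

Multiply the returned fraction by 100 to express it as a percentage.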

The best result we know of is 2.0% WER. This is an unpublished result which was achieved by Arlo Faria in 2008 using Numbers (Numbers95) 1.0.

Faria used the SRI DECIPHER recognizer with cross-word triphones. Each state was represented with 16 full covariance Gaussians, and states were tied using decision-tree clustering. The front end used PLP and MFCC features in a tandem approach: PLP features were used as input to an MLP classifier, and the outputs of the MLP were then reduced to 21 dimensions using PCA and concatenated with MFCC features to form a 60-dimensional feature vector for the main ASR system. (For more on tandem, see the README file that comes with the hybrid HMM / MLP recognizer described above.) Three speaker-level normalization techniques (cepstral mean normalization, cepstral variance normalization and vocal tract length normalization) were used, and recognition was performed in a single pass without maximum likelihood or maximum a posteriori speaker adaptation.

It appears from Table 4 in [1], which compares Numbers version 1.0 to version 1.3, that version 1.0 was at least as difficult a task as version 1.3. So it is probably fair to compare Faria's result to the other results.

Bibliography

[1] Johnny Mariéthoz and Samy Bengio, "A New Speech Recognition Baseline System for Numbers 95 Version 1.3 Based on Torch". Technical report, IDIAP, 2004. Available online here or elsewhere on the IDIAP site.

[2] Hans-Guenter Hirsch and David Pearce, "The AURORA Experimental Framework for the Performance Evaluations of Speech Recognition Systems under Noisy Conditions", ISCA ITRW ASR2000. For ISCA members, the paper is also available here.

[3] H. Steeneken and F. Geurtsen, "Description of the RSG-10 noise database". Technical report, TNO Institute for Perception, The Netherlands, 1988.

[4] Simão Ferraz De Campos Neto, "The ITU-T Software Tool Library". International Journal of Speech Technology, Vol. 2, Number 4, May 1999. Available online through SpringerLink.

[5] ITU Software Tools Library (G.191 Annex A). Available online through the ITU.

[6] Joe Frankel, Mirjam Wester and Simon King, "Articulatory feature recognition using dynamic Bayesian networks". Computer Speech and Language, Vol. 21, Number 4, October 2007.