This page explains how to download and set up the noisy ISOLET data and the ASR (automatic speech recognition) back ends used in the publications listed above.
If you publish any results using the data or back ends on this page, please let David Gelbart know so that your publication can be added to the list above.
Note that this page links to three different ASR back ends: two that are built with HTK and use Gaussian mixture models (GMMs) and one that is built with Quicknet / SPRACHcore and uses multi-layer perceptrons (MLPs). The INTERSPEECH 2008 paper compares the GMM and MLP approaches.
We have created another web page which provides a noisy version of the OGI Numbers corpus and an ASR back end for that corpus.
Quoting the Readme.txt file for OGI's ISOLET corpus: "ISOLET is a database of letters of the English alphabet spoken in isolation. The database consists of 7800 spoken letters, two productions of each letter by 150 speakers. It contains approximately 1.25 hours of speech [actually, 1.25 hours of audio including before-word and after-word pauses]. The recordings were done under quiet, laboratory conditions with a noise-canceling microphone."
One reason ISOLET was attractive for our work is that it provides a phonetically broad ASR task through its inclusion of all letters of the alphabet. Another attraction was the relatively small size of the corpus, which lowers the turnaround time for experiments.
To help provide meaningful experimental results, and following Karnjanadecha and Zahorian's example in [2], we used 5-way cross-validation to increase the experimental usefulness of the corpus without increasing the amount of speech data. For this we followed the division of ISOLET into 5 parts that is defined on the ISOLET CD-ROM: in each of the 5 possible ways, we trained on 4 of the 5 parts and tested on the remaining part. (Our MLP-based ASR system is designed to use one of these five possibilities as development data for tuning decoder parameters.)
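The cross-validation scheme above can be sketched in a few lines (a minimal illustration; the part names are placeholders for the ISOLET1 through ISOLET5 division defined on the ISOLET CD-ROM):

```python
# Sketch of the 5-way cross-validation: each fold trains on 4 of the
# 5 ISOLET parts and tests on the remaining 1.

def cross_validation_folds(parts=("isolet1", "isolet2", "isolet3",
                                  "isolet4", "isolet5")):
    """Yield (train_parts, test_part) for each of the 5 folds."""
    for i, test_part in enumerate(parts):
        train = tuple(p for j, p in enumerate(parts) if j != i)
        yield train, test_part

folds = list(cross_validation_folds())
```

In the MLP-based system, one of these five folds would be set aside as development data rather than used for final evaluation.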
When creating the noisy ISOLET data, we followed the design principles of the popular Aurora 2 task [1].
We chose the RSG-10 [3] collection as our source of noises. This is the same collection that provided the noises for the NOISEX-92 corpus (some of the ASR literature apparently refers to these noises under the name NOISEX-92 rather than RSG-10). We chose it because it provides a variety of noises, because we were able to obtain it without charge, and because we obtained permission to redistribute the noises for our purpose without charge.
We selected eight of the RSG-10 noises for use, and downsampled them to 16 kHz to match the sampling rate of ISOLET. See this page for more information about the noises.
To add noise to ISOLET utterances at chosen SNRs, we used Guenter Hirsch's FaNT tool. We used A-weighted SNR calculation. (There is one exception to this: in the INTERSPEECH 2005 paper we used an earlier version of the noisy data for which we used unweighted full-bandwidth SNR. Afterwards, we decided that A-weighted SNR is better suited to our goals.) For more information about how the FaNT tool calculates SNR, see the tool's manual.
The noises were added to ISOLET utterances at random starting points within the noise recordings: a different random starting point was picked for each ISOLET utterance (the noise recordings are much longer than the individual ISOLET utterances). This is a feature of the FaNT tool.
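The mixing step can be illustrated roughly as follows. This is our own simplified sketch, not the FaNT implementation, and it uses plain unweighted SNR rather than the A-weighted SNR of the released corpus; the function name and list-of-floats sample format are ours:

```python
import math
import random

def add_noise_at_snr(speech, noise, snr_db, rng=random):
    """Mix a randomly positioned noise segment into a speech signal
    at a target (unweighted) SNR. The noise recording must be at
    least as long as the utterance."""
    # Pick a random starting point within the longer noise recording,
    # independently for each utterance, as FaNT does.
    start = rng.randrange(len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]

    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in segment) / len(segment)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, segment)]
```

An A-weighted version would apply the A-weighting filter to both signals before computing the two power terms; see the FaNT manual for how the tool itself calculates SNR.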
The ISOLET data comes divided into 5 equal-sized parts (ISOLET1 through ISOLET5). To create our degraded ISOLET corpus, we added one of the eight noise types to each utterance at one of 6 different signal-to-noise ratios: clean (no noise added), 20 dB, 15 dB, 10 dB, 5 dB, and 0 dB. Three of the noise types were used for all 5 parts of ISOLET, and the remaining 5 noise types were used for 1 part each. This means that for each of the 5 different divisions of ISOLET into training and test data that we use, there are 3 matched noise types (found in both training and test, as in test set A in the Aurora 2 corpus) and 1 mismatched noise type (found only in test, as in test set B in the Aurora 2 corpus).
Thus the design of the noisy version of the ISOLET corpus was heavily influenced by Aurora 2. However, we believe that the ISOLET corpus makes it easier than Aurora 2 to make a distinction between development data used to make design choices or adjust system parameters and evaluation data used to report final results, since for ISOLET there are five possible ways to divide the corpus into training and test data and one of the five can be used as development data.
Scripts, configuration files, and noise recordings that can be used to duplicate our noisy ISOLET corpus can be downloaded here. You must have a copy of the original ISOLET corpus to use these.
The scripts and configuration files for the HTK-based recognizer used in the 2007 Speech Communication paper and the INTERSPEECH 2008 paper can be downloaded here. Compared to the older version that we used in the INTERSPEECH 2005 paper, this version adds mixup (which might improve training robustness), adds a pause model (which might improve recognition performance), and has better documentation and a more straightforward command line interface. With this version, we have been able to reliably train larger acoustic models than with the old version.
The downloadable files were updated in April 2008 with an improved README (including more discussion of the appropriate acoustic model size) and a script showing start-to-finish examples of creating the noisy version of ISOLET, creating features for the clean and noisy versions, and running HTK recognition and training. The file README_MFCC contains a note from March 2008 about MFCC calculation.
For the INTERSPEECH 2008 paper, we reported results on the full task and on a vowel subset (the words a, e, i, o, u, and y) of the full task. When using the vowel subset, we trained word HMMs for only the words in the vowel subset.
For the scripts and configuration files for the multi-layer perceptron (Quicknet) / hidden Markov model (noway decoder) recognizer used in the INTERSPEECH 2008 paper and Gelbart's PhD thesis, see here.
For the INTERSPEECH 2008 paper, we reported results on the full task and on a vowel subset (the words a, e, i, o, u, and y) of the full task. For the vowel subset experiments, we used an MLP trained on the full task with a language model restricted to the vowel subset.
For the experiments in our INTERSPEECH 2005 paper, we used an HTK configuration based on the configuration used in [2], the scripts and parameter files for which were kindly shared with us by the authors. We initially experienced occasional model training errors when using our auditory features (although not for MFCCs), so for greater training robustness we decreased the number of model parameters compared to [2], by using four-state HMMs with each state modeled by a mixture of three diagonal-covariance Gaussians, and we added a variance floor. (With this new configuration, we still experienced a few "zero occ" model training errors when using auditory feature vectors without DCT or KLT applied. This presumably relates to the limited amount of training data. None of these errors were for experiments included in the paper.) When using MFCC features and training and testing on clean ISOLET data without noise added, this system gave 92.7% accuracy (averaged over the five different divisions of the corpus utterances into train and test data).
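The idea behind the variance floor can be illustrated with a small sketch. In HTK-style training the floor is derived from the global per-dimension variance of the training data; the 0.01 scale factor below is illustrative only, not the value we used:

```python
def apply_variance_floor(mixture_variances, global_variances, scale=0.01):
    """Clamp each Gaussian's diagonal variances at `scale` times the
    per-dimension global variance of the training data, so that no
    component's variance collapses toward zero on limited data.
    Illustrative sketch of HTK-style variance flooring."""
    floors = [scale * g for g in global_variances]
    return [[max(v, f) for v, f in zip(row, floors)]
            for row in mixture_variances]
```

Flooring like this guards against the degenerate narrow Gaussians that small training sets can produce, which is why it helped the training robustness described above.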
The HTK-based recognizer (HTK train/test script and HTK configuration files) we used for our INTERSPEECH 2005 paper, together with a script for statistical significance testing, can be downloaded here. We are providing this download link as a form of documentation for the INTERSPEECH 2005 paper. If you are simply looking for an HTK-based recognizer for ISOLET, we recommend you use the newer version from our 2007 Speech Communication paper.
David Gelbart organized the corpus and wrote this page. The selection of ISOLET as the source for clean data, and the choice of noises, was carried out by David Gelbart, Werner Hemmert, Marcus Holmberg, and Nelson Morgan. Hans-Guenter Hirsch wrote the FaNT noise adding tool that was used to create noisy data, and added several features to it for this project. Herman Steeneken gave us permission to redistribute the RSG-10 noises that we used to create the noisy data.
The original versions of the HTK training and recognition scripts and configuration files were from the project described in [2] and were authored by Montri Karnjanadecha. Many thanks to him, Penny Hix, and Stephen Zahorian for sharing these with us. David Gelbart adapted the scripts to make them easier to use with the ISOLET benchmark described on this page. Variance flooring during training (following the model of the variance flooring in the Aurora 2 training scripts) was added by Chuck Wooters. Mixup was added by Hans-Guenter Hirsch.
[1] H. G. Hirsch and D. Pearce, The AURORA Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions, ISCA ITRW ASR2000. For ISCA members, the paper is also available here.
[2] Montri Karnjanadecha and Stephen Zahorian, Signal Modeling for High-Performance Isolated Word Recognition, IEEE Transactions on Speech and Audio Processing, September 2001.
[3] H. Steeneken and F. Geurtsen, Description of the RSG-10 Noise Database, technical report, TNO Institute for Perception, The Netherlands, 1988.