Double the Trouble: Handling Noise and Reverberation in Far-Field Automatic Speech Recognition

This web page relates to the ICSLP 2002 paper by Gelbart and Morgan.

The noise reduction code we used can be downloaded here. We kept the waveform output of the noise reduction code in floating point rather than converting back to 16-bit samples. This is important because the spectral floor applied by the noise reduction code may not be fully preserved once the output is converted to 16-bit samples. There is more discussion of this issue in the Readme.txt file that accompanies the noise reduction code.
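As a rough illustration (this sketch is not taken from the noise reduction code), converting a very low-level residual floor to 16-bit samples can round most of it away. The -100 dBFS floor level and the sample count below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the denoised signal contains a very low-level residual "floor"
# (here around -100 dBFS) that the noise reduction deliberately retains
# in otherwise silent regions.
floor = 1e-5 * rng.standard_normal(16000)

# Writing 16-bit samples quantizes with a step of 1/32768 (about 3e-5 of
# full scale), so content below half a step rounds to zero and the floor
# largely vanishes; a floating-point file keeps it intact.
quantized = np.round(floor * 32767) / 32767.0

print("nonzero samples before quantization:", np.count_nonzero(floor))
print("nonzero samples after quantization: ", np.count_nonzero(quantized))
```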

Code for the long-term log spectral subtraction is online here, along with other relevant information.
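For readers who want a feel for the technique, here is a minimal sketch of mean subtraction in the log magnitude spectral domain, assuming STFT analysis and overlap-add resynthesis. The window and hop lengths are placeholders, not the settings used in the paper; see the linked page for the actual code.

```python
import numpy as np
from scipy.signal import stft, istft

def long_term_log_spectral_subtraction(x, fs, win_sec=1.0, hop_sec=0.5):
    nperseg = int(win_sec * fs)
    noverlap = nperseg - int(hop_sec * fs)
    f, t, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)

    # Work on the log magnitude spectrum; keep the phase for resynthesis.
    log_mag = np.log(np.abs(X) + 1e-12)
    phase = np.angle(X)

    # Subtract the per-frequency mean of the log spectrum over the whole
    # utterance, which removes a slowly varying convolutional component.
    log_mag -= log_mag.mean(axis=1, keepdims=True)

    Y = np.exp(log_mag) * np.exp(1j * phase)
    _, y = istft(Y, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y
```

The idea is similar in spirit to cepstral mean subtraction, but applied with long analysis windows at the waveform level.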

The SNR of the far-field microphone data was estimated to be about 9 dB on average, but the use of noise reduction resulted in only a small improvement in word recognition accuracy. After we submitted the paper, visiting researcher Laura Docio performed experiments that shed additional light on this. In the new experiments, the static log frame energy feature was omitted from the feature vector (the delta and delta-delta log frame energy features were kept). This makes the features insensitive to a volume-level mismatch between training and test. The new results are shown in the final column of the table below; noise reduction made a greater difference to performance when the static log frame energy was not used.
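A rough sketch of the feature change, assuming a generic cepstra-plus-log-energy front end (the hypothetical delta window and feature layout below are illustrative, not the front end actually used in these experiments): the static log energy column is dropped while its delta and delta-delta are kept.

```python
import numpy as np

def delta(feats, width=2):
    """Simple regression-based delta over +/- `width` frames."""
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(w * (padded[width + w:len(feats) + width + w] -
                   padded[width - w:len(feats) + width - w])
              for w in range(1, width + 1))
    return num / (2 * sum(w * w for w in range(1, width + 1)))

def assemble_features(cepstra, log_energy):
    """cepstra: (frames, n_ceps); log_energy: (frames,)."""
    static = np.hstack([cepstra, log_energy[:, None]])   # cepstra + energy
    d1 = delta(static)
    d2 = delta(d1)
    # Drop the static log energy column (the last one) but keep its
    # delta and delta-delta, so a volume-level mismatch between training
    # and test no longer shifts any static feature.
    return np.hstack([static[:, :-1], d1, d2])
```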

In the table below, the noise magnitude spectrum estimates used for noise reduction were calculated over all utterances from a given speaker in a given recording session, rather than separately for each utterance as in the paper. This makes more no-speech frames available for calculating the noise spectrum estimate.
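The sketch below illustrates the idea of pooling no-speech frames across all of a speaker's utterances in a session to form a single noise magnitude spectrum estimate. The energy-percentile speech/no-speech decision and the frame settings are placeholders; they are not how the noise reduction code linked above works.

```python
import numpy as np
from scipy.signal import stft

def session_noise_spectrum(utterances, fs, nperseg=512):
    """utterances: list of 1-D waveforms from one speaker/session."""
    no_speech = []
    for x in utterances:
        _, _, X = stft(x, fs=fs, nperseg=nperseg)
        mag = np.abs(X)                          # (freq bins, frames)
        energy = (mag ** 2).sum(axis=0)
        threshold = np.percentile(energy, 20)    # crude no-speech detector
        no_speech.append(mag[:, energy <= threshold])
    # Pooling across the whole session gives many more no-speech frames
    # than any single utterance would provide on its own.
    pooled = np.hstack(no_speech)
    return pooled.mean(axis=1)                   # one estimate per freq bin
```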

                                                   Near          Table top    Table top microphone,
                                                   microphones   microphone   without log energy
Baseline                                           4.1           26.3         24.4
Noise reduction alone                              3.7           23.4         15.0
Long-term log spectral subtraction alone           3.0            8.1         11.0
Noise reduction followed by long-term
  log spectral subtraction                         2.8            7.1          8.0

The table gives results as percent word error rates.
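For reference, the word error rate is the total of substitution, deletion, and insertion errors from a minimum-edit-distance alignment, divided by the number of words in the reference transcription; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """Percent WER via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```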

We have observed that when the training data is much cleaner than the test data, performance may be better if noise reduction is applied only to the test data and not to the training data. However, for simplicity (we were more interested in shedding light on experimental questions than in obtaining the best possible word recognition accuracy on the test data), we applied noise reduction to both the training and test data in the paper and in the table above.