Additional results with short window lengths

Introduction

Since publishing the ASRU 2001 paper we have performed further HTK experiments where we compare the performance of LTLSS with a 1.024 s window length to other mean subtraction approaches which use short window lengths. This page gives details. We examine several kinds of mean subtraction, including cepstral mean subtraction (CMS) and log-DFT mean normalization (LDMN), as well as LTLSS. CMS is sometimes also known as cepstral mean normalization (CMN).

LDMN was not in our paper, but it is closely related to CMS and LTLSS. It is almost the same as CMS, except that the mean subtraction is done midway through MFCC or PLP feature calculation, on the log magnitude spectrum rather than on the cepstra. The paper "Training Issues and Channel Equalization Techniques for the Construction of Telephone Acoustic Models Using a High-Quality Speech Corpus" by Neumeyer et al. (1994) presented a strong theoretical motivation for LDMN, but in their ASR tests it did not perform significantly better than CMS. In our results presented below, however, LDMN sometimes gives a significant improvement over CMS, and sometimes performs no better. The paper "On the effects of short-term spectrum smoothing in channel normalization" by Avendano and Hermansky includes results from artificial tests which also support the idea that LDMN can be useful. The paper "Recognition Of Reverberant Speech Using Frequency Domain Linear Prediction" by Thomas, Ganapathy, and Hermansky (and their related technical report "Front-end for Far-field Speech Recognition based on Frequency Domain Linear Prediction") has ASR results showing LDMN outperforming CMS. Thomas et al. used some of the same test data we did, but they also present results for other test data. Also, they used Asela Gunawardana's newer "complex" version of the Aurora back end, while we used the original version.
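To make the distinction concrete, here is a minimal sketch (not our actual feature code) of where the two methods subtract the mean: CMS operates on the final cepstral coefficients, while LDMN operates on the log magnitude spectrum earlier in the pipeline, before the mel filterbank and DCT stages.

```python
import numpy as np

def cms(cepstra):
    """Cepstral mean subtraction: subtract the per-coefficient mean at
    the end of the feature pipeline, in the cepstral domain.
    `cepstra` is a (frames, coefficients) array."""
    return cepstra - cepstra.mean(axis=0)

def ldmn(log_spectra):
    """Log-DFT mean normalization: subtract the per-bin mean midway
    through the pipeline, on the log magnitude spectrum (e.g. 129 bins
    for a 256-point DFT), before mel filtering and the DCT."""
    return log_spectra - log_spectra.mean(axis=0)
```

In both cases the operation is the same (subtracting a long-term mean per dimension); only the domain in which it is applied differs.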

Note that by "window length" we are referring to the length of the DFT window function, not the length of time the mean was calculated over. When using the LTLSS code, we chose 32 ms as the short window length, since it is close to the 25 ms window length that we used for MFCC calculation. In every experiment mentioned on this page, we used MFCC coefficients C1-C12 and log frame energy, plus deltas and double-deltas. LDMN involved mean subtraction on 129 log magnitude spectral values (and log frame energy). The results on this page use the same test sets as in the ASRU 2001 paper. As mentioned here, we used the same training set except without the telephone bandwidth filtering.

Before mean subtraction, the test data was noise reduced using this tool. The noise magnitude spectrum estimates for noise reduction were calculated using all utterances for a given speaker in a given recording session. Compared to applying noise reduction to individual utterances, this provides more frames for estimating the noise spectrum for noise removal. Means for mean subtraction were also calculated over all the audio for a given speaker in a given recording session, rather than over a sliding window as in the ASRU 2001 and ICSLP 2002 papers. This made it easier to compare CMS and LDMN to the LTLSS code, since we did not have sliding window implementations of CMS and LDMN.
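The grouping scheme can be illustrated as follows (a simplified sketch; the data structures are hypothetical and are not from our actual scripts): the mean is computed once per (speaker, session) group, pooling all of that group's utterances, and then subtracted from each utterance in the group.

```python
import numpy as np
from collections import defaultdict

def session_level_mean_subtraction(utterances):
    """Subtract from each utterance a mean computed over all utterances
    from the same (speaker, session) pair, rather than per utterance.
    `utterances` is a list of (speaker, session, features) tuples,
    where features is a (frames, dims) array."""
    groups = defaultdict(list)
    for speaker, session, feats in utterances:
        groups[(speaker, session)].append(feats)
    # One pooled mean per speaker/session group.
    means = {key: np.concatenate(mats, axis=0).mean(axis=0)
             for key, mats in groups.items()}
    return [(spk, ses, feats - means[(spk, ses)])
            for spk, ses, feats in utterances]
```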

Results

When we present results on this page, we will give first the near mic meeting room digits WER (over 7704 words), then the far (tabletop) mic meeting room digits WER (over the same 7704 words), and then the reverberated TIDIGITS WER (over a 9918-word subset of TIDIGITS). The three WERs will be separated by slash (/) characters. The WERs are in percent (%). The baseline WER for noise reduction alone, without mean subtraction, was 3.7 / 23.4 / 23.2.

Using the experimental design described above, our best result on the far mic meeting room digits was obtained using the LTLSS code with a 32 ms window length and a trimmed mean. CMS, LDMN, and the LTLSS code with a 1.024 s window length all did worse (although the 1.024 s window length did not do worse if we used a sliding averaging window instead). Please see the bolded text below for details.

Using the LTLSS code with a 1.024 s window length gave 2.8 / 8.6 / 6.2. Using the LTLSS code with a 32 ms window length gave 2.8 / 12.4 / 6.8. Using LDMN gave 4.6 / 21.8 / 6.5. Using CMS gave 4.6 / 23.4 / 7.5. For the far mic digits, the improvement of the LTLSS code with 1.024 s over the LTLSS code with 32 ms is statistically significant by a difference of proportions test (P < 0.0001), as is the improvement of the LTLSS code with 32 ms over LDMN (P < 0.0001), as is the improvement of LDMN over CMS (P < 0.01). For the reverberated TIDIGITS, the differences among the first three methods (LTLSS 1.024 s, LTLSS 32 ms, LDMN) were not significant at the 0.01 level; the improvement of the LTLSS code with 1.024 s over CMS was significant (P < 0.0002), the improvement of the LTLSS code with 32 ms over CMS was not, and the improvement of LDMN over CMS was significant (P < 0.005).
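A difference of proportions test of the kind used here can be sketched as follows (a simplified version, not our exact implementation; it treats word errors as independent Bernoulli trials, which ignores within-utterance correlation):

```python
import math

def diff_of_proportions_p(errors1, errors2, n):
    """Two-sided difference-of-proportions z-test comparing two word
    error counts scored on the same number of words n."""
    p1, p2 = errors1 / n, errors2 / n
    pooled = (errors1 + errors2) / (2 * n)              # pooled proportion
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))     # pooled standard error
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))             # two-sided p-value
```

For example, comparing error counts corresponding to roughly 8.6% and 12.4% WER on 7704 words yields a p-value far below 0.0001, consistent with the significance claims above.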

We were able to greatly improve the performance of the LTLSS code with a 32 ms window length by changing the code to use a trimmed mean which excludes small log spectral magnitude values from the mean calculation. This is intended to focus the mean calculation on portions of the spectrogram containing speech energy. You can download the source code here. For all results reported on this page, when we used a trimmed mean we calculated it from the largest 50% of values (that is, only the largest 50% of values for each spectral bin were used to calculate the mean for that bin). Using a 32 ms window length with a trimmed mean gave 2.4 / 7.6 / 5.2, which is better than the results above for the LTLSS code with the 1.024 s window length. The difference between 7.6% and 8.6% on the meeting room tabletop mic digits is statistically significant using a difference of proportions significance test (P < 0.02), as is the difference between 5.2% and 6.2% on the reverberated TIDIGITS (P < 0.002).

The results here report a 7.1% WER on tabletop mic digits for LTLSS with a 1.024 s DFT window length and a 12.288 s sliding averaging window (for the three test sets the results are 2.8 / 7.1 / 3.6), but the difference between 7.1% and 7.6% is not statistically significant using a difference of proportions test (although we cannot rule out that a different significance test or a larger test set might show a statistically significant difference). If we use a 32 ms window length with a trimmed mean and a 12.288 s sliding averaging window, the results are 2.5 / 7.5 / 5.4 (note that the reverberated TIDIGITS result is worse than for the 1.024 s window length with the 12.288 s sliding window).
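The per-bin trimmed mean can be sketched as follows (a simplified illustration, not the actual LTLSS source code): for each spectral bin, only the largest fraction of that bin's values across frames enters the mean, so low-energy (likely non-speech) frames are excluded bin by bin.

```python
import numpy as np

def trimmed_mean_per_bin(log_spec, keep_fraction=0.5):
    """For each spectral bin (column) of a (frames, bins) log magnitude
    spectrogram, compute the mean using only the largest `keep_fraction`
    of that bin's values, so the mean is dominated by frames carrying
    speech energy in that bin."""
    frames, _ = log_spec.shape
    keep = max(1, int(round(frames * keep_fraction)))
    # Sort each column in descending order and average the top `keep` values.
    sorted_desc = -np.sort(-log_spec, axis=0)
    return sorted_desc[:keep].mean(axis=0)
```

Note that because the trimming is done per bin, the set of frames contributing to the mean can differ from one bin to the next, unlike a frame-level speech/nonspeech decision.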

When we used the trimmed mean approach with LDMN, we obtained 3.6 / 10.8 / 5.0. For the tabletop mic digits, this is considerably worse than using the LTLSS code with a 32 ms window and a trimmed mean (and the difference is statistically significant, P < 0.0001). When we used the trimmed mean approach with CMS, we obtained 3.2 / 11.0 / 8.1. The trimmed mean LDMN result on the reverberated TIDIGITS is much better than the trimmed mean CMS result (and this is statistically significant, P < 0.0001). The trimmed mean LDMN result on the near mic digits is worse than the trimmed mean CMS result, but this was not statistically significant according to our difference of proportions test (although it might be according to a more powerful test).

Instead of using a trimmed mean, we also tried using a frame-level speech detector (we used the speech detector from the Qualcomm-ICSI-OGI package available here) and calculating the mean over only those frames which were judged to contain speech. The fact that this can improve CMS performance was already known. For example, this is shown in the 1996 paper "Deconvolution of telephone line effects for speech recognition" by Mokbel et al. We obtained bigger performance improvements than Mokbel et al., perhaps because we were not working with telephone data. Using the speech detector, we obtained 3.3 / 11.8 / 5.7 for LDMN and 3.5 / 13.8 / 6.7 for CMS. With the speech detector, LDMN's improvement over CMS for the tabletop mic digits is statistically significant (P < 0.0002) but the WER is higher than with the trimmed mean. We did not try using a speech detector when using the LTLSS code with a 32 ms window length, although it might be interesting to try it.
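The speech-detector-based mean can be sketched like this (a simplified illustration; the detector itself is external, so here the frame-level speech/nonspeech decisions are just a given boolean array rather than the output of the Qualcomm-ICSI-OGI detector):

```python
import numpy as np

def speech_frame_mean(features, speech_flags):
    """Compute the mean for mean subtraction using only frames flagged
    as speech by a frame-level detector. `features` is a (frames, dims)
    array; `speech_flags` is a per-frame boolean sequence."""
    mask = np.asarray(speech_flags, dtype=bool)
    if not mask.any():
        # Fall back to all frames if the detector found no speech.
        return features.mean(axis=0)
    return features[mask].mean(axis=0)
```

Unlike the per-bin trimmed mean, this makes a single whole-frame decision, so every spectral bin of a frame is either included or excluded together.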

It was plausible that the trimmed mean would outperform the frame-level speech detection because the trimmed mean is bin-level rather than frame-level, and that might be useful since speech is not spectrally uniform. It was also plausible that the use of the frame-level speech detection would outperform the trimmed mean, because speech detection can classify in a more sophisticated way than simply considering how large values are (for example, using a trained statistical model of multiple features). So which method is superior was largely an experimental question, and might be different on different data sets or if a different speech detection approach was used. It should be noted that the speech detection model was not trained on meeting data, which could have lowered its performance; there is some discussion of this in the README file for the Qualcomm-ICSI-OGI package.

Other unpublished results, not given in detail here (please contact David Gelbart if you would like a copy):

We have observed that, without having to use a trimmed mean or speech detection, using noisy training data (various background noises added at various SNRs) made the performance of the 32 ms window on the far mic data similar to that of the 1.024 s window.

We have results where the mean subtraction was performed on each utterance independently, instead of grouping utterances. On this topic, see also the Thomas et al. papers mentioned above.

We have results on in-car speech recordings (the Aurora 3 subset of SpeechDatCar-German), in which using 32 ms window length with a trimmed mean did not perform as well (relative to other approaches) as in the above results.