Evaluating Long-term Log Spectral Subtraction for Reverberant ASR

This web page provides information, data and source code related to the ASRU 2001 paper by Gelbart and Morgan, which can be found here. Much of the material on this page is from work done after that paper was written. If you are planning to use the long-term log spectral subtraction (LTLSS) method, you are invited to contact David Gelbart so he can make sure you are up to date with the right code, citations and so on (and he would be delighted to include your results in the "Other results with LTLSS and LDMN" section below).

Matlab and C++ code

A Matlab implementation of LTLSS is available here (files last modified May 2010).

A C++ implementation is available here (files last modified January 2008). The C++ version runs faster and uses less memory. The C++ version also supports the use of a trimmed mean or speech detection (see the section below titled "Speech recognition results with HTK: corrections and new results"), although these could be added to the Matlab code without much effort. There are some other feature differences between the Matlab and C++ code; see the Readme files that come with the code for details.
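For readers who just want the gist before downloading anything, here is a minimal sketch of the basic LTLSS idea in Python/NumPy (not the language of our released code): take a long-window STFT, subtract the mean log magnitude spectrum from every frame, and resynthesize with the original phase. The window length, overlap and flooring are illustrative choices, and the sketch omits the options handled by the released implementations.

    import numpy as np
    from scipy.signal import stft, istft

    def ltlss(x, fs, win_sec=1.024):
        # Long-window STFT (Hann window, 50% overlap by default).
        nperseg = int(round(win_sec * fs))
        f, t, X = stft(x, fs=fs, nperseg=nperseg)
        # Log magnitude spectrum of each frame, with a small floor.
        logmag = np.log(np.abs(X) + 1e-10)
        # Long-term mean log magnitude, per frequency bin.
        mean_logmag = logmag.mean(axis=1, keepdims=True)
        # Subtract the mean in the log domain; keep the original phase.
        Y = np.exp(logmag - mean_logmag) * np.exp(1j * np.angle(X))
        # Overlap-add resynthesis back to a waveform.
        _, y = istft(Y, fs=fs, nperseg=nperseg)
        return y[:len(x)]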

For the noise reduction and speech detection source code that was used in the ICSLP 2002 paper, see here. Note that this source code can be combined with other algorithms. It is not tied to LTLSS in any way.

There is also a modified version of LTLSS which assigns a minimum-phase phase spectrum to the mean log magnitude spectrum (this was suggested by Birger Kollmeier and Hynek Hermansky). One way to view the use of a minimum-phase phase spectrum is as a way to model the fact that a room impulse response decays over time (so that most of its energy lies in the earlier part of the response), since minimum phase corresponds to minimum energy delay. The code and experimental results are included with the Matlab implementation of LTLSS (see the Readme.txt file).
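For reference, a minimum-phase phase spectrum can be obtained from a log magnitude spectrum by the standard real-cepstrum (homomorphic) construction. The sketch below, in the same illustrative Python style, shows that construction only; it is not the code included with the Matlab implementation, which also handles how the resulting phase is applied inside LTLSS.

    import numpy as np

    def minimum_phase_phase(log_mag_half, nfft):
        # log_mag_half holds the log magnitude for bins 0..nfft/2 (inclusive);
        # nfft is assumed even. Rebuild the full, even-symmetric log magnitude.
        full = np.concatenate([log_mag_half, log_mag_half[-2:0:-1]])
        # Real cepstrum of the magnitude spectrum.
        c = np.fft.ifft(full).real
        # Fold the cepstrum: keep c[0] and c[N/2], double the causal part,
        # zero the anti-causal part.
        folded = np.zeros(nfft)
        folded[0] = c[0]
        folded[1:nfft // 2] = 2.0 * c[1:nfft // 2]
        folded[nfft // 2] = c[nfft // 2]
        # The imaginary part of the resulting log spectrum is the minimum-phase
        # phase (the negative Hilbert transform of the log magnitude).
        phase_full = np.imag(np.fft.fft(folded))
        return phase_full[:nfft // 2 + 1]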

There is also an (unpublished) version of the algorithm which adds a post-processing stage to remove artifacts. This post-processing could be added to the basic algorithm or to the phase-modified version mentioned above. Due to other commitments, we have not run ASR tests of it and have no plans scheduled to do so. If you would like to read about the idea or obtain code, see here.

Impulse responses and other acoustic measurements

You can download the Bell Labs varechoic chamber impulse response we used to add simulated reverberation to TIDIGITS in the ASRU 2001 paper here, together with other impulse responses collected in the same room. We used the impulse response corresponding to microphone number 1 with the panels 43% open.

You can download impulse responses (and other acoustic measurements) measured in the ICSI meeting room here. The ICSI measurements were made after the paper was written so they are not used in the paper.

If you are reading this because you are interested in doing ASR tests of reverberant speech, you might like the Aurora 5 benchmark (see here and here), which includes speech reverberated with a variety of impulse responses as well as speech from the ICSI meeting room. The benchmark is available from ELRA.

Meeting room digits data

You can download our tabletop mic digits recordings here. (That link points to a newer, much larger set of recordings than the set that was used in the Gelbart and Morgan ASRU 2001 and ICSLP 2002 papers. Because there are more utterances and the utterance segmentation is performed differently, results can't be compared directly to the results in those papers.)

The Aurora 5 benchmark, mentioned above, also makes use of recordings from this set.

If you want to use the same data as was used in the ASRU 2001 and ICSLP 2002 papers, please contact David Gelbart for a copy.

Audio samples

Audio samples of LTLSS output are available here.

Speech recognition results with HTK: corrections and new results

The results published in Table 3 of the ASRU 2001 paper were incorrect for window lengths 0.032 s and 0.256 s, due to unnoticed clipping (saturation caused by values too large to represent as 16-bit integers) in the output waveforms. Clipping also caused a slight error in the result for window length 0.512 s. Our hypothesis is that shorter window lengths produce louder outputs because with a short window there may be little or no speech in some frames, which can pull some values in the mean vector negative; subtracting a negative mean amplifies those frequency bins, and the amplified output can then exceed the 16-bit range. The corrected results below were obtained by writing the processed waveforms as floating point values instead of as 16-bit integers (a short illustration of this fix follows the table). Note that by "window length" we mean the length of the DFT window function, not the length of time the mean was calculated over.

Window length   Near mic WER (%)   Far mic WER (%)
0.032 s         2.63               19.41
0.256 s         2.66               11.21
0.512 s         3.05                8.94
1.024 s         2.99                7.90
2.048 s         3.30                7.85
4.096 s         3.84                8.42
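As noted above, the corrected results were obtained by keeping the processed waveforms in floating point instead of forcing them into the 16-bit integer range. The sketch below shows one way to do this with scipy; it is just an illustration of the fix, not necessarily how our own scripts wrote their output.

    import numpy as np
    from scipy.io import wavfile

    def save_float(path, y, fs):
        # scipy writes an IEEE-float WAV when given float32 data, so large
        # LTLSS output values are stored without saturation.
        wavfile.write(path, fs, y.astype(np.float32))

    def save_int16(path, y, fs):
        # Naive 16-bit conversion: samples outside [-32768, 32767] saturate,
        # which is the clipping that distorted the original Table 3 results.
        wavfile.write(path, fs, np.clip(y, -32768, 32767).astype(np.int16))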

In our experiment with the Aurora HTK recognition system in the ASRU paper, we used the Aurora clean training set, which is telephone-bandwidth filtered. That filtering introduced an extra mismatch with the test data. Using the same training data without the filtering gives a much better baseline (26.3% WER for the tabletop mic, instead of 41.4% WER) but does not appear to have much effect on the results with log spectral subtraction. In the post-ASRU results that are included and linked to below, we did not use the telephone bandwidth filtering. In our results with the SRI system, we never used the telephone bandwidth filtering.

Window lengths above 32 ms remain necessary for optimal performance with the far microphone in the corrected results above. However, since publishing the ASRU 2001 paper we have performed further experiments in which we were able to greatly improve the performance of short window lengths. Using a 32 ms window length with a trimmed mean performed better than (or comparably to) LTLSS with a 1.024 s window length. The 32 ms window length results were considerably better than cepstral mean subtraction (CMS), even when we used a trimmed mean with CMS. See here for the results. At that link, we also show good results for the very simple log-DFT mean normalization (LDMN) algorithm proposed by Neumeyer et al. For more LDMN results, see the paper by Thomas, Ganapathy, and Hermansky mentioned below.
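To make the trimmed-mean variant concrete, here is a short-window (32 ms) log spectral subtraction sketch in the same illustrative Python style. The trim proportion is a placeholder value, not the setting used in our experiments; the released C++ code is the reference for what we actually ran.

    import numpy as np
    from scipy.signal import stft, istft
    from scipy.stats import trim_mean

    def short_window_lss(x, fs, win_sec=0.032, trim=0.2):
        nperseg = int(round(win_sec * fs))
        f, t, X = stft(x, fs=fs, nperseg=nperseg)
        logmag = np.log(np.abs(X) + 1e-10)
        # Trimmed mean over time, per frequency bin: the lowest and highest
        # values (e.g. frames with little or no speech) are discarded before
        # averaging, so they cannot drag the mean estimate down.
        mean_logmag = trim_mean(logmag, trim, axis=1)[:, np.newaxis]
        Y = np.exp(logmag - mean_logmag) * np.exp(1j * np.angle(X))
        _, y = istft(Y, fs=fs, nperseg=nperseg)
        return y[:len(x)]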

We have observed that adding noise to the clean training data can greatly improve performance on reverberant test data. Using the clean TIDIGITS training data, without using any sort of noise reduction or mean subtraction, we obtain a 26.3% WER for the tabletop mic digits and 21.8% WER for the reverberated TIDIGITS. If we add background noises of various types to that training data at various SNRs, still without using any sort of noise reduction or mean subtraction, we obtain a 15.7% WER for the tabletop mic and 13.8% WER for the reverberated TIDIGITS. This raises the interesting possibility that the artifact noises introduced by LTLSS might actually be helping performance when the LTLSS is used with clean training data and reverberant test data. For more about these artifacts see here.
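For readers who want to try this kind of noise-added training themselves, the sketch below shows a common way to mix a background noise recording into a clean utterance at a chosen SNR. The noise types and SNR values we used are not reproduced here, so both are placeholders.

    import numpy as np

    def add_noise_at_snr(clean, noise, snr_db):
        # Repeat or truncate the noise so it covers the whole utterance.
        noise = np.resize(noise, clean.shape)
        p_clean = np.mean(clean.astype(np.float64) ** 2)
        p_noise = np.mean(noise.astype(np.float64) ** 2) + 1e-12
        # Scale the noise so that 10*log10(p_clean / p_noise_scaled) == snr_db.
        gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + gain * noise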

ASRU 2001 SRI system results

The table below is an expanded version of the results table for the SRI DECIPHER system in the ASRU paper. The baseline system here is stronger than our HTK baseline. The results are given as word error rates (WER) in percent, with two values per cell: the value before the slash is for no speaker adaptation and the value after the slash is for MLLR speaker adaptation. The rows labeled LTLSS are the results when long-term log spectral subtraction was applied to both training and test data.

In the clean TIDIGITS and MR near mic test cases, the use of LTLSS lowered performance. This may be due to the artifacts created by LTLSS, which can be heard in the audio samples. One theory is that these artifacts appear because LTLSS changes the magnitude spectrum without corresponding changes to the phase spectrum (the phase spectrum carries a greater proportion of the speech information for long DFT window lengths than for short ones, as discussed in Oppenheim and Lim's 1981 paper on phase). The use of LTLSS also lowered performance in the MR far mic case when both the large training set and adaptation were used. The performance loss in that case is larger than reported in the paper: due to a calculation error we reported a -2% relative change in WER when the true value is -20%.

It would be interesting to know what the performance would be for window lengths shorter than the one we used, but longer than the window length used in CMS. Perhaps in that case the artifact problem would be less serious.

Test condition                   Large train set   Small train set
Near (MR)                        2.0/1.5           2.3/1.2
Near (MR) LTLSS                  2.9/2.3           4.2/2.7
Clean (TI)                       0.7/0.5           0.5/0.4
Clean (TI) LTLSS                 1.0/0.8           1.2/0.8
Artificial reverb (TI)           9.2/4.1           16.3/9.7
Artificial reverb (TI) LTLSS     6.1/3.5           5.1/3.1
Far (MR)                         4.8/3.0           12.7/5.1
Far (MR) LTLSS                   4.5/3.6           6.4/4.2

Other results with LTLSS and LDMN

Please contact David Gelbart if you have results so that they can be noted here.

Gelbart and Morgan, ICSLP 2002

Further results are available in the ICSLP 2002 paper by Gelbart and Morgan (available on the Publications page), and on this web page which accompanies that paper.

Gelbart, AVIOS 2002

Results using an "online" version of LTLSS with less algorithmic delay are available in the AVIOS 2002 paper by Gelbart, which is available on the ICSI Publications page.

Pujol, Nadeu, Macho and Padrell, ICSLP 2004

The ICSLP 2004 paper by Pere Pujol, Climent Nadeu, Dusan Macho, and Jaume Padrell gives results on the SPEECON database. They used a 0.256 s window length instead of 1 s. The best window length might depend on the degree of reverberation, the training data being used, and perhaps other factors.

Thomas, Ganapathy, and Hermansky, IEEE Signal Processing Letters, 2008

The paper is titled "Recognition Of Reverberant Speech Using Frequency Domain Linear Prediction".

There is also a related IDIAP report, "Front-end for Far-field Speech Recognition based on Frequency Domain Linear Prediction".

In these articles, LDMN and LTLSS perform better than cepstral mean subtraction (CMS). (The LTLSS code is run with a 32 ms window length. In these experiments the processing is done on each utterance independently, rather than grouping utterances together, which is why the authors decided to use a short window length rather than the 1.024 s used in the Gelbart and Morgan ICSLP 2002 paper.) The authors' new FDLP-based feature extraction methods perform better than both LDMN and LTLSS. FDLP source code is available here and a web page with more details is planned.

The two articles are related but not identical. The "Recognition Of Reverberant Speech..." journal paper uses FDLP-based features directly in ASR. The "Front-end for Far-field Speech Recognition..." report uses the same basic concept but presents a way to turn the output into a time-domain waveform, which can then be used as input into other ASR feature extraction algorithms such as MFCC or PLP.

The "Recognition Of Reverberant Speech..." journal paper is accompanied by an MLMI 2008 paper titled "Hilbert Envelope Based Features for Far-Field Speech Recognition" which contains additional experimental results. The "Front-end for Far-field Speech Recognition..." report is accompanied by an INTERSPEECH 2008 paper.

Gelbart, unpublished

See the sections "Speech recognition results with HTK: corrections and new results" and "ASRU 2001 SRI system results" above.

University of Erlangen-Nuremberg, unpublished

Tino Haderlein of the University of Erlangen-Nuremberg kindly sent us the unpublished experimental results on EMBASSI data given below, quoted from his email messages:

I had to ask [Reinhard Weiß], who did all the experiments. ... He says the [LTLSS] performance was worse than with KNN but the KNN was trained on a very specific combination of close-talk and room microphone which also occurred in the test data. So you actually can't compare the results.

The training data was for all recognizers taken from Embassi session 5 and 10, speakers 1-12, 60 sentences each; validation data were speakers 13 and 14 from the same sessions.

The test data were always far-distant (2.5 meters) recordings from Embassi session 10 (speakers 15-20, 60 sentences each).

On a recognizer trained with the close-talk data only, the far distant test recordings showed the worst results, of course (62% WER), and on a recognizer trained with far-distant recordings the best WER (36%). The WER of the other tested approaches was between those:

  1. The close-talk recognizer, but with a filtering KNN during the feature extraction: The KNN is trained on about 5 minutes of the synchronously recorded Embassi data and makes the conversion of the far-distant to the close-talk signals. WER: 47%
  2. The close-talk recognizer, but with LTLSS as preprocessing operation (parameters from [Gelbart 2001]) on all signals (training, validation, test; close-talk and far-distant). WER: 52%

As the data was all from the EMBASSI corpus it was all from the room described in my paper from TSD 2003 (where I presented the mu-law features).

Below are some results.... Comparing the close-talk recognizer and the recognizer trained with LTLSS-filtered close-talk files you can see that the recognition for the distant-talking microphones works much (significantly) better. [However] the speech signals of the EMBASSI corpus are very short (about 2-4 seconds) which is about the size of the averaging window. This might cause unreliable results.

Table 1: WER of the recognizer trained with close-talking signals for test data from four different microphones (headset and three array microphones); results are given for the training set (rectrain), validation set (recvalid) and test set (rectest) of the recognizer.
data set Close-Talk Mic. 01 Mic. 06 Mic. 11
rectrain 11.00 46.44 43.62 50.12
recvalid 17.94 53.35 53.81 62.45
rectest 31.91 57.46 56.29 60.41

Table 2: WER of the recognizer trained with distant-talking signals (from microphone 06) for test data from four different microphones (headset and three array microphones).
data set Close-Talk Mic. 01 Mic. 06 Mic. 11
rectrain 42.91 21.12 14.77 22.11
recvalid 48.06 29.42 26.97 28.90
rectest 61.01 35.46 35.53 38.06

Table 3: WER of the recognizer trained with LTLSS-filtered close-talking signals (LTLSS parameters from [Gelbart01]) for test data from four different microphones (headset and three array microphones).
data set Close-Talk Mic. 01 Mic. 06 Mic. 11
rectrain 14.79 32.97 31.66 34.79
recvalid 26.77 40.97 39.23 41.87
rectest 38.01 48.68 48.95 50.72