The ICSI RESPITE AURORA Multistream recognizer


Note: the most recent results are on the Aurora 2000 diary page.

As part of the EC RESPITE collaboration, we are developing a recognizer for the AURORA noisy-digits task, based on multi-stream techniques. This page contains details of our progress.

The AURORA task is based on the TIDIGITS continuous digits task, but includes a range of background-noise corruptions at a variety of signal-to-noise ratios. It is thus a useful domain for the development of robust features. The full task includes a standard HTK-based Gaussian-mixture recognizer. We, however, have developed our own hybrid connectionist-HMM recognizer to use instead.

p.s. Don't miss the companion Feature Statistics Comparison Page or the Page Describing How Bootlabels Were Created.


Summary results

Summary WER results are averaged across all 4 noise types. Click on any description to go to that section of the page.

Date Description Clean WER SNR15 WER SNR5 WER SNR-5 WER Ratio to HTK
1999apr19 HTK baseline 1.4% 3.7% 15.9% 68.0% 100.0%
1999apr19 HTK MSG baseline 4.4% 8.1% 25.2% 76.9% 193.1%
1999may28 NN rasta8+d 2.5% 4.3% 15.2% 70.3% 120.0%
1999jun01 NN msg3N 2.1% 2.9% 11.6% 49.2% 87.1%
1999jun03 NN msg3N+ras8d 1.3% 2.2% 9.5% 48.1% 66.0%
1999jun06 NN msg3N-ras8d 1.4% 2.8% 10.5% 47.1% 74.9%
1999jun07 NN ras8+d+dd 2.2% 3.7% 12.7% 65.3% 102.9%
1999jun09 NN plp12N 2.8% 3.4% 13.0% 52.5% 104.2%
1999jun09 NN plp12N+d 2.6% 2.8% 10.6% 47.9% 89.6%
1999jun09 NN plp12Nd+msg3N 1.3% 1.9% 8.5% 43.9% 60.6%
1999jun29 NN 4-way combo 1.8% 2.4% 10.5% 49.6% 76.5%
1999jul22 HTK msg3N 6.0% 7.8% 23.2% 66.0% 205.0%
1999jul22 HTK msg3NK 7.0% 7.4% 24.8% 72.2% 220.5%
1999jul22 NN msg3NK 2.1% 3.1% 11.4% 49.6% 89.5%
1999jul29 HTK msg3NKG 5.6% 6.4% 21.5% 66.8% 184.5%
1999jul29 NN msg3NKG 2.2% 3.2% 11.9% 51.2% 93.3%
1999jul29 HTK msg3NG 5.9% 6.5% 22.1% 69.9% 190.1%
1999jul29 NN msg3NG 2.1% 2.9% 12.0% 52.3% 87.7%
1999aug03 HTK msg3NGK 6.7% 7.5% 27.3% 72.2% 210.3%
1999aug03 NN msg3NGK 2.3% 3.2% 12.1% 52.5% 93.3%
1999aug09 HTK lna1 1.1% 1.9% 8.2% 46.1% 59.1%
1999aug11 HTK lna1L 0.9% 1.8% 8.9% 48.8% 58.6%
1999aug11 HTK lna1LG 2.0% 3.5% 12.6% 56.0% 99.8%
1999aug12 HTK lin-plp 1.7% 2.5% 9.9% 47.4% 73.8%
1999aug12 HTK lin-msg 1.6% 2.5% 9.5% 47.6% 71.4%
1999aug13 HTK lin-sum 0.9% 1.6% 7.7% 44.1% 51.6%
1999aug29 NN mfcc13N+d 2.2% 2.6% 9.9% 49.1% 82.1%
1999nov05 NN mfcc13+d+dd 1.6% 2.6% 8.7% 72.7% 84.6%
1999sep15 NN mfcc train-on-clean 2.2% 9.9% - - 221.2%
1999oct20 NN plp12 d+dd 1.7% 2.6% 8.7% 70.9% 82.4%
1999oct19 NN msg3 nonorm 1.7% 2.6% 9.5% 54.5% 78.4%
1999oct16 HTK lin-sum+KLT 0.7% 1.5% 7.2% 42.5% 47.2%
1999oct20 HTK OGI lda 1.3% 2.5% 8.8% 56.9% 74.1%
1999oct23 HTK OGI trap 3.8% 3.8% 12.1% 55.8% 119.5%
1999oct22 HTK lda+trap-KLT 1.1% 1.8% 7.7% 45.3% 55.5%
1999oct21 HTK plp+msg+lda+trap-KLT 0.8% 1.3% 6.6% 47.0% 46.9%
1999oct27 HTK plp12dd+msg3+trap KLT 0.7% 1.3% 6.5% 44.0% 44.9%


HTK baseline (1999apr19)

Brian Kingsbury installed the basic AURORA package and got it working at our site. From this, we get the baseline HTK results across the 28 testsets (4 noise types by 7 signal-to-noise ratios). Here they are expressed in Word Error Rate percentages (WER%):

	HTK baseline:

		Hall	Babble	Train	Car	avg
	CLEAN	 1.35	 1.51	 1.52	 1.36	 1.44
	SNR20	 2.40	 3.36	 2.00	 1.57	 2.33
	SNR15	 3.87	 6.17	 2.74	 1.91	 3.67
	SNR10	 7.15	13.63	 5.22	 2.44	 7.11
	SNR5	16.92	29.78	11.78	 5.25	15.93
	SNR0	52.69	50.70	33.46	20.36	39.30
	SNR-5	81.46	68.32	69.43	52.79	68.00

	Mean ratio to base = 100.0% (by definition)

These match the standard results shipped with the AURORA task, which I think are 11th order MFCCs with deltas and double-deltas (for 36 elements per feature vector) and separate multi-state whole-word models for each of the 13 words (0-9, oh, sil and ??). I think states were automatically tied between words.

Brian also used the standard AURORA set-up to test his Modulation-filtered Spectrogram features as input to an HTK GMM/HMM system:

	HTK system using MSG features:

		Hall	Babble	Train	Car
	CLEAN	 4.45	 4.14	 4.38	 4.63
	SNR20	 5.29	 7.35	 4.47	 4.47
	SNR15	 7.46	13.72	 6.11	 4.91
	SNR10	11.64	26.21	 8.56	 5.92
	SNR5	26.25	48.04	17.51	 8.95
	SNR0	59.07	76.66	41.63	19.65
	SNR-5	85.66	100.2	72.77	48.78

	Mean ratio to base = 193.1%

The MSG system significantly under-performs the baseline MFCC system, although it has a smaller feature vector and thus fewer parameters. One factor could be that the modulation-spectral features are not orthogonalized (i.e. they are spectral rather than cepstral), which might present a greater burden on the GMMs (although we have seen no benefit from using the DCT with our neural-network classifiers).

Brian also tried combining the two HTK systems, but the results were in most cases worse than either system alone, suggesting a very unsuccessful combination strategy.

RASTA baseline (1999may28)

Our standard recognizer framework at ICSI is the so-called hybrid connectionist-HMM system, where a neural network replaces the Gaussian mixtures as the acoustic model, estimating the posterior probability of each label class given a temporal context of feature vectors. We developed the RASTA-PLP features as a robust response to the problems of variable signal level and channel characteristics. The simple standard baseline system whose results are shown uses 9 RASTA-PLP cepstral coefficients plus deltas, a nine-frame temporal context window, and a 480 hidden-unit neural network. (It also uses the /q/ phone for "oh oh" disambiguation).

	rasta8+d (18x9):480:56 iteration 5+q:

		Hall	Babble	Train	Car
	CLEAN	  2.4	  2.6	  2.4	  2.4
	SNR20	  2.4	  5.4	  2.4	  2.5
	SNR15	  4.1	  7.4	  2.8	  2.8
	SNR10	  7.2	 13.5	  4.7	  3.8
	SNR5	 17.0	 26.0	 10.3	  7.5
	SNR0	 46.0	 47.2	 30.7	 22.5
	SNR-5	 83.6	 68.7	 69.8	 59.1

	Mean ratio to base = 120.0%

The "Mean ratio to base" is the average ratio between to the WER figures here and those of the HTK baseline reported above. We see that the RASTA baseline is about 20% worse than the HTK baseline in this configuration.

Adding MSG information (1999jun01)

We have recently had good results from a novel feature type, known as the Modulation-filtered Spectrogram or MSG. This feature filters energy envelopes in Bark-scaled subbands to emphasize different bands of modulation energy, then applies several stages of automatic gain control to the result. The particular form used here comprises two banks, covering roughly the 0-8 and 8-16 Hz modulation bands. The features taken alone perform well, significantly improving on the HTK baseline:

	msg3N (28x9):480:56 iteration 2:

		Hall	Babble	Train	Car
	CLEAN	  1.9	  2.4	  2.1	  1.8
	SNR20	  2.0	  3.2	  1.8	  1.8
	SNR15	  2.9	  4.4	  2.4	  2.0
	SNR10	  5.1	  8.6	  3.6	  2.6
	SNR5	 12.9	 20.0	  8.9	  4.4
	SNR0	 30.8	 41.8	 19.0	 10.5
	SNR-5	 57.2	 67.8	 45.6	 26.0

	Mean ratio to base = 87.1%

The posteriors calculated in the hybrid model permit a very simple kind of model combination, where the posterior probabilities from several models are simply averaged in the log domain (geometric average). This works remarkably well, because a poor acoustic match tends to result in equivocal posterior estimates, which are then 'washed out' in combination with a better model. Combining the MSG with our earlier RASTA model gives a significant improvement, despite the large difference in baseline performance:

	ras8+d-480h-i5+q + msg3N-480h-i2 LNA combo:

		Hall	Babble	Train	Car
	CLEAN	  1.3	  1.7	  1.6	  1.2
	SNR20	  1.6	  2.6	  1.3	  1.5
	SNR15	  2.4	  3.8	  1.8	  1.9
	SNR10	  4.6	  7.8	  3.1	  2.7
	SNR5	 11.3	 18.8	  7.4	  4.3
	SNR0	 32.1	 43.2	 19.5	 11.5
	SNR-5	 69.5	 68.7	 53.4	 36.6

	Mean ratio to base = 77.0%

Note that in a few of the high-noise conditions, combining gives a worse error rate than the MSG model alone. Hopefully, improving the RASTA model can eliminate this.
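
For reference, the posterior combination used here (and in the combinations below) is just a frame-by-frame average of the log posteriors from the separate nets (i.e. a geometric mean), renormalized before decoding. A minimal sketch, assuming each stream delivers a (frames x 56) posterior matrix:

	import numpy as np

	def combine_posteriors(streams, eps=1e-30):
	    """Frame-by-frame geometric-mean combination of phone posteriors.
	    streams: list of (n_frames, 56) posterior matrices from different nets."""
	    logp = sum(np.log(np.maximum(p, eps)) for p in streams) / len(streams)
	    p = np.exp(logp)
	    return p / p.sum(axis=1, keepdims=True)     # renormalize each frame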

Improved MSG/RAS combo (1999jun03)

I did some more combination experiments, testing the hypothesis that it might be better to have separate nets for the two MSG 'banks' (the 0-8 and 8-16 Hz modulation bands). This combination performed slightly worse than the single net trained on both bands, so there may be a requirement for the information to be more independent for posterior-combination to be successful (I'm assuming, based on our previous experience, that training a single net on both rasta and MSG features wouldn't do as well as the combination, but I guess I should test that).

However, I did notice that the insertion/deletion ratio for the multi-way combination (msg3a + msg3b + ras) was very skewed - with more than 10x as many deletions as insertions - indicating poorly tuned decoder parameters. Playing with the noway decoder's phone_deletion_penalty allowed me to optimize this, so I went back to look at the previous msg+ras combo. This was also improved by an adjusted transition penalty, which I optimized by a search over decodes of the 800 utterances held out from the training set for cross-validation, which include a mix of SNRs:

        ras8+d-480h-i5+q + msg3N-480h-i2 LNA combo, pdp=0.25:

	WER%	Hall	Babble	Train	Car
	CLEAN	  1.2	  1.4	  1.0	  1.4
	SNR20	  1.3	  2.9	  1.0	  1.3
	SNR15	  2.0	  4.1	  1.3	  1.4
	SNR10	  4.2	  7.9	  2.6	  2.0
	SNR5	 10.4	 18.0	  5.7	  3.7
	SNR0	 26.4	 39.3	 15.8	  8.6
	SNR-5	 57.7	 64.8	 42.4	 27.5

        Mean ratio to base = 66.0%
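
The phone_deletion_penalty tuning described above amounted to a one-dimensional search over decodes of the held-out cross-validation set. A sketch of the procedure is below; decode_wer is a stand-in for running the noway decoder and scoring (not a real function), and the candidate values are only illustrative:

	def tune_pdp(decode_wer, candidates=(0.05, 0.10, 0.15, 0.20, 0.25, 0.30)):
	    """Pick the phone_deletion_penalty that minimizes WER on the CV set.
	    decode_wer(pdp) -> word error rate over the 800 held-out utterances."""
	    results = {pdp: decode_wer(pdp) for pdp in candidates}
	    best = min(results, key=results.get)
	    return best, results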

Single MSG-RAS net (1999jun06)

The previous result used both rasta and msg features by averaging the log-posterior-probabilities coming out of each of two separate networks; a more direct method of using both sets of features would be to train a single network on both sets at once. Indeed, given the vaunted modeling power of the nets, this would normally be expected to be the best thing to do.

I trained such a net for the rasta and msg features. This net has 18 rasta+d features and 28 msg3N features for a total input layer size of (9x(18+28)) = 414 units. I kept the hidden layer fixed at 480 units, so that the total parameter count is approximately the same as the sum of those in the two nets for the combo system above.

        msgras 1net PDP=0.3

	WER%	Hall	Babble	Train	Car
	CLEAN	  1.2	  1.5	  1.3	  1.6
	SNR20	  1.8	  3.7	  1.0	  1.5
	SNR15	  2.8	  5.0	  1.7	  1.6
	SNR10	  5.3	  9.7	  3.3	  2.4
	SNR5	 11.4	 19.4	  6.6	  4.4
	SNR0	 26.6	 41.1	 16.4	  9.0
	SNR-5	 55.3	 65.2	 43.0	 25.0

	Mean ratio to HTK base = 74.9%

Although this net does well (better than either net alone), it is still 14% worse than the posterior-combo approach. We've seen this repeatedly (that posterior combination is better than a single large network), but it is certainly curious.

Better rasta baseline? (1999jun07)

Since the rasta side of the combination is so much weaker (and so much worse than the HTK MFCC baseline), I wonder if it can't be improved? I tried using both deltas and double deltas, like the HTK baseline, for 27 elements per feature vector.

        ras8+d+dd-i1 pdp=0.15

	WER%	Hall	Babble	Train	Car
	CLEAN	  2.1	  2.1	  1.9	  2.5
	SNR20	  2.5	  5.1	  1.7	  1.6
	SNR15	  3.7	  6.6	  2.4	  2.2
	SNR10	  6.9	 11.1	  4.0	  3.0
	SNR5	 15.1	 21.9	  8.1	  5.6
	SNR0	 40.3	 42.1	 24.9	 18.9
	SNR-5	 78.8	 65.0	 62.9	 54.5

        Mean ratio to HTK base = 102.9%

This is a bit more respectable, but it's not great. It's doing significantly worse than HTK/MFCC at the best SNRs.
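
For reference, the deltas and double-deltas used here (and in the HTK baseline) are the usual regression-style time derivatives of each feature track. A minimal sketch of one common formulation; the +/-2 frame window is an assumption, not necessarily what was used here:

	import numpy as np

	def deltas(feats, K=2):
	    """Regression-style time derivatives over a +/-K frame window.
	    feats: (n_frames, n_dims); returns an array of the same shape."""
	    n = len(feats)
	    idx = np.arange(n)
	    num = np.zeros_like(feats, dtype=float)
	    for k in range(1, K + 1):
	        lead = feats[np.minimum(idx + k, n - 1)]    # clamp at the utterance edges
	        lag = feats[np.maximum(idx - k, 0)]
	        num += k * (lead - lag)
	    return num / (2.0 * sum(k * k for k in range(1, K + 1)))

	# ras8+d+dd: statics, deltas and double-deltas stacked, 27 elements per frame
	# full = np.hstack([c, deltas(c), deltas(deltas(c))])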

Plain PLP features (1999jun09)

In Broadcast News, we were surprised to find that per-utterance-normalized 12th-order PLP features were the best choice, significantly better than RASTA. So it was worth testing them here, even though 12th order is typically excessive for telephone-bandwidth data.

	plp12N-i1 (retrain based on i0 alignment)
        WER%    Hall    Babble  Train   Car
        CLEAN     2.8     2.9     2.6     2.9
        SNR20     2.5     3.5     2.1     2.3
        SNR15     3.7     5.1     2.3     2.5
        SNR10     7.0    10.5     3.8     2.9
        SNR5     14.6    24.4     7.8     5.3
        SNR0     33.4    49.9    20.3    10.9
        SNR-5    58.8    74.6    50.7    25.9
			Mean ratio to HTK base = 104.2%

I carried out embedded retraining through 4 iterations, but this was the best result. This is for 13 features per frame, so it's about half the number of parameters of the similarly-performing ras+d+dd net.

PLP plus deltas (1999jun09)

Since the PLP features seemed to be doing well, I tried a net including their deltas, for an overall geometry of (9x26):480:56, again through 4 iterations of embedded retraining. The third iteration was best, marginally:

        plp12N+d-i2
	WER%	Hall	Babble	Train	Car
	CLEAN	  2.5	  2.8	  2.4	  2.5
	SNR20	  2.1	  3.3	  1.5	  2.0
	SNR15	  2.9	  4.3	  1.9	  2.1
	SNR10	  5.2	  8.6	  3.2	  2.6
	SNR5	 12.1	 19.0	  6.9	  4.2
	SNR0	 28.1	 42.8	 16.9	  8.5
	SNR-5	 53.2	 69.5	 47.9	 21.0
		 	Mean ratio to HTK base = 89.6%

That's pretty good - compares nicely to the msg3N net at 87.1%, with a similar number of parameters. So let's try them together...

plp12Nd and msg3N posterior combo (1999jun09)

As before, this system operates by running separate networks on the msg and plp features, then averaging the log posterior probabilities:

        msg3N+plp12Nd-pdp4 :
        WER%    Hall    Babble  Train   Car
        CLEAN     1.2     1.4     1.2     1.3
        SNR20     1.4     2.0     1.3     1.1
        SNR15     1.8     3.1     1.4     1.2
        SNR10     3.5     6.6     2.3     1.8
        SNR5      9.3    16.4     5.0     3.1
        SNR0     23.3    39.1    14.7     6.6
        SNR-5    49.3    67.0    40.9    18.3
                        Mean ratio to HTK base = 60.6%

I did a search over the phone_deletion_penalty parameter to balance the insertions/deletions for the best result. Although I had been running with pdp=0.15 as default, the curve was pretty flat. A decode with pdp=0.15 gave an overall ratio to the HTK baseline of 63%.

Results summary (1999jun29)

This collection of overall ratio to HTK WER%s for a set of different nets and combinations gives some indication of the relative value of feature and probability combination. "-" indicates features combined by feeding into a single net, "+" indicates posterior combination of the output of separate nets (thus, "-" binds more tightly than "+"):

        plp12N                          105.9%
        dplp12N (deltas)                125.6%
        plp12N-dplp12N                   89.6%
        plp12N+dplp12N                   89.7%

        msg3aN (0-8 Hz)                 112.7%
        msg3bN (8-16 Hz)                141.6%
        msg3aN-msg3bN                    85.8%
        msg3aN+msg3bN                    99.5%

        msg3aN-plp12N                    86.4%
        msg3aN-dplp12N                   87.5%
        msg3bN-plp12N                    78.1%
        msg3bN-dplp12N                   82.6%

        msg3aN-msg3bN-plp12N-dplp12N     74.1%
        msg3aN-msg3bN+plp12N-dplp12N     63.0% **
        msg3aN-dplp12N+msg3bN-plp12N     70.1%
        msg3aN-plp12N+msg3bN-dplp12N     68.1%
        msg3aN+msg3bN+plp12N+dplp12N     76.5%

All the results in the last block have the same four feature streams, combined with different configurations of feature and posterior merging. Posterior merging of an all-plp and an all-msg net appears to perform the best (**), supporting the hypothesis that posterior combination works best for nets based on relatively independent feature streams, and indeed better than feeding the streams into a single net.

Because the cross-feature-type nets performed so well, I thought I would round out all the possible combinations by trying them in mixtures that involve one feature being repeated. That works surprisingly well, although nothing too miraculous:

        msg3aN-dplp12N+msg3bN-dplp12N    74.6%
        msg3aN-plp12N+msg3bN-plp12N      71.2%
        msg3aN-plp12N+msg3aN-dplp12N     76.0%
        msg3bN-plp12N+msg3bN-dplp12N     68.4%

MSG versus MFCC feature comparison (1999jul15)

In order to confirm our suspicions about why the MSG features fared so poorly with the HTK baseline system, I calculated covariance matrices and some feature histograms for the MSG features and the HTK-produced MFCC features. You can see these plots in the companion MSG-MFCC Comparison Page.

Return to HTK (1999jul22)

Although the MSG features are dreadfully far from what HTK expects, it's not hard to modify them to remove correlations or to adjust their distributions. In order to test this as a way to help HTK, we needed a baseline HTK system. I went back to Brian's versions of the scripts to begin to learn about using HTK; I re-ran the MFCC baseline and reproduced the base results; execution took some 9 hours on half of our 300 MHz Ultra-60. I then converted the 28 element msg3N features into HTK format and ran a system based on those. This took about 12.5 hours, which doesn't include feature calculation. The results, below, are worse than Brian's HTK-MSG result: Brian had only a 26 element vector, excluding, I assume, the bottom frequency band (which is not informative for telephone speech). Brian's features were also based on his filters lo0_hi8_n21_dn5 and lo8_hi16_n21 with agc time constants of 160 and 320 ms, which is the same as I am using.

Here are my msg3N/HTK results:

        msg3N/HTK :
        WER%    Hall    Babble  Train   Car
        CLEAN   5.7     6.3     5.7     6.4
        SNR20   5.4     7.3     5.0     5.0
        SNR15   7.3     12.4    6.4     5.1
        SNR10   11.6    20.7    9.6     5.9
        SNR5    23.3    42.0    18.4    8.9
        SNR0    47.7    67.0    40.2    16.5
        SNR-5   72.3    85.4    67.8    38.4
                        Mean ratio to HTK base = 205%

Next I tried orthogonalizing the MSG features via the Karhunen-Loeve transform, as implemented by pfile_klt. This training took 13.5 hours, but, surprisingly, didn't help at all:

        msg3N+KLT :
        WER%    Hall    Babble  Train   Car
        CLEAN   6.9     6.7     7.2     7.2
        SNR20   5.4     7.4     5.1     5.5
        SNR15   7.9     11.2    5.8     4.8
        SNR10   14.3    19.7    9.0     5.5
        SNR5    32.4    37.7    19.4    9.6
        SNR0    64.9    61.4    40.1    23.4
        SNR-5   86.7    80.5    69.2    52.4
                        Mean ratio to HTK base = 220.5%

I also tried training a neural net on the features to see how they did:

        msg3NK-i1 :
        WER%    Hall    Babble  Train   Car
        CLEAN     1.6     2.4     1.9     2.3
        SNR20     1.8     3.2     2.1     2.0
        SNR15     2.8     5.0     2.5     2.1
        SNR10     5.6     9.1     4.1     2.6
        SNR5     12.9    19.6     8.1     5.1
        SNR0     30.8    42.6    19.6    10.2
        SNR-5    60.1    67.5    44.5    26.4
                        Mean ratio to HTK base = 89.5%

That compares to an overall ratio of 87.1% for the basic msg3N features. All very confusing - I'll be looking out for a bug.
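
As background, the KLT applied by pfile_klt is just a global decorrelation: rotate the features into the eigenvector basis of their overall covariance. A minimal sketch of that idea (not the actual pfile_klt implementation):

	import numpy as np

	def klt(feats):
	    """Full-rank Karhunen-Loeve transform (global PCA rotation) of a
	    (n_frames, n_dims) feature matrix."""
	    mu = feats.mean(axis=0)
	    centred = feats - mu
	    evals, evecs = np.linalg.eigh(np.cov(centred, rowvar=False))
	    order = np.argsort(evals)[::-1]             # largest-variance directions first
	    T = evecs[:, order]
	    return centred @ T, T, mu                   # rotated features + transform for re-use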

Gaussianizing the global PDF (1999jul29)

Just applying a global decorrelation to the msg features doesn't seem to help with HTK (actually, it hurts), and doesn't make much difference for the net (which is what I would have guessed, but nice to see). So perhaps the highly non-Gaussian distributions are a problem? We can of course apply a static monotonic nonlinear mapping to each channel to make a histogram of the values look Gaussian, and Jeff's pfile_gaussian -d 5 -h 2000 does just that. I modified it to be able to read in and re-use a previously-defined mapping, and applied it, first to the KLT-transformed features of the previous test (note that separate nonlinear transforms will generally mess up the perfect decorrelation of the KLT). Here are the results of using these features with HTK and with a neural net:

        msg3NKG/HTK :
        WER%    Hall    Babble  Train   Car
        CLEAN   5.6     5.1     5.7     6.0
        SNR20   4.4     5.9     4.4     4.3
        SNR15   6.0     10.4    4.9     4.4
        SNR10   12.1    19.6    7.9     4.9
        SNR5    25.2    36.9    16.9    7.1
        SNR0    55.0    59.4    35.2    17.3
        SNR-5   81.5    78.7    64.0    42.9
                        Mean ratio to HTK base = 184.5%

That's a surprise - a very simple mapping, adjusting the global feature histograms, improves the overall WER merit ratio by something like 20% relative.

The NN-based results are less exciting:

        msg3NKG-i3 :
        WER%    Hall    Babble  Train   Car
        CLEAN     2.0     2.5     1.9     2.4
        SNR20     1.9     3.5     2.3     2.3
        SNR15     2.8     5.4     2.6     1.9
        SNR10     5.2     9.2     4.1     2.7
        SNR5     12.6    21.8     8.2     4.9
        SNR0     32.2    44.1    19.9    11.1
        SNR-5    60.2    69.5    47.7    27.4
                        Mean ratio to HTK base = 93.3%

This is a little worse than pre-Gaussianizing, but probably not significantly different.
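
To be explicit about what the Gaussianizing step does: each feature channel gets a fixed monotonic remapping so that its global histogram looks unit Gaussian. Below is a minimal rank-based sketch of the idea; the real pfile_gaussian works from a binned histogram (the -h 2000 above) and can save and re-use the mapping, which this toy version does not:

	import numpy as np
	from scipy.stats import norm

	def gaussianize_channel(x):
	    """Static monotonic remapping of one feature channel so that its
	    marginal distribution looks like N(0,1), via the empirical CDF."""
	    ranks = np.argsort(np.argsort(x))           # 0..n-1 rank of each value
	    cdf = (ranks + 0.5) / len(x)                # empirical CDF in (0, 1)
	    return norm.ppf(cdf)                        # inverse Gaussian CDF

	def gaussianize(feats):
	    """Apply the mapping independently to every channel of (n_frames, n_dims)."""
	    return np.column_stack([gaussianize_channel(c) for c in feats.T])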

Gaussianized msg3N (1999aug03)

Having tried first decorrelating the msg features with the KLT, possibly followed by Gaussianizing, the alternative is to Gaussianize the unmodified msg3N, possibly followed by decorrelation:

        msg3NG/HTK :
        WER%    Hall    Babble  Train   Car
        CLEAN   5.9     5.8     5.4     6.4
        SNR20   3.9     6.3     4.2     4.6
        SNR15   5.5     9.8     5.7     4.8
        SNR10   9.4     21.5    8.5     5.2
        SNR5    19.6    42.7    17.9    8.1
        SNR0    45.3    71.5    40.0    15.9
        SNR-5   76.5    92.7    69.4    41.0
                        Mean ratio to HTK base = 190.1%

So, this is a little better than msg3N and a tiny bit worse than msg3NKG, but not much different. For the neural net:

        msg3NG-i1 :
        WER%    Hall    Babble  Train   Car
        CLEAN     1.7     2.4     2.2     2.1
        SNR20     1.8     3.1     1.8     1.8
        SNR15     2.4     5.2     2.2     1.9
        SNR10     4.9    10.1     3.5     2.3
        SNR5     12.5    23.4     7.7     4.2
        SNR0     30.8    47.8    21.1    10.5
        SNR-5    61.0    71.1    48.7    28.5
                        Mean ratio to HTK base = 87.7%

That's barely different from the 87.1% ratio for plain msg3N.

Finally, we can apply the KLT decorrelation to these Gaussianized features, to see how it works in this order. First, for HTK:

        msg3NGK/HTK :
        WER%    Hall    Babble  Train   Car
        CLEAN   6.5     6.6     6.5     7.0
        SNR20   4.5     8.1     4.2     4.3
        SNR15   6.7     13.6    5.4     4.2
        SNR10   13.1    23.4    9.9     5.1
        SNR5    29.3    41.9    21.2    8.7
        SNR0    59.9    64.5    43.7    20.6
        SNR-5   85.1    84.0    69.8    50.0
                        Mean ratio to HTK base = 210.3%

That's pretty disappointing: it has actually erased the benefit of plain Gaussian remapping. Next, for the neural net modelling:

        msg3NGK-i1 :
        WER%    Hall    Babble  Train   Car
        CLEAN     1.9     2.5     2.4     2.4
        SNR20     2.1     3.2     2.1     2.1
        SNR15     2.7     5.6     2.3     2.0
        SNR10     4.9    10.0     4.1     2.6
        SNR5     13.0    22.6     8.3     4.6
        SNR0     31.3    45.0    19.9    10.9
        SNR-5    62.8    69.9    48.6    28.8
                        Mean ratio to HTK base = 93.3%

That has also suffered a little from decorrelation, although not very significantly. Still, it's supposed to help!

Using NN posteriors as HTK features (1999aug09)

When I discussed this problem with Hynek Hermansky and Sangita Sharma of the Oregon Graduate Institute, they mentioned that they had obtained surprisingly good results using the posterior probabilities generated by neural-net acoustic models as input features to the HTK system. Since the vocabulary is so limited, it turns out that only 24 elements of the 56-element posterior probability vector are ever used, so I produced a 24-element `feature vector' for HTK consisting of the posterior probabilities from my msg3N+plp12Nd NN model (my best performing system):

        HTK/lna1 :
        WER%    Hall    Babble  Train   Car
        CLEAN   0.8     1.2     1.1     1.2
        SNR20   1.1     1.8     1.2     1.1
        SNR15   1.9     2.7     1.5     1.3
        SNR10   3.9     5.6     2.2     1.9
        SNR5    9.8     13.9    5.7     3.4
        SNR0    24.6    34.2    15.8    8.0
        SNR-5   53.2    65.4    43.7    22.1
                        Mean ratio to HTK base = 59.1%

This works amazingly well, performing even a little better than the best decoding I had obtained within the connectionist system (of 60.6% avg). This is particularly impressive when you consider that these linear probabilities are absolutely bounded by 0 and 1, and are highly skewed, spending most of their time around zero.
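
Mechanically, building these 'posterior features' just means picking out the columns of the 56-wide posterior matrix that correspond to the phone classes the digit lexicon actually uses (24 of them) and handing those to HTK as an ordinary feature stream. A hedged sketch, where USED_CLASSES is a hypothetical index list standing in for the real 24 indices:

	import numpy as np

	USED_CLASSES = [0, 3, 5, 7]     # hypothetical; the real list has the 24 phones used by the digits

	def posterior_features(post):
	    """post: (n_frames, 56) posteriors from the NN acoustic model ->
	    (n_frames, len(USED_CLASSES)) matrix to train/test HTK on."""
	    return np.asarray(post)[:, USED_CLASSES]

	# the lna1L variant below simply takes logs first:
	# log_feats = np.log(np.maximum(post, 1e-30))[:, USED_CLASSES]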

Next I tried using the plain logs of the probabilities, to get a slightly more uniform distribution (although see the companion distributions page to see how little). This gave a tiny improvement, implying that the distributions weren't much better, or perhaps that HTK doesn't care:

        lna1L/HTK :
        WER%    Hall    Babble  Train   Car
        CLEAN   0.9     0.7     1.1     0.9
        SNR20   1.4     1.7     1.0     1.0
        SNR15   2.1     2.8     1.3     1.1
        SNR10   4.4     6.0     2.4     1.8
        SNR5    11.0    15.4    5.7     3.6
        SNR0    27.4    39.7    15.9    8.8
        SNR-5   57.0    68.1    44.9    25.1
                        Mean ratio to HTK base = 58.6%

I thought maybe these would be sufficiently spread out to be able to remap to a Gaussian distribution, but the heavy skew at the lowest possible value means that this just didn't work very well. Here are the results:

        lna1LG/HTK :
        WER%    Hall    Babble  Train   Car
        CLEAN   2.1     2.1     1.9     2.0
        SNR20   2.4     3.0     2.3     2.4
        SNR15   4.2     4.2     2.6     2.9
        SNR10   7.4     8.8     3.9     3.7
        SNR5    15.9    21.0    8.2     5.4
        SNR0    36.4    47.7    21.8    12.5
        SNR-5   65.9    74.7    51.1    32.1
                        Mean ratio to HTK base = 99.8%

Linear output layer net output features (1999aug12)

Rather than attempting to remap the skewed and quantized posterior probabilities, Jeff Bilmes suggested I use the values of the net immediately prior to the softmax nonlinearity/normalization. This gave pleasantly smooth statistics and seems to work well for HTK. However, some further smartness must be devised to use it for combinations rather than simply for individual nets. Here are the results for the plp12Nd net outputs, with a linear output layer, used as 24-element features to the HTK system:

        lin-plp/HTK :
        WER%    Hall    Babble  Train   Car
        CLEAN   1.6     1.8     1.8     1.5
        SNR20   1.8     2.4     1.4     1.1
        SNR15   2.9     3.6     1.8     1.5
        SNR10   5.4     7.1     3.3     2.0
        SNR5    11.9    16.9    6.5     4.1
        SNR0    27.1    40.4    17.5    8.7
        SNR-5   50.8    71.3    45.0    22.5
                        Mean ratio to HTK base = 73.8%

The linear outputs of the msg3N network perform similarly well:

        lin-msg/HTK :
        WER%    Hall    Babble  Train   Car
        CLEAN   1.5     1.8     1.6     1.4
        SNR20   1.5     2.1     1.6     1.4
        SNR15   2.4     3.8     1.9     1.7
        SNR10   4.0     7.9     2.6     2.1
        SNR5    9.7     18.7    5.9     3.5
        SNR0    25.4    40.8    16.8    8.6
        SNR-5   54.0    69.3    43.2    23.9
                        Mean ratio to HTK base = 71.4%

Linear-output combination for HTK (1999aug13)

Although it's really just a stab in the dark, I think it makes sense to simply add the vectors from the linear-output nets. The softmax implies a free parameter of a constant offset in each output layer, but as long as it's consistent across samples, the HTK classifier will not be disturbed. Anyway, I added the lin-msg and lin-plp features (using the newly-created pfile_ftrcombo) and trained an HTK model:

        lin-sum/HTK :
        WER%    Hall    Babble  Train   Car
        CLEAN   0.8     1.0     1.0     0.6
        SNR20   1.3     1.6     1.1     0.7
        SNR15   1.9     2.4     1.2     0.9
        SNR10   4.0     5.2     2.3     1.4
        SNR5    9.3     13.6    5.0     2.8
        SNR0    23.2    36.4    13.8    6.7
        SNR-5   48.9    67.5    40.4    19.6
                        Mean ratio to HTK base = 51.6%

This is our best result yet and I'm very pleased with it!
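
The 'free constant offset' argument above is simply that adding the same constant to every linear output leaves the softmax unchanged, so each net's linear outputs are only defined up to such an offset, and a per-frame sum of corresponding outputs is at least a self-consistent feature. A tiny sketch of both points (illustrative arrays in the spirit of the pfile_ftrcombo step, not its actual code):

	import numpy as np

	def softmax(z):
	    e = np.exp(z - z.max(axis=-1, keepdims=True))
	    return e / e.sum(axis=-1, keepdims=True)

	z = np.array([1.0, -2.0, 0.5])
	assert np.allclose(softmax(z), softmax(z + 7.3))    # offset-invariance of the softmax

	def lin_sum(streams):
	    """Sum corresponding linear (pre-softmax) outputs from several nets.
	    streams: list of (n_frames, 24) arrays -> one (n_frames, 24) feature matrix."""
	    return np.sum(streams, axis=0)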

Using Mel-frequency cepstral coefficients with a neural net (1999aug29)

I was preparing a presentation for the RESPITE/SPHEAR workshop based on this material and I realized that it would be an interesting question to reverse the transfer i.e. to see how the hybrid architecture performed using exactly the same acoustic features as the baseline HTK system. So I translated the MFCCs into a pfile (using feacat -ipf htk -opf pfile -list ...) and ran a training. The basic MFCC features have 13 cepstra plus energy, and like the HTK system I augmented this with deltas. I also used a 9-frame context window for an overall net size of 252:480:56. The results are quite good, better than any other single feature I've tested:

        mfccN+d-i0 :
        WER%    Hall    Babble  Train   Car
        CLEAN     2.1     2.2     2.3     2.2
        SNR20     1.8     2.9     1.4     1.7
        SNR15     2.6     4.0     1.8     1.9
        SNR10     5.3     7.3     2.9     2.4
        SNR5     11.1    17.4     7.1     3.9
        SNR0     25.0    41.7    19.3     7.8
        SNR-5    54.6    67.9    52.1    21.6
                        Mean ratio to HTK base = 82.1%

MFCCs exactly like HTK uses (1999nov05)

When I came to write my ICASSP submission on these results, I realized that the most telling comparison between the connectionist and GMM-HTK systems would be to use the exact same features - MFCCs with deltas and double-deltas - and not to use the per-utterance normalization. So I trained one of these:

        mfcc-dd-i0 :
        WER%    Hall    Babble  Train   Car     avg
        CLEAN     1.3     1.9     1.6     1.7   1.6
        SNR20     1.7     2.8     1.8     1.8   2
        SNR15     2.4     3.6     2.4     2.1   2.6
        SNR10     4.3     5.8     3.5     2.5   4
        SNR5      9.8    14.4     6.6     3.9   8.7
        SNR0     31.6    45.8    20.7    18.5   29.2
        SNR-5    86.2    81.6    67.0    56.0   72.7
                        Mean ratio to HTK base = 84.6%

Although this actually performs a little worse overall than the mfccN+d above, as measured by the baseline ratio, it's actually much better in the clean case (and much worse in the very high noise case), which is typical of what we see when we discard per-utterance normalization. I don't know how much difference the extra double-deltas are making.

Training on clean only (1999sep15)

One of the objections raised about the AURORA test procedure is that it is a 'matched' train-test condition - that is, although we are testing on noisy digits, we are training on digits corrupted by those same kinds of noise. A far more difficult condition, similar to what we typically use at ICSI, would be to train on clean data only, then see how well recognition performed on noisy data. I tried this out for the mfcc data mentioned above:

        mfccN+d-tclean-i0 :
        WER%    Hall    Babble  Train   Car
        CLEAN     2.1     2.4     2.4     1.9
        SNR20     5.7     9.9     3.7     2.8
        SNR15    10.2    18.7     6.5     4.3
	SNR10    18.5    33.8    12.7     6.4
        SNR5                     25.8    11.2
        SNR0                                 
        SNR-5                                 
                        Mean ratio to HTK base (available points) = 221.2%

The test missed a lot of points because sometimes very poor acoustics can crash the decoder, but we can see already by this point that it's performing horribly. So I didn't pursue it any further.

Variants of PLP-based features (1999oct14)

I went up to OGI to discuss AURORA-type things with Hynek Hermansky, Sangita Sharma and others in their group. One thing that emerged was that the evaluation rules set a maximum processing latency (of a few hundred milliseconds), which precludes per-utterance normalization. So I wanted to investigate the effects of using or not using normalization. Also, if you don't use normalization, the energy feature (C0) is in general not reliable (since it will reflect an arbitrary overall gain of the signal), so it should help to exclude it (but to use its deltas, of course). Finally, the OGI people had used double-deltas too, whereas I was just using deltas, so I wanted to take a look at that. Altogether, that's 3 independent binary dimensions (segment normalization, exclusion of energy, double deltas) for a total of 8 possible test conditions. Here are the bottom lines, i.e. the grand ratios to the HTK baseline results. "N" means per-utterance normalized (mean & variance); "-e" means omitting the direct energy term (but retaining its deltas); "dd" is deltas plus double-deltas. These are all from the boot iteration of embedded training, hence "-i0". (Embedded training is giving little or no benefit for this task.)

The surprising result is that both omitting the energy term and applying per-segment normalization hurt in these tests. On reflection, per-utterance normalization is not such a good idea for very short utterances, since they don't have enough time to establish a reasonable average. And it has also been noted that the AURORA data has a very tightly normalized overall energy (as it happens - TIDIGITS includes more roving), so perhaps that's why retaining the energy feature helps.

For completeness, here's the per-condition breakdown of the best system, unnormalized double-deltas including energy:

	plp12dd-i0 :
	WER%	Hall	Babble	Train	Car	avg
	CLEAN	 1.5     2.0     1.6     1.5     1.7
	SNR20    1.8     3.0     1.5     1.6     2
	SNR15    2.8     3.7     2.1     1.6     2.6
	SNR10    5.1     6.1     3.3     2.1     4.2
	SNR5    10.1    15.3     5.9     3.6     8.7
	SNR0    26.9    48.8     17.2    20.1   28.3
	SNR-5   79.0    82.6     61.4    60.5   70.9
                         Mean ratio to HTK base = 82.4%

It's worth looking a little closer at the effects of segment normalization. For instance, compare the above results with the same system including per-utterance normalization, as shown below:

        plp12ddN-i0 :
        WER%    Hall    Babble  Train   Car     avg
        CLEAN     2.4     2.4     2.4     2.3   2.4
        SNR20     1.8     3.1     1.6     1.9   2.1
        SNR15     2.7     4.7     1.7     2.1   2.8
        SNR10     5.2     9.0     3.0     2.5   4.9
        SNR5     11.1    20.5     6.8     4.0   10.6
        SNR0     27.7    44.3    17.0     8.1   24.3
        SNR-5    53.7    71.6    48.0    21.0   48.6
                        Mean ratio to HTK base = 86.7%

What's interesting to see is that per-utterance normalization hurts quite a lot in the clean case (which probably explains why most of my neural net systems have done worse than the HTK baseline for these conditions) but does indeed help in the very high noise cases, as you would expect. Put another way, when the features are highly informative and reliable (clean data), you're better off not modifying them at all, but if they have a lot of unpredictable noise added, segment normalization can offer some help.
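
For clarity, the per-utterance normalization being discussed is just mean-and-variance normalization of each feature dimension, computed over the single utterance in hand. A minimal sketch:

	import numpy as np

	def utt_normalize(feats, eps=1e-8):
	    """Per-utterance mean and variance normalization.
	    feats: (n_frames, n_dims) for ONE utterance."""
	    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)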

Modspec without utterance normalization (1999oct19)

By the same token, I needed to rebuild my msg networks without using per-utterance normalization. This also turned out to be a win:

        msg3-i0 :
        WER%    Hall    Babble  Train   Car     avg
        CLEAN     1.2     2.1     1.8     1.8   1.7
        SNR20     1.6     2.4     1.9     1.8   1.9
        SNR15     2.6     3.9     2.2     1.8   2.6
        SNR10     4.3     7.6     3.3     2.4   4.4
        SNR5      9.5    19.0     6.1     3.5   9.5
        SNR0     24.6    46.3    16.0     8.8   23.9
        SNR-5    59.4    78.0    44.6    35.9   54.5
                        Mean ratio to HTK base = 78.4%

This compares to a mean ratio of around 86% for msg3N, although again normalization helps a little in the very high noise cases.

Orthogonalization of linear net outputs (1999oct16)

Back to HTK. Another thing I learned at OGI was that they had found PCA to be valuable when applied to their log-probability values used as feature inputs to HTK. To confirm this, I ran the pfile_klt utility over the lin-sum data which had given me my best performing system so far when used as features for an HTK training. And help it does:

        lin-sumK.log :
        WER%    Hall    Babble  Train   Car     avg
        CLEAN   0.8     0.8     0.8     0.6     .7
        SNR20   0.9     1.4     0.9     0.8     1
        SNR15   1.5     2.3     1.1     0.9     1.5
        SNR10   3.1     5.2     1.7     1.3     2.8
        SNR5    8.7     13.0    4.4     2.8     7.2
        SNR0    21.2    34.8    12.9    5.9     18.7
        SNR-5   48.9    65.6    36.8    18.7    42.5
                        Mean ratio to HTK base = 47.2%

This is a small but probably significant improvement over the 51.6% of baseline that these data gave me before KLT orthogonalization, or about a 9% relative improvement - definitely worth having!

LDA and TRAPS features from OGI (1999oct16)

One outcome of my visit to OGI was a plan to work on a combined system to run on the AURORA digits task. They have developed several different highly novel feature types (more details on the web pages of the OGI Anthropic Signal Processing Group), which, like the PLP and MSG features here, can be used as inputs to neural nets trained to phoneme labels. By training their nets to the target labels developed at ICSI, we can obtain more networks which could be combined by "linear network output summation" to generate a single feature vector to hand to HTK. To pursue this, Sangita prepared and transferred the linear net outputs for an LDA-based system and a TRAPS-based system. As a sanity check, I trained simple HTK systems based on both these data sets individually:

First, the LDA features:

	lin-lda.log2 :
        WER%    Hall    Babble  Train   Car     avg
        CLEAN   1.2     1.4     1.5     1.1     1.3
        SNR20   1.6     2.5     1.8     1.3     1.8
        SNR15   2.8     3.2     2.4     1.5     2.5
        SNR10   5.2     5.6     3.9     2.3     4.3
        SNR5    10.9    13.2    7.1     3.9     8.8
        SNR0    24.3    37.1    17.2    15.4    23.5
        SNR-5   55.7    76.2    42.3    53.4    56.9
                        Mean ratio to HTK base = 74.1%

That's a nice performance, very similar to the analogous HTK-on-linear-net-output systems based on plp12Nd (73.8% of baseline) and msg3N (71.4%).

Now the TRAPs-based outputs:

	lin-trap.log :
        WER%    Hall    Babble  Train   Car     avg
        CLEAN   3.5     4.3     3.8     3.5     3.8
        SNR20   2.7     4.0     3.2     2.1     3
        SNR15   3.9     5.5     3.6     2.3     3.8
        SNR10   6.5     9.8     5.4     3.3	6.2
        SNR5    12.4    20.7    9.7     5.6	12.1
        SNR0    28.0    44.7    20.3    14.0	26.7
        SNR-5   57.7    74.8    48.8    42.0	55.8
                        Mean ratio to HTK base = 119.5%

That's not such an impressive feature when taken alone, but of course when working with combinations, absolute performance is sometimes not as important as being able to pick out different kinds of information in critical situations. So let's see how these two features do in combination - that is, by summing the linear-output-layer network outputs that were used as features in the previous two cases. For good measure, I also applied a KLT orthogonalization to the summed data before training the HTK system:

	lin-ltK.log :
        WER%    Hall    Babble  Train   Car     avg
        CLEAN   1.2     1.2     1.1     0.9     1.1
        SNR20   1.1     1.5     1.4     0.7     1.2
        SNR15   1.9     2.7     1.9     0.8     1.8
        SNR10   3.9     5.2     2.5     1.1     3.2
        SNR5    9.0     13.5    5.4     2.8     7.7
        SNR0    21.4    34.0    14.1    7.4     19.2
        SNR-5   48.5    67.0    38.4    27.2    45.3
                        Mean ratio to HTK base = 55.5%

That's a huge win over even the LDA feature taken alone, showing that the TRAP-based information is helping a lot (assuming we're only getting about 10% relative improvement due to the KLT). So, how about combining all 4 systems (i.e. my PLP and MSG features plus the two OGI features) into one big linear sum, with KLT too of course? (Note that this actually uses the linear outputs of the non-normalizing plp12dd-e and msg3 models):

	lin-sum4K.log :
        WER%    Hall    Babble  Train   Car     avg
        CLEAN   0.7     0.9     0.8     0.7     .8
        SNR20   1.0     1.2     0.9     0.7     .9
        SNR15   1.5     1.8     1.2     0.7     1.3
        SNR10   3.1     3.8     2.1     1.1     2.5
        SNR5    8.6     10.6    4.9     2.4     6.6
        SNR0    22.4    29.8    14.0    7.8     18.5
        SNR-5   57.1    60.0    40.9    30.0    47
                        Mean ratio to HTK base = 46.9%

Even though it's our best number yet, that's a bit disappointing - it's just a hair better than the post-KLT PLP+MSG linear output sum, yet it's got something like twice as much information going in, which always helped in the past. Maybe the rather ad-hoc linear-net-output sum approach is failing us for multiple nets. Or perhaps we've reached the limit of extracting different kinds of information from this net, or at least the LDA+TRAP system is contributing little or no extra information over the PLP+MSG system.

One possibility is that the significantly weaker TRAP network is spoiling the mix at this level of performance, so I tried a 3-way sum consisting of PLP, MSG and OGI-LDA features i.e. just like above but without the TRAPS features:

	lin-pmlK.log :
        WER%    Hall    Babble  Train   Car     avg
        CLEAN   0.8     0.8     0.8     0.6     .7
        SNR20   0.9     1.4     0.9     0.9     1
        SNR15   1.5     2.0     1.3     1.0     1.5
        SNR10   3.2     4.1     2.3     1.5     2.8
        SNR5    9.2     10.9    5.5     2.8     7.1
        SNR0    23.0    30.0    14.4    8.6     19
        SNR-5   57.1    58.5    40.6    31.8    47
                        Mean ratio to HTK base = 49.9%

That's a little worse than the plp12Nd+msg3N lin-sum baseline of 47.2% from above, although combining the plp12dd-e and msg3 features actually used in this combo gives only 51.4% of baseline, so the LDA features are helping a little. I should switch to the best-performing plp12dd, but here I'm quoting the earlier plp12dd-e for comparability.

Just out of curiosity, here are what these four 'feature' streams look like, both as the posterior probabilities for which the component networks were trained, and as the linear outputs that are actually being summed to construct the features for HTK:

Comprehensive investigation of linear output summations (1999oct27)

Since the process of using sums of linear network outputs as feature inputs to the HTK Gaussian-mixture system has given such confusing results, I decided to try a more systematic investigation of the different combinations of the four major network output streams, based on the following features: plp12dd (unnormalized, deltas, double-deltas and energy), msg3 (unnormalized), and the two OGI streams, LDA and TRAPs. There are 15 non-null combinations, of which I tried only some. In all cases, neural networks trained with softmax outputs to generate posterior probabilities were run with linear output layers, and the linear outputs corresponding to the same context-independent phone classes were simply summed across the different models to give a feature vector, which was then orthogonalized with a full-rank Karhunen-Loeve transform before being used as a 24-dimension feature vector on which to train the HTK baseline system. The results below are summarized as just the overall average per-condition ratio to the HTK baseline WER.

Thus, even though the TRAP system is more than 50% worse than LDA when tested alone, it gives roughly a 10% relative improvement on the plp+msg system, whereas the LDA system actually worsens the high-performing systems to which it is added, indicating some kind of incompatibility between the features, at least in the critical boundary cases which contribute to these rather small differences in performance. The 44.9% figure is a nice result, however. The complete result matrix is given below:

	lin-p2mtK.log :
        WER%    Hall    Babble  Train   Car     avg
        CLEAN   0.6     0.8     0.8     0.6     .7
        SNR20   0.9     1.2     0.9     0.7     .9
        SNR15   1.5     1.8     1.2     0.8     1.3
        SNR10   3.4     4.3     2.1     1.0     2.7
        SNR5    8.0     11.3    4.6     1.9     6.5
        SNR0    20.5    30.6    12.1    6.6     17.5
        SNR-5   52.8    61.5    36.7    25.0    44
                        Mean ratio to HTK base = 44.9%

Effect of network size (1999oct25)

It's a little late to be checking this, but when using neural net models there is always the question of the size of the hidden layer. In truth, I've been using 480 hidden units in all cases for the rather flimsy reason that it's the 'traditional' hidden layer size for our NUMBERS95 systems (based no doubt on proper experimentation), but Aurora is both a smaller vocabulary and a wider range of acoustic conditions, so perhaps it should be larger or smaller than this. I tested two alternatives, bracketing the 480 number at 50% more (720 HU) and the same distance below on a log scale (320 HU). These compare to the baseline plp12dd neural network. Results:

So, a 50% increase in weights (720 HU) gives a 3% relative reduction in WER, which is quite small, while the 320 HU net (for which the baseline represents a 50% increase in weights) is about 9% relatively worse. Ergo, 480 HU is indeed a good compromise between model complexity and performance, at least for this input configuration (phew!).
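
For scale, the weight counts being compared are roughly input*hidden + hidden*output. A back-of-the-envelope sketch, assuming the plp12dd input of 13 features per frame, times 3 for statics/deltas/double-deltas, times the 9-frame window:

	def n_weights(n_hidden, n_in=13 * 3 * 9, n_out=56):
	    """Approximate weight count of an (n_in):(n_hidden):(n_out) net, ignoring biases."""
	    return n_in * n_hidden + n_hidden * n_out

	for h in (320, 480, 720):
	    print(h, n_weights(h))      # 320: 130240   480: 195360   720: 293040  (exact 1.5x steps)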

Recalculation of 4-stream NN combination alternatives (1999nov12)

For my other ICASSP paper, I wanted to report on the comparison between combining PLP and MSG features by posterior combination and by just training a larger net (which has been called "feature combination"). I wanted to get some additional results, but when I came to recalculate, I found I was getting slightly different results. It seems that the influence of which iteration in embedded training I use, and even which set of initial labels I use, is considerable. Although I didn't get to go into this in the paper, here is a somewhat comprehensive set of results. I can't account for the patterns, particularly which iteration and which boot labels do best, but maybe they'll make sense later. The best result in each row is shown in [brackets]. To remind you, the rasi4+q labels were my original training labels, based on the 4th iteration of retraining using Rasta features with the lexicon including the glottal stop ("q"). The msg3N+plp12Nd labels are a realignment based on posterior combination of msg and plp systems trained from the rasi4+q labels.

        Feature combination             rasi4+q labels          msg3N+plp12Nd labels
                                        itn 0      itn 1        itn 0      itn 1
        plp12N                          114.3%     [105.6%]     116.6%     113.1%
        dplp12N                         [125.6%]   127.9%       128.0%     132.9%
        msg3aN                          120.8%     [112.7%]     125.1%     125.0%
        msg3bN                          141.6%     161.0%       [133.4%]   141.3%
        plp12N*dplp12N                  98.4%      90.4%        90.5%      [89.9%]
        msg3aN*msg3bN                   91.7%      [86.0%]      88.7%      89.3%
        plp12N*msg3aN                   86.5%      [83.6%]      -          -
        plp12N*msg3bN                   85.4%      [79.5%]      -          -
        dplp12N*msg3aN                  88.0%      [87.5%]      -          -
        dplp12N*msg3bN                  86.0%      [82.6%]      -          -
        plp12N*dplp12N*msg3aN*msg3bN    76.2%      74.1%        [69.4%]    73.3%

I think some of the differences between the two trainings come not from the different boot labels but because I was still playing around with different values of the phone_deletion_penalty when I was doing some of the earlier tests. I now believe that the optimal value for this depends on the error rate of the system - which makes a fair test hard to do - but at least all the recent results have a constant value. I'm rerunning the tests on some of the early rasi4+q-based nets to see if this is true (i.e. to see if I get different results, closer to the second boot/train, or at least not so much better than it, using the current default values of aurora-test-all).


Onward to the Feature Statistics Comparison Page - Page Describing How Bootlabels Were Created.

Back to ICSI RESPITE homepage - ICSI Realization group homepage


Updated: $Date: 2000/06/01 00:23:17 $

DAn Ellis <dpwe@icsi.berkeley.edu>
International Computer Science Institute, Berkeley CA