Although our experiments with neural-network classifiers have shown the modulation-filtered spectrogram (MSG) features to be rather useful, when we use those same features with a standard HTK system (i.e. a Gaussian mixture acoustic model), performance is significantly worse than with the standard Mel-frequency cepstral coefficient (MFCC) features. Why should this be?
One of the advantages of using neural network (NN) classifiers is that they are much less sensitive to the statistical properties of features than Gaussian models. In particular, NNs are relatively unperturbed by correlation among feature channels, whereas Gaussian models often assume uncorrelated features (i.e. 'diagonal covariance') to improve efficiency. Although Gaussian mixture models can in theory match any distribution, a finite number of mixture components will do a better job of modelling a more Gaussian distribution. The MSG features are spectral, that is, each channel is based on the energy in a specific frequency band, meaning that adjacent channels are likely to be highly correlated, since energy is typically spread across more than one frequency band.
A more subtle assumption of Gaussian models is that the underlying data distributions are Gaussian. When the data has a significantly non-Gaussian distribution, the likelihoods will be systematically biased. NNs make no such assumption, so should be able to adapt to a wide range of distributions.
To confirm suspicions that these factors were at work, I calculated the covariance matrix and some feature histograms for both the MSG features and the HTK-generated MFCC features. (Note to local users: to get the HTK features into Matlab, I adapted libquicknet to support reading 'compressed' HTK files, which I think worked, and I adapted feacat to accept a list of input file names.) I calculated these statistics from a few hundred of the clean AURORA digits utterances, which gave about 100,000 feature vectors.
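As a rough illustration of the kind of statistics described above, here is a minimal NumPy sketch. It is not the Matlab code actually used: the function names are hypothetical, and the data are synthetic stand-ins (independent channels versus channels smeared into their neighbours), not real MFCC or MSG features.

```python
import numpy as np

def feature_stats(feats, n_bins=50):
    """Covariance, correlation and per-channel histograms for an
    (n_frames, n_channels) feature array."""
    cov = np.cov(feats, rowvar=False)
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    hists = [np.histogram(feats[:, ch], bins=n_bins)
             for ch in range(feats.shape[1])]
    return cov, corr, hists

def mean_abs_offdiag(corr):
    """Average magnitude of the off-diagonal correlations."""
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.abs(corr[mask]).mean())

# Synthetic stand-ins (assumption: NOT real MFCC or MSG data).
rng = np.random.default_rng(0)
n_frames, n_chan = 100_000, 14
independent = rng.standard_normal((n_frames, n_chan))

# Smear each channel into its neighbours to mimic band energies that
# spread across adjacent frequency channels.
smear = (0.5 * np.eye(n_chan)
         + 0.25 * np.eye(n_chan, k=1)
         + 0.25 * np.eye(n_chan, k=-1))
spectral_like = independent @ smear

_, corr_indep, _ = feature_stats(independent)
_, corr_spec, hists_spec = feature_stats(spectral_like)
print("independent channels:", mean_abs_offdiag(corr_indep))
print("smeared channels:    ", mean_abs_offdiag(corr_spec))
```

The smeared channels show a band of strong correlation next to the diagonal, which is the simplest version of the structure expected from spectral features.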
The contrast between the covariance matrices is rather stark. The MFCCs seem almost perfectly uncorrelated (although there are small residual off-diagonal correlations that don't show up in the plot, and, as Jeff pointed out, it is really the class-conditional correlations that matter, not the grand average). The MSG features are very strongly correlated, not only in a central band (as we expect from their spectral nature), but even far off-diagonal in a complex and interesting structure. Clearly, diagonal Gaussians used to model this data would be sorely inadequate. However, the discrete cosine transform (or, better yet, the Karhunen-Loeve transform) can easily decorrelate the data (e.g. with pfile_klt).
The feature histograms are more of a surprise. Firstly, the MFCC distribution doesn't look exactly Gaussian - it seems to be tending towards a rectangular distribution. Some of the other channels looked even more flat in the middle. The MSG channel though is much more complex: element 2 is close to bimodal. Other channels varied, although the low-frequency channels all had this strong bimodal tendency around zero, reflecting the voiced/unvoiced dichotomy of natural speech. Even the higher channels, with a more symmetric, unimodal structure, were still very 'peaky', with long tails, making a Gaussian a poor fit. These irregularities can be corrected using a nonlinear monotonic transformation based on the actual histograms, e.g. as performed by pfile_gaussian.
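The idea of a monotonic warp derived from the observed distribution can be sketched as follows. This is a minimal illustration, assuming a quantile-table approach; it is not a description of how pfile_gaussian works internally, the function names are hypothetical, and the bimodal, peaky data are synthetic.

```python
import numpy as np
from statistics import NormalDist

def fit_gaussian_warp(train_col, n_points=513):
    """Build a lookup table mapping the empirical quantiles of one
    channel onto the matching standard-normal quantiles."""
    probs = (np.arange(n_points) + 0.5) / n_points
    src = np.quantile(train_col, probs)
    dst = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return src, dst

def apply_gaussian_warp(col, src, dst):
    """Monotonic piecewise-linear warp; values outside the training
    range are clamped to the end points of the table."""
    return np.interp(col, src, dst)

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3.0)

# Synthetic channel (assumption): two sharp peaks with long tails.
rng = np.random.default_rng(1)

def bimodal_peaky(n):
    centres = rng.choice([-2.0, 2.0], size=n)
    return centres + rng.laplace(scale=0.3, size=n)

train = bimodal_peaky(50_000)
test = bimodal_peaky(20_000)

# Fit the warp on training data only, then reuse it on other data.
src, dst = fit_gaussian_warp(train)
warped = apply_gaussian_warp(test, src, dst)
print("excess kurtosis before:", excess_kurtosis(test))
print("excess kurtosis after: ", excess_kurtosis(warped))
```

Because the warp is applied to each channel independently, it fixes the marginal distributions only; it does nothing by itself about correlation between channels.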
To remove the correlation problem, I used Jeff's excellent pfile_klt to perform an orthogonalization of the data. I computed the transform that exactly orthogonalizes the training data, then used the same matrix on all the test sets. Here are the resulting covariance and histogram for the first 20,000 frames of the CLEAN test set after this processing. Note that the covariance is essentially diagonal, and the histogram actually looks better behaved, although still much more peaky than a Gaussian.
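The train-then-apply procedure can be sketched in a few lines of NumPy. This is an illustrative eigen-decomposition of the training covariance, not the internals of pfile_klt; the function names and the synthetic correlated data are assumptions.

```python
import numpy as np

def fit_klt(train):
    """Estimate the KLT (PCA) basis from training frames only."""
    mean = train.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(train, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # largest variance first
    return mean, eigvecs[:, order], eigvals[order]

def apply_klt(feats, mean, basis):
    """Project any data set onto the basis fitted on the training data."""
    return (feats - mean) @ basis

def max_abs_offdiag_corr(x):
    corr = np.corrcoef(x, rowvar=False)
    np.fill_diagonal(corr, 0.0)
    return float(np.abs(corr).max())

# Synthetic correlated data (assumption: stand-in for real features).
rng = np.random.default_rng(2)
n_chan = 14
mixing = rng.standard_normal((n_chan, n_chan))
train = rng.standard_normal((100_000, n_chan)) @ mixing
test = rng.standard_normal((20_000, n_chan)) @ mixing

mean, basis, eigvals = fit_klt(train)
train_klt = apply_klt(train, mean, basis)
test_klt = apply_klt(test, mean, basis)

print("train, before:", max_abs_offdiag_corr(train))
print("train, after: ", max_abs_offdiag_corr(train_klt))
print("test, after:  ", max_abs_offdiag_corr(test_klt))
```

The training set comes out decorrelated to numerical precision, while a separate test set is only approximately decorrelated, since its sample covariance differs slightly from the one the basis was fitted to.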
And here are the first six eigenvectors (the basis vectors accounting for most of the variance in the original data). The vectors have constant offsets of 0, 1, 2... to separate them on the plot:
There are 28 elements in these vectors because I did the KLT over both the msg3a and msg3b banks (i.e. the two modulation-domain filters) at once. This is a little unsatisfying, since it 'yokes' the spectral bases within each set to have the same coefficient. But if I'd done KLTs on msg3a and msg3b separately, there would have been residual correlation between the two decorrelated banks (which makes me think: what is the correlation like between mfccs and delta(mfcc)s - probably not zero!).
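The point about residual correlation between separately decorrelated banks can be checked on toy data. In this minimal sketch the two 14-channel banks are synthetic (a shared smoothed component plus independent noise), standing in for msg3a and msg3b only by assumption.

```python
import numpy as np

def klt(x):
    """Decorrelate x with the eigenvectors of its own covariance."""
    xc = x - x.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(xc, rowvar=False))
    return xc @ eigvecs[:, np.argsort(eigvals)[::-1]]

# Two synthetic banks sharing a common component (assumption).
rng = np.random.default_rng(3)
n_frames, n_chan = 20_000, 14
smear = (0.5 * np.eye(n_chan)
         + 0.25 * np.eye(n_chan, k=1)
         + 0.25 * np.eye(n_chan, k=-1))
shared = rng.standard_normal((n_frames, n_chan)) @ smear
bank_a = shared + 0.5 * rng.standard_normal((n_frames, n_chan))
bank_b = shared + 0.5 * rng.standard_normal((n_frames, n_chan))

# Option 1: one KLT per bank, then concatenate.
separate = np.hstack([klt(bank_a), klt(bank_b)])
# Option 2: a single KLT over all 28 channels.
joint = klt(np.hstack([bank_a, bank_b]))

corr_sep = np.corrcoef(separate, rowvar=False)
within_a = corr_sep[:n_chan, :n_chan] - np.eye(n_chan)
cross = corr_sep[:n_chan, n_chan:]
corr_joint = np.corrcoef(joint, rowvar=False) - np.eye(2 * n_chan)

print("separate KLTs, within bank a:", np.abs(within_a).max())
print("separate KLTs, across banks: ", np.abs(cross).max())
print("joint KLT, all pairs:        ", np.abs(corr_joint).max())
```

Each bank is diagonal on its own after its separate KLT, but the cross-bank block is not, whereas the joint transform removes both.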
You can see that the vectors are pretty close to a DCT set. The first vector just provides an average level. The second vector makes one bank negative while making the other positive (or vice-versa), and the third adjusts the slope within each bank. Vectors 4, 5 and 6 look somewhat like cosine bases with 1, 1.5 and 2 cycles per bank.
The next modification to try is the Gaussian warp. Because of space limitations, I tried this on the KLT processed features above first. Here's the resulting covariance and histogram:
Fortunately, the Gaussian warping hasn't hurt the decorrelation appreciably. The histogram looks very Gaussian, as expected. The raggedness probably comes from the smaller sample (10,000 frames) used to calculate these plots.
Here's the Gaussian warp applied directly to the msg3N features:
And finally, here are the plots for that data after a Karhunen-Loeve transform, to decorrelate them:
Unfortunately, as shown on the results page, none of these performed particularly well as recognition features.
Since the neural-net models based on MSG and PLP features were performing so much better than I could get from HTK models, and based on ideas from Sangita Sharma and Hynek Hermansky of OGI, I tried using the posterior probabilities of the 24 distinct phoneme classes, i.e. the output of the neural-net classifiers (or actually the geometric mean of two nets, the plp12Nd and the msg3N ones) as input features to the HTK system. As you can see, they are anything but Gaussian distributed. However, the resulting HTK systems performed very well.
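Combining two nets by the geometric mean of their posteriors can be sketched as below. The function names are hypothetical, the posteriors are random stand-ins for the plp12Nd and msg3N outputs, and the renormalization step is an assumption (the text does not say whether the combined values were rescaled to sum to one).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def geometric_mean_posteriors(post_a, post_b, floor=1e-10):
    """Per-class geometric mean of two posterior streams, computed in
    the log domain and renormalized per frame (assumption)."""
    log_mean = 0.5 * (np.log(np.maximum(post_a, floor))
                      + np.log(np.maximum(post_b, floor)))
    combined = np.exp(log_mean)
    return combined / combined.sum(axis=1, keepdims=True)

# Random stand-ins for the two nets' outputs over 24 classes.
rng = np.random.default_rng(4)
n_frames, n_classes = 1_000, 24
post_plp = softmax(2.0 * rng.standard_normal((n_frames, n_classes)))
post_msg = softmax(2.0 * rng.standard_normal((n_frames, n_classes)))

combined = geometric_mean_posteriors(post_plp, post_msg)
print(combined.shape, combined.sum(axis=1)[:3])
```

Working in the log domain means a class that either net considers very unlikely stays unlikely in the combination, which is the usual motivation for a geometric rather than arithmetic mean.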
Here are the covariance and example feature histograms for the linear probabilities:
The histograms are a little hard to see: basically, the values vary only between 0 and 1, and spend most of their time very close to zero. I included the h# (i.e. silence or nonspeech) element (23) because it is the only one that spends a significant proportion of its time away from zero, giving the bimodal distribution you see.
Here are the covariance and histograms for log(prob) features, i.e. each element is just the log of the corresponding element in the set used above:
These don't look so different; the bunch close to p=0 has been spread out a little, but the skew is still extreme. The unevenness of the distributions comes from the quantization used in storing this data - it's originally derived from a byte-encoding, linear in the log-probability domain (the well-known "LNA" file format).
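The effect of byte quantization on the histograms can be reproduced with a toy encoder. The sketch below uses a hypothetical 8-bit code that is linear in log-probability; the floor value and scaling are illustrative only and are not taken from the actual LNA format.

```python
import numpy as np

LOG_FLOOR = -12.0              # hypothetical lowest stored log-probability
LOG_SPAN = 0.0 - LOG_FLOOR     # range covered by the byte code
N_LEVELS = 256                 # one byte per value

def quantize_logprob(logp):
    """Map log-probabilities onto 0..255, linearly in the log domain."""
    clipped = np.clip(logp, LOG_FLOOR, 0.0)
    codes = np.round((clipped - LOG_FLOOR) / LOG_SPAN * (N_LEVELS - 1))
    return codes.astype(np.uint8)

def dequantize_logprob(codes):
    return LOG_FLOOR + codes.astype(np.float64) / (N_LEVELS - 1) * LOG_SPAN

# Synthetic log-posteriors over 24 classes (assumption).
rng = np.random.default_rng(5)
logits = 2.0 * rng.standard_normal((50_000, 24))
logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

stored = dequantize_logprob(quantize_logprob(logp[:, 0]))
levels = np.unique(stored)

# Any strictly monotonic map keeps the same number of distinct values,
# so the stored data can never be spread into a smooth distribution.
warped = np.exp(stored)
print("frames:", stored.size, "distinct values:", levels.size)
print("distinct values after a monotonic map:", np.unique(warped).size)
```

With at most 256 distinct values, a histogram whose bins do not line up with the quantization steps shows a comb-like unevenness, and a per-channel monotonic warp can only move those spikes around.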
I did also try Gaussianizing these with pfile_gaussian, but it fared very badly because of the quantization:
Jeff Bilmes suggested that rather than trying to Gaussianize the softmax-ed net outputs, I should try just the pre-softmax outputs of the net, which I was able to generate by running the forward pass with the final output type as "linear" rather than "softmax" (I had to modify qnsfwd to accept this, but it's supported in the library). These results do look somewhat pleasantly behaved (at least compared to the probability or log-prob values), and they even work well as features to HTK, although it's not clear how best to combine multiple nets (linear addition seems the first thing to try). Here are the stats for the linear output version of the plp12Nd network:
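As a side note on how the pre-softmax ("linear") outputs relate to the posteriors, the minimal sketch below uses random values as stand-ins for real net outputs (an assumption) and checks two identities: the log-posteriors equal the linear outputs minus a per-frame constant, and the softmax of the averaged linear outputs of two nets equals the renormalized geometric mean of their posteriors.

```python
import numpy as np

def softmax(linear_out):
    e = np.exp(linear_out - linear_out.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def skewness(x):
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# Random stand-ins for the linear outputs of two nets (assumption).
rng = np.random.default_rng(6)
n_frames, n_classes = 20_000, 24
linear_a = 2.0 * rng.standard_normal((n_frames, n_classes))
linear_b = 2.0 * rng.standard_normal((n_frames, n_classes))
post_a, post_b = softmax(linear_a), softmax(linear_b)

# Identity 1: log posterior = linear output - (per-frame constant).
offset = linear_a - np.log(post_a)

# Identity 2: averaging linear outputs, then softmax, gives the
# renormalized geometric mean of the two posterior streams.
geo = np.sqrt(post_a * post_b)
geo /= geo.sum(axis=1, keepdims=True)
from_linear = softmax(0.5 * (linear_a + linear_b))

print("skewness of a linear output:", skewness(linear_a[:, 0]))
print("skewness of its posterior:  ", skewness(post_a[:, 0]))
print("identity 2 holds:", np.allclose(geo, from_linear))
```

On this reading, linear addition of the pre-softmax outputs is the linear-domain counterpart of the geometric-mean combination of posteriors, up to the per-frame normalization.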