Since I am interested in using spectral rather than cepstral features, I ran a series of training experiments over a range of feature styles to see how much the error varied. The surprising result was that the variation was not significant. I don't know if I believe it!
But here are the results anyway. Note that the baseline is rather high for the NUMBERS95 task: others get about 6.5% from embedded training. One problem might be the dictionary (pronunciations) I was using, which was some random one I picked up off the floor.
This work was originally mentioned in my status report of 1997oct03. The gory details are in my NOTES file from this work.
Feature set          | Net size (in/hid/out) | dev WER % (sub/del/ins)
---------------------|-----------------------|------------------------
rasta-plp-cepstra    | 243/500/56            | 7.0 (4.0/1.3/1.7)
rasta-plp-logspectra | 405/500/56            | 6.6 (3.7/1.2/1.7)
rasta-cepstra        | 243/500/56            | 6.9 (3.8/1.4/1.7)
rasta-logspectra     | 405/500/56            | 6.8 (3.9/1.3/1.6)
plp-cepstra          | 243/500/56            | 6.9 (3.7/1.8/1.4)
plp-logspectra       | 405/500/56            | 7.1 (3.8/1.7/1.6)
cepstra              | 243/500/56            | 6.8 (3.7/1.6/1.5)
logspectra           | 405/500/56            | 7.0 (3.8/1.6/1.6)
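For reference, the cepstral variants are essentially truncated DCTs of the corresponding log-spectral features, which is why the cepstral nets have fewer inputs. A minimal sketch of that relationship (the function name and dimensions here are illustrative, not taken from the actual feature-calculation code):

```python
import numpy as np

def logspectra_to_cepstra(logspec, n_ceps):
    """Project log-spectral frames onto a truncated DCT-II basis.

    logspec: array of shape (..., n_bins); returns (..., n_ceps).
    """
    n_bins = logspec.shape[-1]
    k = np.arange(n_ceps)[:, None]          # cepstral coefficient index
    t = np.arange(n_bins) + 0.5             # spectral bin centres
    basis = np.cos(np.pi * k * t / n_bins)  # DCT-II basis rows
    return logspec @ basis.T

# A flat log spectrum puts all its energy in the 0th cepstral coefficient:
flat = np.ones(20)
print(logspectra_to_cepstra(flat, 4))  # ≈ [20., 0., 0., 0.]
```

Truncating to a few coefficients is what smooths the spectral envelope; feeding the untruncated log spectra to the net instead lets it choose its own smoothing.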
Significance at the 5% level on this test set requires a difference of 0.88% (n=4680), so none of these numbers differ significantly from one another. In each case, I ran 4 iterations of embedded training (i.e. relabelling and retraining) and took the best result.
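The 0.88% threshold is consistent with a standard two-proportion approximation (I'm assuming that's the method; it isn't recorded here): the standard error of the difference between two independent error rates p measured over n tokens is sqrt(2p(1-p)/n), and two-sided 5% significance requires roughly 1.96 standard errors. A quick sketch:

```python
import math

def wer_sig_threshold(p, n, z=1.96):
    """Approximate error-rate difference needed for two-sided 5%
    significance (z = 1.96), treating the two systems' error rates
    as independent binomials of about p over n tokens."""
    se_diff = math.sqrt(2 * p * (1 - p) / n)  # SE of the difference
    return z * se_diff

# With an assumed error rate of about 5% and n = 4680:
print(round(100 * wer_sig_threshold(0.05, 4680), 2))  # -> 0.88
# At the ~7% rates in the table above, the threshold is nearer 1%:
print(round(100 * wer_sig_threshold(0.07, 4680), 2))  # -> 1.03
```

Either way, the 0.5% spread across the eight feature sets is well inside the noise.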
For your pleasure, here are images of the feature representations under the eight transforms:
[Image grid, one panel per transform: rows = base feature (rasta-plp, rasta, plp, plain); columns = cepstra, log spectra.]