ICSI Speech FAQ:
5.2 What features are commonly used?

Answer by: dpwe - 2000-08-01


Pretty much all the features currently used in speech recognition are based on the short-time Fourier transform magnitude, or equivalently, the energy in a series of limited frequency bands averaged over 20-50ms time windows, calculated every 10-25ms. This gives a degree of similarity to (at least part of) the salient properties detected by the ear, which one way or another seems a good starting point for invariant representations of speech.
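To make the framing concrete, here is a minimal sketch of that short-time analysis in Python/numpy: 25ms windows every 10ms, with frame energies summed into a handful of bands. The function name, parameter values and the crude equal-width bands are illustrative assumptions only; real front ends use Mel- or Bark-warped filterbanks, as described below.

    import numpy as np

    def short_time_band_energies(x, sr, win_ms=25.0, hop_ms=10.0, n_bands=20):
        """x: 1-D waveform; returns an (n_frames, n_bands) energy array."""
        win = int(sr * win_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        window = np.hamming(win)
        frames = []
        for start in range(0, len(x) - win + 1, hop):
            frame = x[start:start + win] * window
            power = np.abs(np.fft.rfft(frame)) ** 2   # short-time power spectrum
            # crude equal-width bands; real front ends warp to Mel/Bark scales
            bands = np.array_split(power, n_bands)
            frames.append([b.sum() for b in bands])
        return np.array(frames)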

Variation between different feature types lies in areas such as the precise definition of those frequency bands, the form of nonlinearity generally used to 'compress' the energy range, and subsequent processing such as smoothing in time and frequency, adaptation/automatic gain control, and then decorrelating transformations.
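Continuing the sketch, the 'compress then decorrelate' steps might look as follows. The log compression and truncated DCT shown here are the MFCC-style choices (PLP, below, uses cube-root compression and an all-pole fit instead), and the 13-coefficient cutoff is merely a common convention.

    import numpy as np
    from scipy.fftpack import dct

    def compress_and_decorrelate(band_energies, n_ceps=13):
        """band_energies: (n_frames, n_bands); returns truncated cepstra."""
        logspec = np.log(band_energies + 1e-10)       # floor avoids log(0)
        ceps = dct(logspec, type=2, axis=-1, norm='ortho')
        return ceps[:, :n_ceps]                       # keep the broad spectral shape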

The feature types you are most likely to encounter are listed below:

Mel-frequency cepstral coefficients (MFCCs)
These are the workhorse features used most often in speech recognition. (Perversely, they are almost never used at ICSI - we prefer PLP, below.) The basic frequency resolution is based on the Mel approximation to the bandwidths of the ear's tuned resonators, which is fixed-bandwidth at low frequencies and constant-Q (bandwidth proportional to center frequency) at high frequencies. These Mel spectra are then cepstrally transformed (the discrete cosine transform of the log magnitude spectrum) and typically truncated to give a compact (8-16 element), largely decorrelated feature vector capturing the broad features of the original spectrum. MFCC is the standard representation used by the popular HTK toolkit.
Perceptual Linear Prediction (PLP)
As devised by Hermansky, PLP features smooth an auditory spectrum (based on the Bark rather than the Mel approximation, and compressed by cube root rather than logarithm) by fitting an autoregressive (all-pole or linear-prediction) model, which has the nice property of modeling spectral peaks more carefully than the less reliable valleys in between (see the sketch after this list). PLP is basically rather similar to MFCC, but often marginally outperforms it in our experience.
Rasta-PLP
Rasta (for RelAtive SpecTrAl processing) was introduced by Hermansky and Morgan as an enhancement of the basic PLP features. Subband energies are band-pass filtered in time (the so-called Rasta filter) in the log domain; a sketch of this filter appears after this list. The high-pass component of the filter normalizes away any constant average level, giving robustness to channel characteristics. The low-pass part smooths along time, giving better generality. Rasta works well for telephone tasks with lots of channel variability, but we have seen plain PLP performing better on recent tasks such as Broadcast News and Aurora.
j-Rasta
This variant of Rasta performs the filtering in the log(E + k) domain, which is channel-normalizing like conventional Rasta when the energy E is large, but tends toward normalization of fixed energy offsets (e.g. stationary background noise) when E is small. The trick is in setting the constant term k (also written J, hence j-Rasta) so that the transition falls at the right point, which needs a background noise estimate. Largely due to this complication, I think, j-Rasta hasn't been used much at ICSI in recent years, but Andy Morris at IDIAP has used it quite a lot with good results.
Modulation-filtered spectrogram (MSG)
Brian Kingsbury's thesis work took the success of Rasta processing, tried to reconstruct it from solid auditory principles, and optimized it through extensive empirical testing. The MSG features filter subband energies to emphasize particular modulation frequencies that have been shown experimentally to be crucial to speech perception; the spectra are then adaptively normalized through automatic gain control. Brian's thesis showed these features to be helpful in reverberation and noise, and indeed better than other features for NUMBERS95 recognition. On other tasks they typically give worse baseline recognition, but they are excellent candidates for combination in multistream systems, where their distinct difference from standard features (more sluggish, different scaling) makes them a very complementary information source. MSG classically does not include a final decorrelating transformation, so the representation is typically two bands of spectral (rather than cepstral) coefficients.
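For the all-pole smoothing at the heart of PLP, here is a minimal sketch under some simplifying assumptions: it fits a low-order autoregressive model to one frame of a (nonnegative) auditory spectrum via the autocorrelation method. The model order of 8 and the function name are illustrative, and Hermansky's full recipe also applies Bark warping, equal-loudness weighting and cube-root compression before this step.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def allpole_fit(auditory_spectrum, order=8):
        """Fit an AR model to one frame's auditory (power) spectrum."""
        # inverse FFT of the symmetrized power spectrum gives the autocorrelation
        spec = np.concatenate([auditory_spectrum, auditory_spectrum[-2:0:-1]])
        r = np.fft.ifft(spec).real[:order + 1]
        # solve the Toeplitz normal equations (Levinson-Durbin would also do)
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        lpc = np.concatenate([[1.0], -a])        # prediction polynomial
        gain = r[0] - np.dot(a, r[1:order + 1])  # residual (model) power
        return lpc, gain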
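And here is a sketch of the Rasta filtering step mentioned above: each log subband-energy trajectory is band-pass filtered along time with a simple IIR filter. The coefficients below follow the commonly quoted transfer function 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.94 z^-1); treat the exact pole value (0.94 vs. the 0.98 of the original paper) as an implementation choice.

    import numpy as np
    from scipy.signal import lfilter

    def rasta_filter(log_bands):
        """log_bands: (n_frames, n_bands) log subband energies."""
        numer = np.array([0.2, 0.1, 0.0, -0.1, -0.2])  # band-pass FIR part
        denom = np.array([1.0, -0.94])                 # leaky integrator pole
        return lfilter(numer, denom, log_bands, axis=0)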

Beyond the basic feature types there are many variants: the number of parameters used (effectively, the degree of detail in the cepstral coefficients), whether delta and double-delta (first- and second-order derivative) features are appended, different kinds of normalization (mean, variance, per-utterance, online), the sampling rate of the underlying waveforms, the frame rate, the window size, and many others, specified as options to feature calculation programs like feacalc.
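As an example of one such variant, delta and double-delta features are typically computed by a short linear regression over neighboring frames and appended to the base vector. A minimal sketch, assuming the usual 5-frame (+/-2) regression window; the window length and function name are illustrative:

    import numpy as np

    def add_deltas(feats):
        """feats: (n_frames, n_dims); returns (n_frames, 3*n_dims)."""
        def deltas(f):
            p = np.pad(f, ((2, 2), (0, 0)), mode='edge')
            # regression slope over 5 frames: sum_k k*(c[t+k] - c[t-k]) / 10
            return (2 * (p[4:] - p[:-4]) + (p[3:-1] - p[1:-3])) / 10.0
        d = deltas(feats)
        return np.concatenate([feats, d, deltas(d)], axis=1)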


Previous: 5.1 What are features? What are their desirable properties? - Next: 5.3 How do you calculate rasta and/or plp features?
Back to ICSI Speech FAQ index
