5.6 What are delta features? How do you calculate them?

One classic criticism of the conventional statistical speech recognition framework is that it is inspired, at least, by a very shaky approximation: that subword units (such as phones) are characterized by static spectra (i.e. particular locations in spectral-feature space). In fact, inspection of almost any spectrogram will confirm that the speech spectrum is in an almost constant state of flux, and it is more sensible to conclude that it is particular patterns of change (dynamics) that characterize phones, and that the basic spectrum to which these changes are applied is almost irrelevant. A wealth of psychoacoustic evidence supports this idea.

The problem is that our statistical tools (such as Gaussian mixture models, but also neural networks) are best matched to classifying static input patterns, not the contextual variation in an input sequence. But all is not lost: if we can capture the dynamics of a signal in terms of some relatively static parameters, we can apply our classification tools to that domain instead, and everything will be good again.

That is the insight behind delta features, which estimate first and higher-order derivatives of each feature dimension as a way of converting dynamic, contextual behavior into a relatively fixed point in the new feature space. Delta features can make a huge difference in system performance - a typical result is the difference between using plain 12th-order PLP cepstra for the Aurora noisy digits task, and the same system including deltas: word error rate (WER) for medium SNR tasks is improved by 20% relative when the deltas are used. Adding double-deltas (acceleration or curvature) as well gives a further 6% relative or so - less dramatic, but still worth having, when the cost of the additional data space and model parameters is acceptable.
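To make the "additional data space" concrete, here is a minimal sketch of how deltas and double-deltas are appended to each frame. The function name add_deltas and the 13-coefficient frame size are hypothetical, and np.gradient (a simple central difference) is only a stand-in derivative estimate, not the window used by the ICSI tools.

```python
import numpy as np

def add_deltas(static, delta_order=2):
    """Append derivative streams to a (n_frames, n_dims) feature matrix.

    delta_order=0 returns the static features unchanged, 1 appends deltas,
    2 appends deltas and double-deltas. np.gradient is an illustrative
    stand-in for whatever delta filter a real front end uses.
    """
    static = np.asarray(static, dtype=float)
    streams = [static]
    for _ in range(delta_order):
        # Differentiate the previous stream along the time (frame) axis.
        streams.append(np.gradient(streams[-1], axis=0))
    # Each frame grows from n_dims to n_dims * (delta_order + 1) values.
    return np.concatenate(streams, axis=1)

# Hypothetical example: 100 frames of 13 static coefficients.
frames = np.random.default_rng(0).standard_normal((100, 13))
print(add_deltas(frames, delta_order=2).shape)
```

With double-deltas the per-frame vector triples in size, and the input layer or Gaussian dimensionality of the model grows with it; that is the cost being weighed against the extra 6% or so.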

Delta calculation at ICSI almost always means convolving the time sequence of each feature dimension with a 9-point impulse response that goes linearly from +4 to -4 (modulo a scale factor). This isn't strictly a differentiator, but its output is proportional to the slope of a least-squares straight-line fit to the points within the window.
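The convolution just described can be sketched in a few lines of NumPy. The names delta_window and calc_deltas are hypothetical (this is not the source of any ICSI program), the scale factor is chosen here so that the output is a slope per frame, and repeating the first and last frames at the edges is an assumption, since the text above does not say how the real tools handle utterance boundaries.

```python
import numpy as np

def delta_window(win_len=9):
    """Linear-ramp impulse response, e.g. +4, +3, ..., -4 for win_len=9."""
    if win_len < 3 or win_len % 2 == 0:
        raise ValueError("win_len must be an odd number >= 3")
    half = win_len // 2
    ramp = np.arange(half, -half - 1, -1, dtype=float)
    # Assumed scale factor: dividing by sum(k^2) (60 for 9 points) makes the
    # filter output equal the least-squares slope in units per frame.
    return ramp / np.sum(ramp ** 2)

def calc_deltas(feats, win_len=9):
    """Delta features for a (n_frames, n_dims) array; output has same shape."""
    feats = np.asarray(feats, dtype=float)
    half = win_len // 2
    # Assumed edge handling: repeat the first and last frames.
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    h = delta_window(win_len)
    out = np.empty_like(feats)
    for d in range(feats.shape[1]):
        # Each feature dimension is filtered independently along time.
        out[:, d] = np.convolve(padded[:, d], h, mode="valid")
    return out

# Check the least-squares claim at one interior frame of random data.
x = np.random.default_rng(0).standard_normal((30, 3))
t = 12
fit_slope = np.polyfit(np.arange(-4, 5), x[t - 4:t + 5, 1], 1)[0]
print(np.isclose(calc_deltas(x)[t, 1], fit_slope))
```

Because convolution flips the impulse response, the +4 tap multiplies the frame four steps ahead and the -4 tap the frame four steps behind, so a rising feature gives a positive delta. Double-deltas can be obtained by applying calc_deltas to its own output.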

Delta calculation can be performed by many programs. Most convenient is the on-the-fly delta calculation provided within QuickNet via the options ftrX_delta_order (0, 1 or 2) and ftrX_delta_win (9 by default for the 9-point window, but can be any size). However, delta calculation is a significant computational load, since it involves an extra 9 floating-point multiply-adds per point; on the SPERT, where delta calculation is performed via emulated floating point, this can dominate training time (!). To pre-calculate deltas, use qncopy to run just the feature processing stage of QuickNet, or use the delta_order option to feacalc, or use a stand-alone program such as calc_deltas or onl_calc_deltas.

When using the MultiSPERT setup, the feature processing is done on the host CPU (with its fast floating point), and the SPERT boards are used only for the net updates (at which they excel), so this is a very good arrangement for online delta calculation within QuickNet. This justifies the existence of the one MonoSPERT (HOP, which currently holds a single MultiSPERT-capable board).

Generated by build-faq-index on Tue Mar 24 16:18:16 PDT 2009