by David Gelbart
Here are my thoughts about how the tandem Gabor approach relates to the TRAPS approach and to the HATS approach introduced by Barry Chen et al. at EUROSPEECH 2003. This is a slightly edited version of an email I sent out to collaborators. Because of its origin as an internal email, it may not be completely clear to all readers. Feel free to contact me if you would like any clarifications, or if you would like to know about the current status of any of the work described here.
The HATS approach is a descendant of TRAPS. In TRAPS a set of multi-layer perceptrons (MLPs) is trained, with each MLP having as input Mel- or Bark-scale spectral energy/magnitude values (computed at, say, a 100 Hz frame rate) in one critical band over a long time trajectory (say, 1 s), and then a merger MLP combines the decisions of the critical band MLPs. In tri-band TRAPS, each MLP in the set looks at three adjacent critical bands instead of one critical band -- this sometimes improves performance.
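To make the input to each critical-band MLP concrete, here is a minimal sketch of how the long-term trajectories could be cut out of a Mel-spectrogram. The function name `traps_inputs`, the edge-frame padding, and the array sizes are my own assumptions for illustration, not details of any particular TRAPS implementation.

```python
import numpy as np

def traps_inputs(mel_spec, band, context=50, n_adjacent=0):
    """Return one long-term trajectory vector per frame for a critical band.

    mel_spec:   array of shape (n_frames, n_bands) of critical-band energies.
    context:    frames on each side of the center frame
                (50 gives 101 frames, about 1 s at a 100 Hz frame rate).
    n_adjacent: 0 for TRAPS, 1 for tri-band TRAPS (one neighbor on each side).
                Bands at the spectrum edge simply get fewer neighbors here.
    """
    n_frames, n_bands = mel_spec.shape
    lo = max(band - n_adjacent, 0)
    hi = min(band + n_adjacent, n_bands - 1)
    # Pad in time by repeating the edge frames so every frame has full context
    # (an assumption; other edge handling is possible).
    padded = np.pad(mel_spec[:, lo:hi + 1], ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    # windows[t] holds frames t-context .. t+context of the selected bands.
    windows = np.stack([padded[t:t + width] for t in range(n_frames)])
    return windows.reshape(n_frames, -1)

rng = np.random.default_rng(0)
mel_spec = rng.standard_normal((300, 15))   # random stand-in for a real spectrogram
single = traps_inputs(mel_spec, band=5)
tri = traps_inputs(mel_spec, band=5, n_adjacent=1)
print(single.shape, tri.shape)  # (300, 101) (300, 303)
```

Each row of `single` would be the input to the band-5 MLP for one frame; the tri-band version is three times as wide because it stacks the trajectories of bands 4, 5 and 6.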
In HATS and tri-band HATS, the merger MLP doesn't combine the outputs of the MLPs in the set. Instead, after training the MLPs in the set, only their hidden unit values are used (the output units are ignored). So, the merger MLP takes as input all the hidden unit values. (If you have the Duda and Hart book, it has a nice discussion of how the weights on the links from the input units to a hidden unit in an MLP can be seen as a pattern mask for the patterns that hidden unit is sensitive to.)
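The difference between the two merger inputs can be sketched as follows. This is only an illustration under stated assumptions: the weights are random stand-ins for trained band MLPs, the layer sizes are hypothetical, and I pass plain softmax posteriors to the TRAPS merger even though real systems may use some other form of the band outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def band_mlp_forward(x, w_in, b_in, w_out, b_out):
    """Forward pass of one band MLP; returns (hidden activations, class posteriors)."""
    hidden = sigmoid(x @ w_in + b_in)
    posteriors = softmax(hidden @ w_out + b_out)
    return hidden, posteriors

rng = np.random.default_rng(0)
# Hypothetical sizes, chosen only for the example.
n_frames, n_bands, traj_len, n_hidden, n_classes = 10, 15, 101, 20, 40

traps_merger_input, hats_merger_input = [], []
for band in range(n_bands):
    x = rng.standard_normal((n_frames, traj_len))   # stand-in for one band's trajectories
    # Random weights stand in for a trained band MLP.
    # Column j of w_in is the temporal "pattern mask" of hidden unit j.
    w_in = rng.standard_normal((traj_len, n_hidden)) * 0.1
    w_out = rng.standard_normal((n_hidden, n_classes)) * 0.1
    hidden, posteriors = band_mlp_forward(
        x, w_in, np.zeros(n_hidden), w_out, np.zeros(n_classes)
    )
    traps_merger_input.append(posteriors)   # TRAPS: merger sees the band MLP outputs
    hats_merger_input.append(hidden)        # HATS: merger sees the hidden unit values

traps_merger_input = np.hstack(traps_merger_input)
hats_merger_input = np.hstack(hats_merger_input)
print(traps_merger_input.shape, hats_merger_input.shape)  # (10, 600) (10, 300)
```

In the HATS case the output layer (`w_out`) is needed only while training the band MLP; at merger time it is discarded.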
Meanwhile, in Michael's Gabor filtering approach, a set of two-dimensional filters is picked by the FFNN feature selection process. This starts with a random feature vector and then uses a linear net (linear perceptron) to rapidly evaluate possible changes to it, repeatedly replacing the filter in the set that appears least useful for classification with a randomly chosen one. We have been applying these filters to Mel-spectrograms (the same things that the TRAPS approaches are looking at in thin temporal slices) and feeding the results to an MLP for classification.
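For readers who have not seen a two-dimensional Gabor filter applied to a spectrogram, here is a minimal sketch. The parameterization (Gaussian envelope, modulation frequencies in radians per frame and per band), the zero padding, and the choice of the real part as the feature are assumptions made for the example, not necessarily the settings used in our experiments.

```python
import numpy as np

def gabor_filter(n_t, n_f, omega_t, omega_f, sigma_t, sigma_f):
    """Complex 2-D Gabor filter of shape (n_t, n_f).

    A Gaussian envelope multiplied by a complex sinusoid.
    omega_t, omega_f: modulation frequencies (radians per frame / per band).
    sigma_t, sigma_f: envelope widths (frames / bands).
    """
    t = np.arange(n_t) - (n_t - 1) / 2.0
    f = np.arange(n_f) - (n_f - 1) / 2.0
    tt, ff = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-0.5 * ((tt / sigma_t) ** 2 + (ff / sigma_f) ** 2))
    carrier = np.exp(1j * (omega_t * tt + omega_f * ff))
    return envelope * carrier

def gabor_feature(mel_spec, filt, center_band):
    """Correlate the filter with the spectrogram along time, centered on one band.

    Assumes odd n_t and n_f so the filter has a well-defined center.
    Returns one real-valued feature per frame (real part of the response).
    """
    n_frames, _ = mel_spec.shape
    n_t, n_f = filt.shape
    half_t, half_f = n_t // 2, n_f // 2
    # Zero-pad so the filter can be centered on every frame and on edge bands.
    padded = np.pad(mel_spec, ((half_t, half_t), (half_f, half_f)))
    out = np.empty(n_frames)
    for t in range(n_frames):
        patch = padded[t:t + n_t, center_band:center_band + n_f]
        out[t] = np.real(np.sum(patch * np.conj(filt)))
    return out

rng = np.random.default_rng(0)
mel_spec = rng.standard_normal((200, 23))   # random stand-in for a Mel-spectrogram
filt = gabor_filter(n_t=25, n_f=7, omega_t=2 * np.pi / 25, omega_f=2 * np.pi / 7,
                    sigma_t=6.0, sigma_f=2.0)
feature = gabor_feature(mel_spec, filt, center_band=10)
print(filt.shape, feature.shape)  # (25, 7) (200,)
```

A feature vector for the tandem MLP would stack many such per-frame outputs, one for each filter and center band that the selection process keeps.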
So, both Gabor filtering (used with an MLP) and tri-band HATS are two-stage processes: first a discriminatively trained set of spectrotemporal masks (Gabor filters or input-layer-to-hidden-unit weights) is applied to the signal, and then the result is passed to a phonetically classifying MLP (the Gabor tandem MLP or the HATS merger MLP).
In fact, I wonder if Barry was partly inspired by the tandem Gabor approach when he created the HATS? (UPDATE: No.)
This connection between tandem Gabor and tri-band HATS is relevant to me because in the fall I was trying to improve the performance of a multi-stream system which already included PLP and HATS or TRAPS by adding a Gabor stream. My thought at the time was that this should work because the Gabor stream is doing a different kind of classification. But maybe it is not different enough, because of the similarities mentioned above. (Frame accuracy did go up significantly, but word accuracy only went up by a tiny amount, and if the tri-band HATS were used maybe the change in frame accuracy would have been less.) It's still true that the Gabor filters can go across any number of frequency bands, while the HATS patterns only go over three bands, and PLP or MFCC uses all bands at once, but I am not sure how much this matters, especially considering that the HATS merger MLP is looking at the output of more than one HATS hidden unit activation simultaneously.
So maybe the best track for me to take regarding using the tandem Gabor in that multi-stream system is to focus on something more specific to the Gabor approach. One example is selecting Gabor filters that are complementary to the information in the non-Gabor streams (e.g., PLP features): keep that information in the feature vector during the FFNN feature selection, so that the selection favors Gabor filters that add something to it. This seems like a way to use the FFNN approach to pick features that give complementary information, and I don't see any way to do that in HATS training (hmmm, perhaps by using boosting to re-weight the training labels?).
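The idea of keeping the non-Gabor information in the feature vector during selection can be sketched roughly as below. This is a simplified stand-in, not the actual FFNN procedure: the least-squares linear classifier, the "error without this feature" relevance measure, and all names (`linear_error`, `select_complementary`) are my assumptions for illustration, and the data are synthetic.

```python
import numpy as np

def linear_error(features, targets):
    """Training mean squared error of a least-squares linear classifier with bias."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(x, targets, rcond=None)
    return float(np.mean((x @ weights - targets) ** 2))

def select_complementary(fixed, candidates, targets, n_select, n_iter, rng):
    """Pick n_select columns of `candidates` while `fixed` always stays in the feature vector."""
    n_cand = candidates.shape[1]
    selected = [int(i) for i in rng.choice(n_cand, size=n_select, replace=False)]
    best = list(selected)
    best_err = linear_error(np.hstack([fixed, candidates[:, best]]), targets)
    for _ in range(n_iter):
        # Relevance of each selected feature: error of the set without it,
        # always evaluated with the fixed (e.g., PLP) features present.
        errs_without = [
            linear_error(
                np.hstack([fixed, candidates[:, [c for c in selected if c != s]]]),
                targets,
            )
            for s in selected
        ]
        # The least useful feature is the one whose removal hurts the least.
        weakest = int(np.argmin(errs_without))
        unused = [c for c in range(n_cand) if c not in selected]
        selected[weakest] = int(rng.choice(unused))   # random replacement
        err = linear_error(np.hstack([fixed, candidates[:, selected]]), targets)
        if err < best_err:
            best, best_err = list(selected), err
    return sorted(best), best_err

rng = np.random.default_rng(0)
n = 500
fixed = rng.standard_normal((n, 2))           # stand-in for PLP features
candidates = rng.standard_normal((n, 10))     # stand-in for Gabor filter outputs
candidates[:, 0] = fixed[:, 0]                # candidate 0 is redundant with the fixed stream
score = fixed[:, 0] + candidates[:, 3] + candidates[:, 7]
targets = np.stack([score > 0, score <= 0], axis=1).astype(float)   # one-hot, two classes
chosen, err = select_complementary(fixed, candidates, targets, n_select=2, n_iter=30, rng=rng)
print(chosen, err)
```

Because the fixed columns are always part of the feature vector being scored, a candidate that merely duplicates them (candidate 0 here) cannot lower the error, so the search has no reason to keep it over candidates that add new information.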
It's interesting that we have an evolution of TRAPS variants which got more similar to the Gabor approach over time: first the TRAPS (temporal), then the tri-band TRAPS (spectrotemporal), and then the HATS and tri-band HATS. You could say that a kind of convergence happened between these TRAPS descendants and the Gabor approach.
Hynek Hermansky told me the TRAPS idea was partly inspired by the long temporal extents of the cortical neuron spectrotemporal response fields in deCharms et al. (1998). I know Michael has cited that work, and I wonder if his Gabor filter approach was also partly inspired by that paper. If so, we have a nice little historical story of common origin followed by divergence followed by some convergence.