Gabor Feature Extraction for Automatic Speech Recognition

This page provides articles, filter definitions, software tools, and discussion related to work by Kleinschmidt et al. on automatic speech recognition (ASR) with Gabor feature extraction. The page is maintained by David Gelbart at ICSI, but the work it describes was mainly done by Michael Kleinschmidt and Bernd Meyer in the Medical Physics research group at the Universitaet Oldenburg. Kleinschmidt was a visitor at ICSI in 2001-2002, which is how the connection to ICSI came about.

Publications related to Kleinschmidt et al.'s work on ASR feature extraction with Gabor filters

Work by Kleinschmidt and his collaborators:

Michael Kleinschmidt, Methods for capturing spectro-temporal modulations in automatic speech recognition, Acta Acustica united with Acustica, 88(3), pp. 416-422, 2002. A pre-publication draft is online here.
Michael Kleinschmidt and David Gelbart, Improving word accuracy with Gabor feature extraction, ICSLP 2002. Available here.
Michael Kleinschmidt, Spectro-temporal Gabor features as a front end for automatic speech recognition, Forum Acusticum, 2002. Available here.
Michael Kleinschmidt, Robust speech recognition based on spectrotemporal processing, PhD thesis, 2002, Universitaet Oldenburg. Available online on the Medical Physics publications page.
Michael Kleinschmidt, Localized spectro-temporal features for automatic speech recognition, EUROSPEECH 2003. Available online in the ISCA Archive.
Bernd Meyer and Michael Kleinschmidt, Robust speech recognition based on localized, spectro-temporal features, Elektronische Sprachsignalverarbeitung (ESSV) 2003. Available online here and at Bernd Meyer's home page.
Bernd Meyer, Robust speech recognition based on spectro-temporal features, diploma thesis, 2004, Universitaet Oldenburg. Available online here and at Bernd Meyer's home page.
Bernd Meyer and Birger Kollmeier, Optimization and evaluation of Gabor feature sets for ASR, INTERSPEECH 2008.
And other work co-authored by Bernd Meyer.

Work by others:

There is an INTERSPEECH 2008 special session titled Auditory inspired spectro-temporal features which contains many related papers, for example "Multi-Stream Spectro-Temporal Features for Robust Speech Recognition" by S. Zhao and N. Morgan.

For another approach to Gabor filter analysis in speech processing, see T. Ezzat, J. Bouvrie, and T. Poggio, Max-Gabor Analysis and Synthesis of Spectrograms, ICSLP/INTERSPEECH 2006. A newer paper by the same authors, Spectro-Temporal Analysis of Speech Using 2-D Gabor Filters, is currently (May 2007) available as a submitted paper preprint from Tony Ezzat's home page.

There is work outside the speech processing field which uses Gabor analysis. For example, A. Serrano, I. Diego, C. Conde, E. Cabello, L. Bai, and L. Shen, "Fusion of Support Vector Classifiers for Parallel Gabor Methods Applied to Face Verification", Multiple Classifier Systems, 2007 (link).

Source code

The code used to perform feature selection to find good sets of Gabor filters, following the Feature Finding Neural Networks (FFNN) approach of Tino Gramß (Gramss), is available as an open source package in two parts. The code and documentation can be downloaded here or here (for more information on contents see the included file README.TXT). Sample input data from the OLLO corpus can be downloaded here or here (for more information on contents see the included file CORPUS_FILES_README.TXT). See Kleinschmidt's Forum Acusticum paper or PhD thesis for more details about the feature selection process.

A C++ program for calculating Gabor features from an HTK-format input spectrogram can be downloaded here. The archive also contains ASCII parameter files for the Gabor filter sets G1, G2, and G3 used in the ICSLP 2002 paper and the filter set HB02 used in the ESSV 2003 paper. These parameter files can be used by the C++ program as filter set specifications. For more information, see the included file README.TXT. (Updated October 2006: The README.TXT file and the C++ source code have been changed to make it clearer how to fix CPU endianness issues.)
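To illustrate the kind of endianness issue mentioned above, here is a minimal Python sketch (not part of the C++ package) that decodes an HTK parameter file held in memory. It assumes the standard 12-byte HTK header (number of frames, sample period in 100 ns units, bytes per frame, parameter kind) followed by uncompressed 4-byte float coefficients, and it assumes the conventional big-endian byte order unless told otherwise. The function names and the demo data are made up for this example.

```python
import struct


def parse_htk_header(data, big_endian=True):
    """Decode the 12-byte HTK header at the start of `data` (bytes)."""
    order = ">" if big_endian else "<"
    # Two 32-bit ints followed by two 16-bit ints, no padding.
    n_frames, period, frame_bytes, parm_kind = struct.unpack(order + "iihh", data[:12])
    return {
        "n_frames": n_frames,
        "sample_period_100ns": period,
        "bytes_per_frame": frame_bytes,
        "parm_kind": parm_kind,
    }


def read_htk(data, big_endian=True):
    """Return (header, frames); each frame is a list of float coefficients."""
    order = ">" if big_endian else "<"
    header = parse_htk_header(data, big_endian)
    frame_bytes = header["bytes_per_frame"]
    n_coeffs = frame_bytes // 4  # assumption: uncompressed 4-byte floats
    frames = []
    offset = 12
    for _ in range(header["n_frames"]):
        chunk = data[offset:offset + frame_bytes]
        frames.append(list(struct.unpack(order + str(n_coeffs) + "f", chunk)))
        offset += frame_bytes
    return header, frames


# Hypothetical two-frame, two-coefficient file written in big-endian order.
demo = struct.pack(">iihh", 2, 100000, 8, 9) + struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)

header, frames = read_htk(demo)
print(header["n_frames"], frames)

# Decoding the same header with the wrong byte order gives a nonsensical
# frame count, which is the usual symptom of an endianness mismatch.
print(parse_htk_header(demo, big_endian=False)["n_frames"])
```

If the frame count or frame size read from a real file looks implausible, trying the opposite byte order is a reasonable first check.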

With the Gabor features, we usually obtained the best automatic speech recognition results using "tandem" acoustic modeling, in which a multi-layer perceptron stage precedes Gaussian mixture modeling, rather than modeling the Gabor features directly with diagonal-covariance Gaussian mixture models (this held even when we tried using KLT or LDA on the Gabor features). For a discussion of some other novel features for which multi-layer perceptrons performed better than diagonal-covariance Gaussian mixture models, see here. When incorporating new features into a system, especially if this changes the feature vector length, it may be necessary to re-tune acoustic model configuration parameters, such as the "Gaussian weight", to get the best performance (see Zhu et al.'s 2004 paper "On using MLP features in LVCSR" for a discussion of Gaussian weight tuning).

Download links for ICSI's multi-layer perceptron software, and ICSI's SPRACHcore package, along with some references about the "tandem" acoustic modeling approach, can be found here.

Plots and parameters of the Gabor filters in the ICSLP 2002 paper

The Gabor set G1 parameter list is here and plots of the Gabor filters are here. In the plots, the vertical axis is frequency and the horizontal axis is time.

The Gabor set G2 parameter list is here and plots of the Gabor filters are here. (The G2 parameter list data that was posted here from March 30, 2002 to June 5, 2002 was incorrect.)

The Gabor set G3 parameter list is here and plots of the Gabor filters are here.
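As a rough illustration of what such filters look like numerically, the Python sketch below builds a generic 2-D Gabor function (a complex sinusoid multiplied by an envelope) and correlates it with a spectrogram around one center channel. It assumes a Gaussian envelope and made-up parameter values; the function names, argument names, and array layout are hypothetical and do not reflect the parameter file format or the C++ program. Rows are frequency channels and columns are time frames, matching the axes of the plots.

```python
import cmath
import math


def gabor_filter(omega_t, omega_f, sigma_t, sigma_f, half_t, half_f):
    """Build a complex 2-D Gabor filter: rows are channels, columns are frames."""
    rows = []
    for df in range(-half_f, half_f + 1):
        row = []
        for dt in range(-half_t, half_t + 1):
            # Gaussian envelope (an assumption of this sketch).
            envelope = math.exp(-(dt * dt) / (2 * sigma_t ** 2)
                                - (df * df) / (2 * sigma_f ** 2))
            # Complex sinusoid with temporal and spectral modulation frequencies.
            carrier = cmath.exp(1j * (omega_t * dt + omega_f * df))
            row.append(envelope * carrier)
        rows.append(row)
    return rows


def gabor_feature(spectrogram, filt, center_channel):
    """Correlate `filt` with spectrogram[channel][frame] around one channel.

    Returns one complex value per frame. Filter taps that fall outside the
    spectrogram are skipped, which amounts to zero padding.
    """
    half_f = len(filt) // 2
    half_t = len(filt[0]) // 2
    n_channels = len(spectrogram)
    n_frames = len(spectrogram[0])
    output = []
    for t in range(n_frames):
        acc = 0j
        for i, row in enumerate(filt):
            ch = center_channel + i - half_f
            if not 0 <= ch < n_channels:
                continue
            for j, weight in enumerate(row):
                frame = t + j - half_t
                if 0 <= frame < n_frames:
                    acc += weight * spectrogram[ch][frame]
        output.append(acc)
    return output


# Hypothetical example: 5 channels x 8 frames of a flat spectrogram.
flat = [[1.0] * 8 for _ in range(5)]
filt = gabor_filter(omega_t=math.pi / 4, omega_f=math.pi / 2,
                    sigma_t=2.0, sigma_f=1.0, half_t=4, half_f=2)
feature = gabor_feature(flat, filt, center_channel=2)
print([round(abs(v), 3) for v in feature])
```

Whether the real part, the imaginary part, or the magnitude of the output is used as the feature is left to the caller in this sketch.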

Gabor, TRAPS, HATS

Here is a discussion of how Gabor filter feature extraction relates to the TRAPS and HATS approaches. There is more discussion of the relationship between the Gabor approach and other approaches in the publications listed above.

Other results with Gabor feature extraction

If you try Gabor feature extraction, we'll be interested in hearing about your results, and with your permission we'd like to mention them on this page.

Conversational speech: ICSI ran some experiments on using Gabor feature extraction for conversational telephone speech, starting from a system similar to the one in Morgan et al.'s ICASSP 2004 paper. Frame-level multi-layer perceptron accuracies for the Gabor feature streams were good, but adding the Gabor streams to an existing system gave no significant performance improvement. That system used a feature vector based on HLDA of PLP with deltas, double-deltas, and triple-deltas, concatenated with 25 features from KLT of inverse-entropy combination of PLP and HATS MLP outputs. The Gabor streams were added by including the corresponding MLPs in the inverse entropy combination of MLP outputs.

It's possible that this was because the combination of PLP and HATS (or various other features; the baseline was complex) in the baseline system already covered a lot of the territory covered by Gabor feature extraction. However, it's also possible that the performance with Gabor features could have been improved. For example, we were using a Gabor filter set which had not been optimized for conversational speech or for use together with PLP and HATS.

For optimization for use with PLP and HATS, one approach would have been to find various Gabor filter sets using FFNN, measure the performance of each of them in a full speech recognition system also using PLP and HATS, and choose the best set. Another approach would have been to try integrating PLP and HATS into the FFNN process, e.g., by placing PLP and HATS features permanently in part of the feature vector used by FFNN and allowing other components of the feature vector to be chosen by the usual FFNN Gabor filter selection process. This page summarizes where the Gabor filter work at ICSI ended up.
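For readers unfamiliar with the inverse entropy combination of MLP outputs mentioned above, the following Python sketch shows the basic idea for a single frame: each stream's posterior vector is weighted by the inverse of its entropy, so that more confident (lower entropy) streams count for more. This is a minimal sketch of the basic rule only, not the ICSI implementation; the function names and the entropy floor are assumptions made for the example, and a real system may add refinements such as entropy thresholds.

```python
import math


def entropy(posteriors):
    """Shannon entropy (in nats) of one posterior probability vector."""
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)


def inverse_entropy_combine(stream_posteriors, floor=1e-6):
    """Combine per-stream posteriors for one frame.

    stream_posteriors: list of posterior vectors, one per MLP stream,
    all over the same classes. `floor` (hypothetical) avoids division by
    zero when a stream is completely certain.
    Returns (combined_posteriors, weights).
    """
    inverse = [1.0 / max(entropy(p), floor) for p in stream_posteriors]
    total = sum(inverse)
    weights = [v / total for v in inverse]
    n_classes = len(stream_posteriors[0])
    combined = [
        sum(w * p[k] for w, p in zip(weights, stream_posteriors))
        for k in range(n_classes)
    ]
    return combined, weights


# Hypothetical frame with three classes and two streams: the first stream
# is confident, the second is uninformative.
confident = [0.9, 0.05, 0.05]
uncertain = [1.0 / 3, 1.0 / 3, 1.0 / 3]
combined, weights = inverse_entropy_combine([confident, uncertain])
print(weights)
print(combined)
```

Adding a Gabor stream in this scheme simply means appending that stream's MLP posterior vector to the list being combined.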