ICSI Speech FAQ:
2.3 Why do we use connectionist models rather than GMMs?
Answer by: dpwe - 2000-07-22
(a/k/a "why is this night different from all other nights?")
Neural nets have been a well-established technique for probabilistic
classification ever since their invention -- certainly since the development
of the back-propagation algorithm, which provides a way to 'learn' the
weights in a multi-layer perceptron (MLP) so that it reproduces the target
outputs given in a body of training examples. At ICSI (specifically by Morgan
and Bourlard
in the early 1990s) a particular approach for using neural nets as the
classifiers, or "acoustic models", in speech recognizers was developed:
the so-called hybrid connectionist-HMM model, in which a temporal window
of 9 or so successive feature vectors is presented to the input layer
of a network (typically an MLP with a single hidden layer) whose
outputs estimate, for a set of mutually exclusive speech classes, the
posterior probability that the current frame belongs to each class.
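For concreteness, here is a minimal sketch (in Python/NumPy, which the
original systems did not use; ICSI training and decoding ran through the
QuickNet tools) of the forward pass such a network computes for one frame.
All sizes here are illustrative assumptions, not the actual ICSI
configuration:

    import numpy as np

    # Illustrative sizes (assumptions): a 9-frame window of 13-dim
    # features, one hidden layer, one softmax output per phone class.
    N_FRAMES, FEAT_DIM, N_HIDDEN, N_PHONES = 9, 13, 500, 54

    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.1, (N_FRAMES * FEAT_DIM, N_HIDDEN))
    b1 = np.zeros(N_HIDDEN)
    W2 = rng.normal(0, 0.1, (N_HIDDEN, N_PHONES))
    b2 = np.zeros(N_PHONES)

    def frame_posteriors(window):
        """window: (N_FRAMES, FEAT_DIM) feature context around the
        current frame. Returns P(class | acoustics) per phone class."""
        h = np.tanh(window.reshape(-1) @ W1 + b1)  # hidden layer
        z = h @ W2 + b2                            # output activations
        e = np.exp(z - z.max())                    # softmax
        return e / e.sum()                         # posteriors sum to 1

In training, the weights would be adjusted by back-propagation so that
these outputs match the phonetic targets for each frame.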
In the years since, the majority of speech recognition systems have
used parametric distribution models to estimate the likelihoods of
particular feature vectors given a particular speech class. King
of the distribution models is the Gaussian mixture model (GMM).
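For comparison, the quantity a GMM acoustic model computes for each
frame and class is a likelihood rather than a posterior. A minimal
sketch, assuming the common diagonal-covariance form (all parameters
here are placeholders, not trained values):

    import numpy as np

    def gmm_log_likelihood(x, weights, means, variances):
        """log p(x | class) for one feature vector x under a
        diagonal-covariance Gaussian mixture: weights is (M,),
        means and variances are (M, D) for M components."""
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
        log_comp = np.log(weights) + log_norm + log_exp
        m = log_comp.max()                  # log-sum-exp over components
        return m + np.log(np.sum(np.exp(log_comp - m)))

A recognizer keeps one such mixture per (sub-phonetic) state, trained
separately for each, which is where the state-tying machinery mentioned
below comes in.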
At ICSI, and at the labs to which we have close links (Cambridge, Sheffield,
FPMons, IDIAP and a few others), we have held on to neural net models
despite the flood of techniques and tools (such as HTK) specific to
GMM systems. Much of the reason may be historical, i.e. we have continued
along the path we were already on, but there are a number of reasons why
this is a reasonable thing to do:
- Performance: Neural network systems are comparable in
performance to other models. On small vocabulary tasks, there
is little to choose between them. On large vocabulary tasks
such as Broadcast News, our connectionist systems make perhaps
as many as 50% relative more errors than the best systems
(which are GMM based), but the best systems are much more
complicated in many other ways (adaptation, context-dependence etc.).
- Simplicity: The structure of the neural net system is
rather simple. Whereas Gaussian mixture systems typically
rely on a rather fine sub-phonetic state division, and in
consequence have complicated state-tying infrastructure to
maximise training efficiency, a connectionist acoustic model
can be a single neural net with a few tens of outputs, each of
which has a direct interpretation as a particular phone. In a
sense, much of the complexity of context-dependence, subphone
states and parameter sharing is accomplished automatically
within the single net-training operation, and is hidden from
the user.
- Flexibility/forgivingness: Whereas Gaussian models often
perform very differently in response to relatively minor changes
in feature representation, such as the rotation of cepstra
relative to spectra, neural net models are much better at
adapting to quirks in input feature distributions. Correlations,
nonuniformity, etc. make little impact. Connectionist systems
perform about the same with spectral and cepstral-domain
representations, and I have not yet found a way to make
modulation-filtered spectrogram features (MSGs) acceptable
to HTK, other than by passing them first through a net.
- Availability of training systems: The key to the success
of the original hybrid connectionist-HMM systems was the
development of the training algorithm that, through back-propagation,
was able to train rather large networks to reproduce fine-scale
phonetic targets obtained from hand labeling or forced alignment.
Since then, the basic algorithm has been
given a highly optimized implementation in the QuickNet
programs (see common ICSI programs)
and in particular has been ported to our custom T0/SPERT
vector microprocessor systems. This makes it much easier and
faster for us to train neural network acoustic models than for
most people in the world.
- Discrimination: Because the neural net calculates the
posterior probability of all possible classes in a single step,
it can focus on discriminating between classes that
might be most prone to confusion. By contrast, in distribution
models each class is represented by a separate set of parameters
without regard to the other classes, so there is typically no
method to make more precise models of the critical regions of
feature space (discriminative training of GMMs is an active
area of research, however - see the Cambridge 1999 Broadcast
News system).
- Posteriors: The networks estimate true posterior probabilities
rather than the likelihoods (arbitrarily scaled probabilities)
resulting from distribution models. Posteriors present all
kinds of interesting application opportunities, including
confidence estimation, special efficiency tricks in hypothesis
search, and visualization (the standard conversion from posteriors
to decoder likelihoods is sketched just after this list).
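On that last point: a standard Viterbi decoder expects likelihoods, so
hybrid systems convert the network's posteriors into "scaled
likelihoods" by dividing by the class priors. By Bayes' rule,
P(q|x)/P(q) = p(x|q)/p(x), and the common factor p(x) cannot change
the best path. A minimal sketch (the priors would normally be the
relative class frequencies in the training alignments; the values
below are purely illustrative):

    import numpy as np

    def scaled_likelihoods(posteriors, priors):
        """Convert net outputs P(q | x) into p(x | q) / p(x), which can
        stand in for per-class likelihoods during Viterbi decoding."""
        return np.asarray(posteriors) / np.asarray(priors)

    post = np.array([0.7, 0.2, 0.1])    # illustrative net outputs
    priors = np.array([0.5, 0.3, 0.2])  # illustrative class priors
    print(scaled_likelihoods(post, priors))  # approx. [1.4 0.667 0.5]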
For more discussion of this matter, see the following article:
N. Morgan and H. Bourlard, "Continuous Speech Recognition:
An Introduction to the Hybrid HMM/Connectionist Approach."
IEEE Signal Processing Magazine, pp. 25-42, May 1995.