ICSI Speech FAQ:
2.3 Why do we use connectionist models rather than GMMs?
Answer by: dpwe - 2000-07-22
(a/k/a "why is this night different from all other nights?")
Neural nets have been a well-established technique for probabilistic
classification ever since their invention -- certainly since the development
of the back-propagation algorithm, which provides a way to 'learn' the
weights in a multi-layer perceptron (MLP) so that it reproduces the target
outputs given in a body of training examples. At ICSI (specifically by Morgan
and Bourlard
in the early 1990s) a particular approach for using neural nets as the
classifiers, or "acoustic models", in speech recognizers was developed:
the so-called hybrid connectionist-HMM model, in which a temporal window
of 9 or so successive feature vectors is presented to the input layer
of a network (typically an MLP with a single hidden layer) whose
outputs estimate, for a set of mutually exclusive speech classes, the
posterior probability that the current frame belongs to each class.
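For concreteness, here is a minimal sketch (in Python/NumPy, which the
original systems did not use; ICSI training and decoding ran through the
QuickNet tools) of the forward pass such a network computes for one frame.
All sizes here are illustrative assumptions, not the actual ICSI
configuration:

    import numpy as np

    # Illustrative sizes (assumptions): a 9-frame window of 13-dim
    # features, one hidden layer, one softmax output per phone class.
    N_FRAMES, FEAT_DIM, N_HIDDEN, N_PHONES = 9, 13, 500, 54

    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.1, (N_FRAMES * FEAT_DIM, N_HIDDEN))
    b1 = np.zeros(N_HIDDEN)
    W2 = rng.normal(0, 0.1, (N_HIDDEN, N_PHONES))
    b2 = np.zeros(N_PHONES)

    def frame_posteriors(window):
        """window: (N_FRAMES, FEAT_DIM) feature context around the
        current frame. Returns P(class | acoustics) per phone class."""
        h = np.tanh(window.reshape(-1) @ W1 + b1)  # hidden layer
        z = h @ W2 + b2                            # output activations
        e = np.exp(z - z.max())                    # softmax
        return e / e.sum()                         # posteriors sum to 1

In training, the weights would be adjusted by back-propagation so that
these outputs match the phonetic targets for each frame.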
In the years since, the majority of speech recognition systems have
used parametric distribution models to estimate the likelihoods of
particular feature vectors given a particular speech class. King
of the distribution models is the Gaussian mixture model (GMM).
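For comparison, the quantity a GMM acoustic model computes for each
frame and class is a likelihood rather than a posterior. A minimal
sketch, assuming the common diagonal-covariance form (all parameters
here are placeholders, not trained values):

    import numpy as np

    def gmm_log_likelihood(x, weights, means, variances):
        """log p(x | class) for one feature vector x under a
        diagonal-covariance Gaussian mixture: weights is (M,),
        means and variances are (M, D) for M components."""
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
        log_comp = np.log(weights) + log_norm + log_exp
        m = log_comp.max()                  # log-sum-exp over components
        return m + np.log(np.sum(np.exp(log_comp - m)))

A recognizer keeps one such mixture per (sub-phonetic) state, trained
separately for each, which is where the state-tying machinery mentioned
below comes in.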
At ICSI, and at the labs to which we have close links (Cambridge, Sheffield,
FPMons, IDIAP and a few others), we have held on to neural net models
despite the flood of techniques and tools (such as HTK) specific to
GMM systems. Much of the reason may be historical, i.e. we have continued
along the path we were already on, but there are a number of reasons why
this is a reasonable thing to do:
- Performance: Neural network systems are comparable in
performance to other models. On small vocabulary tasks, there
is little to choose between them. On large vocabulary tasks
such as Broadcast News, our connectionist systems make perhaps
as many as 50% relative more errors than the best systems
(which are GMM based), but the best systems are much more
complicated in many other ways (adaptation, context-dependence etc.).
- Simplicity: The structure of the neural net system is
rather simple. Whereas Gaussian mixture systems typically
rely on a rather fine sub-phonetic state division, and in
consequence have complicated state-tying infrastructure to
maximise training efficiency, a connectionist acoustic model
can be a single neural net with a few tens of outputs, each of
which has a direct interpretation as a particular phone. In a
sense, much of the complexity of context-dependence, subphone
states and parameter sharing is accomplished automatically
within the single net-training operation, and is hidden from
the user.
- Flexibility/forgivingness: Whereas Gaussian models often
perform very differently in response to relatively minor changes
in feature representation, such as the rotation of cepstra
relative to spectra, neural net models are much better at
adapting to quirks in input feature distributions. Correlations,
nonuniformity, etc. make little impact. Connectionist systems
perform about the same with spectral and cepstral-domain
representations, and I have not yet found a way to make
modulation-filtered spectrogram features (MSGs) acceptable
to HTK, other than by passing them first through a net.
- Availability of training systems: The key to the success
of the original hybrid connectionist-HMM systems was the
development of the training algorithm that, through back-propagation,
was able to train rather large networks to reproduce fine-scale
phonetic targets obtained from hand labeling or forced alignment.
Since then, the basic algorithm has been
given a highly optimized implementation in the QuickNet
programs (see common ICSI programs)
and in particular has been ported to our custom T0/SPERT
vector microprocessor systems. This makes it much easier and
faster for us to train neural network acoustic models than for
most people in the world.
- Discrimination: Because the neural net calculates the
posterior probability of all possible classes in a single step,
it can focus on discriminating between classes that
might be most prone to confusion. By contrast, in distribution
models each class is represented by a separate set of parameters
without regard to the other classes, so there is typically no
method to make more precise models of the critical regions of
feature space (discriminative training of GMMs is an active
area of research, however - see the Cambridge 1999 Broadcast
News system).
- Posteriors: The networks estimate true posterior probabilities
rather than the likelihoods (arbitrarily scaled probabilities)
resulting from distribution models. Posteriors present all
kinds of interesting application opportunities, including
confidence estimation, special efficiency tricks in hypothesis
search, and visualization (the standard conversion from posteriors
to decoder likelihoods is sketched just after this list).
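On that last point: a standard Viterbi decoder expects likelihoods, so
hybrid systems convert the network's posteriors into "scaled
likelihoods" by dividing by the class priors. By Bayes' rule,
P(q|x)/P(q) = p(x|q)/p(x), and the common factor p(x) cannot change
the best path. A minimal sketch (the priors would normally be the
relative class frequencies in the training alignments; the values
below are purely illustrative):

    import numpy as np

    def scaled_likelihoods(posteriors, priors):
        """Convert net outputs P(q | x) into p(x | q) / p(x), which can
        stand in for per-class likelihoods during Viterbi decoding."""
        return np.asarray(posteriors) / np.asarray(priors)

    post = np.array([0.7, 0.2, 0.1])    # illustrative net outputs
    priors = np.array([0.5, 0.3, 0.2])  # illustrative class priors
    print(scaled_likelihoods(post, priors))  # approx. [1.4 0.667 0.5]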
For more discussion of this matter, see the following article:
N. Morgan and H. Bourlard, "Continuous Speech Recognition:
An Introduction to the Hybrid HMM/Connectionist Approach."
IEEE Signal Processing Magazine, pp. 25-42, May 1995.