ICSI Speech FAQ:
6.10 What is posterior combination? What other kinds of combination are possible?

Answer by: dpwe - 2000-08-11

The basic idea behind all combination strategies is that you may have several different approaches to a particular part of your problem - different feature extraction routines, different modeling approaches, different HMM structures - that perform similarly even though they are doing different things. If you can come up with a simple strategy to use them together within a single system, perhaps you can combine their strengths and ameliorate their individual weaknesses. This is a well-established idea in pattern recognition.

There are many different kinds of combination that could be applied to speech recognition. Su-Lin Wu investigated several in her Ph.D. dissertation. I also wrote a rather superficial description of different combination methods for AVIOS-2000, "Improved Recognition by Combining Different Features and Different Systems". You'll find several more papers discussing combinations on the publications page.

The favorite scheme (my favorite anyway) is posterior combination, which is also one of the simplest. Since our acoustic model neural nets all generate estimates of the posterior probabilities of a common set of phone symbols (assuming they were trained using compatible targets), we can simply average these outputs across different networks (which might be trained on different features, or different training sets, or with any other kind of variation up to that point in the system). Averaging in the log-posterior domain - that is, taking the geometric mean - is closely related to the theoretically correct rule for the case where the two systems are conditionally independent given the phone class. That is the situation you want: the systems should behave differently, but not so differently that they fail to agree on the right answer in most cases.
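As a rough illustration of the two averaging rules, here is a minimal Python sketch for a single frame. The function name combine_posteriors, the floor value, and the toy three-phone posteriors are all made up for the example; this is not how lnaMerge or any ICSI tool is implemented. For reference, under conditional independence given the phone class q, the exact rule is P(q|x1,x2) proportional to P(q|x1)P(q|x2)/P(q); the geometric mean below omits the division by the prior and flattens the product with an exponent of 1/2, so it is only a relative of that rule.

```python
import math


def combine_posteriors(streams, mode="arithmetic", floor=1e-10):
    """Combine one frame of phone posteriors from several nets.

    streams: list of posterior vectors (one per net), all over the same
             phone set in the same order.
    mode:    "arithmetic" averages the probabilities directly;
             "geometric" averages the log posteriors.
    floor:   hypothetical lower bound to avoid log(0).
    """
    n_classes = len(streams[0])
    if any(len(s) != n_classes for s in streams):
        raise ValueError("all streams must cover the same phone set")

    if mode == "arithmetic":
        combined = [sum(s[k] for s in streams) / len(streams)
                    for k in range(n_classes)]
    elif mode == "geometric":
        combined = [
            math.exp(sum(math.log(max(s[k], floor)) for s in streams)
                     / len(streams))
            for k in range(n_classes)
        ]
    else:
        raise ValueError("mode must be 'arithmetic' or 'geometric'")

    # The geometric mean of distributions does not sum to one, so renormalize.
    total = sum(combined)
    return [p / total for p in combined]


# Toy posteriors over three phones from two hypothetical nets.
plp_frame = [0.7, 0.2, 0.1]
msg_frame = [0.5, 0.4, 0.1]

arith = combine_posteriors([plp_frame, msg_frame], mode="arithmetic")
geom = combine_posteriors([plp_frame, msg_frame], mode="geometric")
print("arithmetic:", arith)
print("geometric: ", geom)
```

In a real system the same operation would be applied to every frame of the two posterior streams before decoding.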

Posterior combination is implemented simply by running the two front ends up to the qnsfwd output, then merging the LNA posterior output files (presumed conformal, i.e. covering the same frames and the same phone set) using lnaMerge (from SoftSound). The merged LNA is then passed on to the decoder as if it had come from a single, better, neural net.

We've had particular success combining different feature streams, such as PLP and MSG, using this approach. In our Broadcast News experiments, we got WERs of 24.5% for an 8000-hidden-unit (HU) PLP net, 29.4% for a similar net based on MSG features, and 23.8% from the posterior combination of the two (a 3% relative improvement over the better of the individual systems). In other tests, we've gotten relative improvements of up to 10%. Here's some analysis of combining MSG-based MLPs with the PLP-based RNN networks we were using for the TREC evaluations (in collaboration with Sheffield):

Test set       RNN98 WER%   RNN98+MSG8000 WER%   Improvement
h4e_97         27.2         24.9                 2.3 abs / 8.4% rel
h4e_98_1       25.1         23.3                 1.8 abs / 7.1% rel
h4e_98_2       23.9         22.7                 1.2 abs / 5.0% rel

TREC98 eval    32.0         29.2                 2.8 abs / 8.8% rel
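
For readers who want to check the arithmetic, the absolute and relative improvements can be recomputed from the WERs in the table with a few lines of Python. The helper name wer_improvement is just for this sketch. Recomputing from the one-decimal WERs shown here can differ from the table in the last digit, presumably because the original figures were derived from unrounded values or rounded differently.

```python
def wer_improvement(baseline_wer, combined_wer):
    """Return (absolute, relative-percent) WER improvement over the baseline."""
    absolute = baseline_wer - combined_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# (baseline RNN98 WER%, RNN98+MSG8000 WER%) copied from the table above.
results = {
    "h4e_97": (27.2, 24.9),
    "h4e_98_1": (25.1, 23.3),
    "h4e_98_2": (23.9, 22.7),
    "TREC98 eval": (32.0, 29.2),
}

improvements = {name: wer_improvement(base, comb)
                for name, (base, comb) in results.items()}

for name, (absolute, relative) in improvements.items():
    print(f"{name}: {absolute:.1f} abs / {relative:.1f}% rel")
```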

