There is a huge variety of neural network forms that have been tried over the years. For a general introduction, you might try the comp.ai.neural-nets FAQ.
However, at the ICSI speech group, the vast majority of nets are of a single, simple type: Fully-connected, feed-forward multi-layer perceptrons (MLPs) with a single hidden layer. MLPs are a very neutral and generic kind of network architecture that has worked well for us. One explanation for why we use them so much (and so few other people do) is that we have, over the years, developed highly sophisticated and efficient software and hardware for this rather specific task.
In addition to their geometry (number of layers, layer sizes, and interconnections), MLPs are also differentiated by their nonlinearity, as well as by their training criteria. The net training program qnstrn supports several variants, but we almost always use a softmax output nonlinearity with a minimum-cross-entropy error criterion, trained with the classic back-propagation algorithm. For more information, see the net training page in this FAQ.
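As a rough illustration of that configuration, here is a minimal sketch of a single-hidden-layer MLP with a softmax output and a cross-entropy error. It is not QuickNet code: the function names, the list-of-rows weight layout, and the sigmoid hidden nonlinearity are all assumptions made for the example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softmax(z):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(weights, bias, x):
    """Compute W x + b, with W stored as a list of rows (assumed layout)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mlp_forward(x, w_hid, b_hid, w_out, b_out):
    """One hidden layer (sigmoid assumed here), softmax output layer."""
    hidden = [sigmoid(a) for a in affine(w_hid, b_hid, x)]
    posteriors = softmax(affine(w_out, b_out, hidden))
    return hidden, posteriors

def cross_entropy(posteriors, target_index):
    """Error for one frame whose correct class is target_index."""
    return -math.log(posteriors[target_index])

def output_delta(posteriors, target_index):
    """Error signal at the output layer before the softmax: y - t."""
    return [p - (1.0 if k == target_index else 0.0)
            for k, p in enumerate(posteriors)]
```

One reason this pairing is convenient is visible in output_delta: with a softmax output and a cross-entropy error, the error signal that back-propagation starts from is simply the posterior vector minus the one-hot target.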
The feed-forward MLP is a memoryless classifier: You present a pattern on its input units, the output units respond with an activation pattern (the phone posteriors in most of our cases), and those outputs depend only on the inputs at that moment. However, several researchers have noted that the appropriate interpretation of a speech sound is highly context-dependent. MLPs can accommodate this somewhat by using a temporal context of several frames (spanning, say, 100-200ms of signal), but it's not a terribly neat way to exploit the problem structure.
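The temporal-context trick amounts to concatenating each frame with its neighbours before presenting it to the net. The sketch below shows one way that could look. The function name, the default of four frames on each side, and the choice to repeat the first and last frames at the utterance edges are all assumptions for illustration, not a description of our feature pipeline.

```python
def stack_context(frames, left=4, right=4):
    """Return one concatenated input vector per frame, covering
    frames t-left .. t+right (hypothetical helper)."""
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-left, right + 1):
            # Assumption: clamp at the edges, repeating the first/last frame.
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked
```

With the defaults, each input vector covers 9 frames, so the input layer is 9 times the size of a single feature vector. The net itself is still memoryless; it just sees a wider slice of the signal at each step.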
Far more efficient are the recurrent neural networks (RNNs) used by Tony Robinson and his co-workers at Cambridge. In these nets, there is a set of context units whose outputs are fed back, via a one-step delay, as inputs for the next frame. RNNs are much more complicated to train, since each training pattern must be 'unwound in time' to find the appropriate back-propagation values for each parameter. But once the training is cracked, it's a very efficient structure. For the Broadcast News work, the Cambridge RNN with 256 hidden units outperformed our 8000 HU MLP.
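To make the feedback idea concrete, here is a minimal sketch of the forward pass of a simple recurrent net, in which the context units from one frame are fed back as extra inputs at the next. It is a generic illustration under assumed names, weight layout, and tanh context units, not the Cambridge architecture, and it omits training (the unwinding in time) entirely.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot(row, vec):
    return sum(w * v for w, v in zip(row, vec))

def rnn_forward(frames, w_in, w_ctx, w_out):
    """Run a sequence of frames through a simple recurrent net.

    w_in[j]  : weights from the current input frame to context unit j
    w_ctx[j] : weights from the previous context units to context unit j
    w_out[k] : weights from the context units to output k
    (all hypothetical names; biases left out for brevity)
    """
    context = [0.0] * len(w_ctx)  # context starts empty at utterance start
    outputs = []
    for x in frames:
        # One-step delay: the right-hand side still sees the previous context.
        context = [math.tanh(dot(w_in[j], x) + dot(w_ctx[j], context))
                   for j in range(len(w_ctx))]
        outputs.append(softmax([dot(row, context) for row in w_out]))
    return outputs
```

Unlike the MLP, presenting the same frame twice in a row generally gives two different outputs, because the second presentation also sees the context left behind by the first.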
Still more exotic neural net structures are being investigated by Lokendra Shastri and Shawn Chang. Neither these models, however, nor the RNNs of Cambridge, are able to take advantage of the optimized QuickNet software or the accelerated Spert vector microprocessors that we can use for MLPs, giving MLPs something of a local unfair advantage.
Generated by build-faq-index on Tue Mar 24 16:18:16 PDT 2009