Neural nets, such as the single-hidden-layer MLPs typically used at ICSI, are extremely versatile classifiers (or function approximators, or various other things). The trick, however, is in finding the correct settings, in the vast parameter space, to achieve the desired input/output transformation (our largest nets have millions of weights). This is achieved by 'training' the network weights on a large training set of example input/output pairs, with the goal of having the neural net somehow generalize across these relationships in its internal representations, so that previously unseen inputs will result in an appropriate output.
A good training procedure is the key to making the nets a useful tool. Over the years that we have been using nets at ICSI, we have settled on a pretty robust and efficient training procedure, although it took some time to reach this spot (and there are no guarantees of optimality; indeed, you can usually get marginally better performance by forcing the training to run a little longer than normal).
Most neural net training procedures are based on the original 'back-propagation algorithm' from the 1960s, in which the partial derivative of an error measure with respect to each weight in the network is used to shift that weight a little bit in the other direction (i.e. to reduce the error). The size of this step is governed by a parameter called the learning rate, which is a compromise between speed of learning and precision of result.
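In code, a single back-propagation update amounts to a step against the error gradient. The sketch below uses a hypothetical scalar weight for illustration; real back-propagation applies the same rule to every weight, with the gradients computed via the chain rule through the network's layers:

```python
# Minimal sketch of one gradient-descent weight update (illustrative
# names; in a real net this is applied to every weight in parallel).

def update_weight(w, dE_dw, learning_rate):
    """Move weight w a small step in the direction that reduces the error."""
    return w - learning_rate * dE_dw

w = 0.5
gradient = 2.0  # partial derivative of the error with respect to w
w = update_weight(w, gradient, learning_rate=0.008)
print(w)        # 0.5 - 0.008 * 2.0 = 0.484
```

A larger learning rate takes bigger steps (faster but coarser); a smaller one converges more precisely but more slowly, which is exactly the trade-off the text describes.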
The procedure employed by qnstrn makes multiple passes through the entire set of training patterns; each pass is called an epoch. After each epoch, the performance of the net is measured directly on a small set of training patterns, called the cross-validation (CV) set, held out from the main training. The cross-validation frame accuracy (CVFA) is the proportion of patterns in this set for which the net output unit with the greatest activation matches the designated target output.
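The CVFA measure is simple to state in code. This is a sketch with illustrative names (not QuickNet's own internals): for each CV pattern, take the index of the largest output activation and compare it to the target class index.

```python
# Sketch of cross-validation frame accuracy (CVFA): the fraction of
# held-out patterns whose highest-activation output unit matches the
# designated target. Names here are illustrative.

def cvfa(outputs, targets):
    """outputs: per-pattern lists of activations; targets: class indices."""
    correct = sum(1 for acts, t in zip(outputs, targets)
                  if max(range(len(acts)), key=acts.__getitem__) == t)
    return correct / len(targets)

outs = [[0.1, 0.7, 0.2],   # argmax is unit 1
        [0.6, 0.3, 0.1],   # argmax is unit 0
        [0.2, 0.2, 0.6]]   # argmax is unit 2
print(cvfa(outs, [1, 0, 0]))   # 2 of the 3 CV patterns are correct
```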
The specific algorithm that qnstrn uses by default, known locally as "newbob", is to start with a reasonably fast learning rate of 0.008, and to repeat epochs until the CVFA increases by less than 0.5% over the previous epoch. After that, the learning rate is halved before each epoch to home in with increasing precision on the local optimum. Initially, reducing the learning rate leads to a big boost in CVFA, but eventually the learning rate becomes so small that the improvements are minimal. Training ceases after an epoch in which the CVFA again improves by less than 0.5%.
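The newbob control logic can be sketched as follows. This is a reconstruction of the schedule as described above, not QuickNet's actual code; `train_one_epoch` is a hypothetical stand-in for a full training pass that returns the resulting CVFA (in percent):

```python
# Sketch of the "newbob" schedule: hold the learning rate at 0.008
# until CVFA improves by < 0.5% over the previous epoch, then halve
# the rate before each subsequent epoch; stop when the improvement
# is again < 0.5%.  train_one_epoch(lr) is a hypothetical stand-in.

def newbob(train_one_epoch, lr=0.008, threshold=0.5):
    prev_acc = train_one_epoch(lr)   # first epoch at the initial rate
    ramping = False                  # have we started halving yet?
    while True:
        if ramping:
            lr /= 2.0                # halve before each epoch in the ramp
        acc = train_one_epoch(lr)
        small_gain = (acc - prev_acc) < threshold
        prev_acc = acc
        if small_gain:
            if ramping:
                return acc           # second small gain: stop training
            ramping = True           # first small gain: start halving
```

For example, with CVFAs of 50.0, 60.0, 60.2, 65.0, 65.1 over successive epochs, the halving starts after the 60.0 to 60.2 epoch (gain 0.2%) and training stops after the 65.0 to 65.1 epoch (gain 0.1%).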
One advantage of an adaptive training scheme based on cross-validation is that it provides some protection against over-training. If a very large net were trained on a relatively small training set, it could in theory learn each of the specific examples in the training set and become perfectly accurate on them. However, it would not have learned to 'generalize' the problem, and might not behave predictably on unseen patterns. The correct approach is to use a more appropriate ratio of training data to network weights; even so, because the cross-validation accuracy stops improving once the network begins to overlearn the specific training patterns, our training procedure will tend to stop training before overlearning can occur.
Apart from learning rate control, qnstrn includes options to control the output nonlinearity / error criterion type, the training pattern bunch size, and the arithmetic implementation used. The default output nonlinearity of softmax ensures that the activations sum to 1 (like true posteriors) and is computationally well-matched to the minimum cross-entropy error criterion it implies. Alternatives include sigmoidx (cross-entropy with a conventional sigmoid nonlinearity), sigmoid with a mean-squared error criterion, and linear output units, also with an MSE criterion.
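The "computationally well-matched" claim can be made concrete: with a softmax output and a cross-entropy criterion, the error signal at the output layer reduces to simply (activation - target), so no derivative of the nonlinearity need be computed there. A sketch (illustrative names, one-hot target assumed):

```python
import math

# Sketch: softmax activations sum to 1 (posterior-like), and paired
# with cross-entropy the output-layer error signal simplifies to
# (activation - target) -- which is what makes the pairing cheap.

def softmax(xs):
    m = max(xs)                              # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

acts = softmax([2.0, 1.0, 0.1])
print(sum(acts))                             # sums to 1, like posteriors

target = [1.0, 0.0, 0.0]                     # one-hot target vector
grad = [a - t for a, t in zip(acts, target)] # output-layer error signal
```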
Bunch-mode training accumulates the error for a number of patterns before back-propagating to update the weights. This can offer considerable speed-ups at a small cost in final net precision. We typically use a bunch size of 16 and see a small, statistically insignificant degradation in error rate for a factor of 2 or more speedup.
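In outline, bunch mode replaces a per-pattern weight update with one update per bunch of accumulated gradients. A minimal sketch, again with a hypothetical scalar weight and a caller-supplied gradient function (none of these names come from QuickNet):

```python
# Sketch of bunch-mode training: accumulate the gradient over a
# "bunch" of patterns, then apply a single weight update per bunch.
# Fewer updates means better use of vectorized arithmetic.

def bunch_train(w, patterns, grad_fn, lr=0.008, bunch=16):
    """One pass over patterns, updating scalar weight w once per bunch."""
    for start in range(0, len(patterns), bunch):
        g = sum(grad_fn(w, p) for p in patterns[start:start + bunch])
        w -= lr * g          # single update for the whole bunch
    return w
```

The speed-up comes from doing one large, vectorizable update instead of many small ones; the small precision cost arises because all patterns in a bunch see the same (slightly stale) weights.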
One of the great achievements of the QuickNet software was that it provided a seamless, unified interface to net training that could run on either a floating-point or a fixed-point implementation. Fixed point is less accurate and more temperamental, and indeed rather slow on a conventional CPU, but it can be very fast when properly vectorized on the 8-way parallel integer ALU of the custom Torrent T0 vector microprocessor of our spert boards. qnstrn still runs significantly faster on the sperts than on conventional processors, but it runs almost identically (modulo arithmetic differences) on them both, compiled from the same code base.
For the practical problem of training a network, you need some features and some targets for your training corpus. Then it is simply a matter of running qnstrn according to the specifications in its man page. You can run it either on your host CPU, or, where available, on a SPERT board.
The most practical way to figure out how to train a net is probably to copy an existing script. Because qnstrn needs a lot of options, it is almost always invoked via a little shell script, and you will find a lot of these scripts around. Several of the /u/drspeech/data trees contain an experiments/ subdirectory containing the files associated with various trainings, tests, etc. An experiment directory called trainXX/ might contain a shell script called train.sh (or maybe swbt.run, in honor of the predecessor to qnstrn, which was called swbtrn because it was developed specially for Switchboard) showing all the options used to train the net in the directory. This, in conjunction with the man page, should be all you need (I hope).
Generated by build-faq-index on Tue Mar 24 16:18:16 PDT 2009