ICSI Speech FAQ:
6.8 How does neural net size affect performance?

Answer by: dpwe - 2000-08-10


In the single-hidden-layer, feed-forward multi-layer perceptron (MLP) nets we generally use, the output layer size is determined by the number of phone classes involved, and the input layer size is fixed once we have chosen our feature vector and temporal context window width. The hidden layer, however, is a free variable. Increasing the number of hidden units increases the total number of weights (parameters) in the network, which increases the training time more or less linearly, and usually improves performance -- but by a far less predictable factor.
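
As a concrete illustration of how the hidden layer size drives the parameter count (and hence, roughly linearly, the training time), here is a minimal Python sketch; the feature dimension, context window width and phone count used below are illustrative assumptions, not a specific ICSI configuration.

    # Illustrative sketch: parameter count of the single-hidden-layer MLP
    # described above, as a function of the free hidden layer size.
    def mlp_num_params(feature_dim, context_frames, n_hidden, n_phones):
        """Total weights and biases for input -> hidden -> output."""
        n_input = feature_dim * context_frames        # feature vector x context window
        input_to_hidden = (n_input + 1) * n_hidden    # +1 for the bias unit
        hidden_to_output = (n_hidden + 1) * n_phones
        return input_to_hidden + hidden_to_output

    # Example with assumed sizes: 13-dim features, 9-frame context, 54 phone classes.
    for n_hidden in (500, 1000, 2000, 4000):
        print(n_hidden, mlp_num_params(13, 9, n_hidden, 54))

Doubling the hidden layer roughly doubles the parameter count, and so roughly doubles the training time per pass through the data.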

The essence of what we want our neural network to do is generalize. That is, we train it on a certain set of previously labeled examples, and we want it to reproduce that labeling on unseen examples. If the network has too many parameters, it can afford to learn individual examples from the training set, thereby improving its classification accuracy on those examples (which is what the back-propagation algorithm actually works towards), but without improving -- indeed, ultimately hurting -- generalization. In principle, this imposes an upper limit on the size of the network we want to use for a given task, which depends on the amount of training data available. More training data will permit the useful training of a larger network.
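
To make this concrete, here is a minimal sketch (assumed for illustration, not our actual training code) of how over-training shows up when frame accuracy is tracked on the training set and on a held-out cross-validation set after each epoch: training accuracy keeps rising while cross-validation accuracy flattens or falls.

    # Illustrative sketch: flag over-training by comparing training-set and
    # held-out (cross-validation) frame accuracy from one epoch to the next.
    def detect_overtraining(train_acc, cv_acc):
        """Return the first epoch where CV accuracy drops while training
        accuracy still improves -- a sign the net is memorizing examples."""
        for epoch in range(1, len(cv_acc)):
            train_improved = train_acc[epoch] > train_acc[epoch - 1]
            cv_worsened = cv_acc[epoch] < cv_acc[epoch - 1]
            if train_improved and cv_worsened:
                return epoch
        return None  # no over-training seen in these epochs

    # Made-up frame accuracies, purely for illustration:
    train_acc = [0.55, 0.62, 0.66, 0.69, 0.71]
    cv_acc    = [0.53, 0.58, 0.60, 0.59, 0.57]
    print(detect_overtraining(train_acc, cv_acc))   # -> 3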

In practice, the cross-validation and early-stopping criteria used in our training algorithm provide protection against over-training. All this means, however, is that beyond a certain size, adding hidden units ceases to improve performance. In extensive experiments varying hidden layer size and training set size (reported in a paper at ICASSP-99 entitled "Size Matters"), we found that net performance continued to improve as training set size increased, but that there was an optimal ratio of training patterns to net parameters of about 25:1 -- i.e. training a large 1-million-parameter net called for about 25 million training patterns, or somewhat more than 100 hours of training data (at a 16 ms frame step).
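
As a back-of-the-envelope sketch of that rule of thumb (using only the constants quoted above, and purely for illustration), the available data can be turned into a suggested parameter budget like this:

    # Illustrative sizing from the 25:1 patterns-to-parameters ratio above.
    FRAME_STEP_SEC = 0.016        # 16 ms frame step
    PATTERNS_PER_PARAM = 25       # empirical ratio from the ICASSP-99 experiments

    def suggested_params(hours_of_data):
        frames = hours_of_data * 3600.0 / FRAME_STEP_SEC   # training patterns
        return frames / PATTERNS_PER_PARAM

    # 25 million frames * 16 ms/frame ~= 400,000 s ~= 111 hours of speech,
    # so about 111 hours supports a net of roughly 1 million parameters:
    print(suggested_params(111))   # -> ~1.0e6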

This ratio (25:1) is certainly not a universal constant -- it will depend on the task, feature vector size, training set variability, etc. But it is interesting and helpful to note that it does at least appear to be constant over several orders of magnitude in training time for a given task. Note also that, although the corresponding graph is not shown in the paper, when training was dataset- or net-size-limited we still always got an improvement from doubling the training time -- even when it took us away from the ideal ratio. It's just that the improvement was not as great as it would have been if we had stayed within the 25:1 'dip'.


