As described on the features introduction page, the goal of features is to reduce or eliminate linguistically irrelevant variation in the input signal, to give the classifier an easier job. Normalization, which on this page has the narrow meaning of scaling and shifting a dataset to have a mean of zero and a variance of one, can be a useful part of removing this unwanted variation.
The QuickNet programs qnstrn, qnsfwd, and qncopy support three types of normalization for their input feature files, selected by the ftrX_norm_mode option, which can be "file", "utts" or "online".
norm_mode=file is the default. A norms file (specified by the ftrX_norm_file option) is read, giving offsets and scale factors for each element in the feature vectors. These norms files are generated by running qnnorm, typically over the entire training set. The result is that the feature set presented to the net (or written to the output by qncopy) has zero mean and unit variance globally; however, there is no normalization of variation between different utterances (since the same scale and offset are applied to each one), so this doesn't really help recognition. Its purpose is to ensure that the features are well located within the dynamic range of the quantization used in the fixed-point SPERT implementations of the neural nets.
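A minimal sketch of the idea (not the actual qnnorm implementation or norms-file format): a single per-element offset and scale is computed over the pooled training frames, then the same transform is applied to every utterance.

```python
import numpy as np

def compute_global_norms(utterances):
    """Pool all frames from all utterances and return the per-element
    mean and standard deviation (the contents a norms file would hold)."""
    frames = np.concatenate(utterances, axis=0)  # (total_frames, n_features)
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    std[std == 0.0] = 1.0  # guard against constant feature elements
    return mean, std

def apply_norms(utterance, mean, std):
    """Shift and scale one utterance with the fixed global constants."""
    return (utterance - mean) / std
```

After this, the pooled features have zero mean and unit variance globally, but each individual utterance keeps its own offset from that global mean, which is why this mode does not remove between-utterance variation.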
norm_mode=utts is the so-called per-utterance normalization that we started using, following Cambridge's example, for our Broadcast News work. This recalculates the scale and offset constants for each utterance so that every feature element has exactly zero mean and unit variance within that utterance. When the utterances are long enough to have some uniformity in their spectral balance, but the background conditions are rather variable, this can be a very successful, if simple, normalization scheme; it gave us a 5-10% relative WER reduction on Broadcast News. For short utterances, the possible phonetic imbalance within individual utterances can make this a less good idea. Interestingly, per-utterance normalization helped in the variable-high-noise conditions of the Aurora noisy digits, but hurt very significantly in the clean condition, possibly because the absolute level of the features was informative for that dataset.
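The per-utterance calculation itself is simple; a sketch (illustrative only, not QuickNet's code) is:

```python
import numpy as np

def normalize_utterance(utterance, eps=1e-8):
    """Per-utterance normalization: each feature element ends up with
    zero mean and unit variance within this one utterance, regardless
    of how the utterance sits relative to the rest of the corpus."""
    mean = utterance.mean(axis=0)   # statistics from this utterance alone
    std = utterance.std(axis=0)
    return (utterance - mean) / (std + eps)  # eps guards constant elements
```

Because the statistics come from each utterance alone, any per-utterance channel or level differences are removed, which is exactly the property that helps under variable background conditions and hurts when absolute level is informative.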
On-the-fly calculation within qnstrn and qnsfwd involves reading each utterance twice: once to calculate the within-utterance mean and variance, and a second time to actually read in and normalize the features. This can add a significant load when the features involve other online calculations such as deltas. You can pre-compute the per-utterance normalization with qncopy (which simply applies the same feature pre-processing as qnsfwd and qnstrn, but writes the features out) or with the original pfile_normutts command.
norm_mode=online is the third form of normalization supported by QuickNet, so-called online normalization, where the means and variances of the input feature stream are constantly re-estimated. This has the attractive quality of requiring no look-ahead (and thus imposing no fixed processing delay or two-pass calculation) while still allowing dynamic adaptation to unanticipated feature conditions.
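One common way to realize such running re-estimation is an exponentially forgetting update of the mean and variance; the sketch below illustrates that idea only, and the forgetting factor alpha is a hypothetical parameter, not QuickNet's actual update rule or constants.

```python
import numpy as np

class OnlineNormalizer:
    """Running-statistics normalization: the mean and variance estimates
    are updated on every frame with exponential forgetting, so the
    normalizer adapts to changing conditions with no look-ahead."""

    def __init__(self, n_features, alpha=0.01, eps=1e-8):
        self.mean = np.zeros(n_features)  # running mean estimate
        self.var = np.ones(n_features)    # running variance estimate
        self.alpha = alpha                # forgetting factor (assumed)
        self.eps = eps                    # numerical floor for variance

    def step(self, frame):
        """Update the running statistics, then normalize this frame."""
        self.mean = (1.0 - self.alpha) * self.mean + self.alpha * frame
        diff = frame - self.mean
        self.var = (1.0 - self.alpha) * self.var + self.alpha * diff * diff
        return diff / np.sqrt(self.var + self.eps)
```

Each frame is normalized using only statistics from frames already seen, which is what makes the scheme causal: there is no fixed delay, at the cost of poorly adapted estimates during the first frames of a new condition.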
Generated by build-faq-index on Tue Mar 24 16:18:16 PDT 2009