ICSI Speech FAQ:
5.1 What are features? What are their desirable properties?
Answer by: dpwe - 2000-05-26
The statistical pattern recognition approach to speech recognition
is based on the idea that, even though no two utterances of the
same words are exactly alike, they share distinctive characteristics
that may be learned by a suitably sophisticated pattern classification
system. The simplest form of this is some kind of partitioning of
a multidimensional space into regions corresponding to each possible token
- for instance, the phone units identified by linguists.
Speech is represented in its most basic form as waveform data. However,
it turns out that trying to build a classifier that works on this
representation directly is very difficult. Alternative representations,
derived from the waveform data by a more-or-less simple sequence of
signal processing operations, can give the pattern recognition a much
easier task, leading to a more successful system. These specialized
representations are known as features, and there is considerable research
into the most successful form for them, which depends on the nature
of speech, on any corruption that might attach to the speech, and on the
characteristics of the statistical classifier to be used.
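To make the "sequence of signal processing operations" concrete, here is a minimal sketch of the classic MFCC-style pipeline alluded to below: frame the waveform, window it, take short-time power spectra, pool them with a mel filterbank, take logs, and decorrelate with a DCT. This is an illustration only, not a reference implementation; the frame size, hop, filter count, and function names are illustrative choices, not prescribed by the FAQ.

```python
import numpy as np

def mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    """Convert a mel-scale value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers evenly spaced on the mel scale."""
    edges = inv_mel(np.linspace(mel(0.0), mel(sample_rate / 2.0),
                                n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):          # rising edge of triangle
            fbank[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):          # falling edge of triangle
            fbank[i, k] = (hi - k) / max(hi - mid, 1)
    return fbank

def mfcc(signal, sample_rate=8000, frame_len=256, hop=128,
         n_filters=20, n_ceps=13):
    """Frame, window, power spectrum, mel filterbank, log, DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_filters, frame_len, sample_rate)
    # DCT-II basis: decorrelates log filterbank energies into cepstra
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    feats = np.zeros((n_frames, n_ceps))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        logmel = np.log(fbank @ power + 1e-10)
        feats[t] = dct @ logmel
    return feats

# Example: one second of synthetic signal at 8 kHz stands in for speech
rng = np.random.default_rng(0)
sig = rng.standard_normal(8000)
f = mfcc(sig)
print(f.shape)  # one 13-dimensional feature vector per frame
```

Note how each stage discards variation: the power spectrum drops phase, the filterbank smooths away fine harmonic (pitch) structure, and the log and DCT compress and decorrelate what remains for the classifier.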
Given this role, as the representation of the basic waveform data in
a space that makes statistical classification easier, we can list
some desirable properties for feature sets:
- They should preserve or emphasize information and variation in the speech that is relevant to the phonemic classification task (or whatever other basis is being used for the speech recognition) while minimizing or eliminating variation irrelevant to that task. The classic example is fundamental frequency (pitch or F0), which has no useful bearing on phonemic identity (in English anyway), and which is largely factored out of common feature representations such as MFCC.
- Features should work towards invariance within the classes of the subsequent classification. For instance, if all instances of /ah/ resulted in the same feature value (different from the value for other phones), the classifier could be extremely simple. This is really the same point as above, but note that it applies not just to irrelevant vocal characteristics (pitch, gender, age, etc.) but also to environmental/channel characteristics (channel bandwidth, background noise, reverberation). A lot of work in feature design is the search for representations that are relatively invariant to these factors.
- Feature space should be relatively compact (low-dimensional) to make it easier to learn models from finite amounts of data. The feature space should ideally be uniformly populated with clearly-defined class regions, as opposed to having all the 'action' in a few subspaces, separated by large ranges of feature values with no meaningful interpretation. This allows a sensible use of the dynamic range in finite quantizations of feature and model parameters.
- A feature representation that is generally applicable -- i.e. that can be used without much consideration in most circumstances -- is of course preferable to one that needs to be tuned to particular circumstances, or that needs to be checked carefully to see if it is appropriate to a new task. However, a special-purpose feature representation that supports significantly more accurate classification is of course worthy of extra attention.
- All other things being equal, feature calculation should be computationally inexpensive. Processing delay (i.e. how much of the 'future' of the signal you have to know before you can emit the features) is a significant factor in some settings, such as real-time recognition.
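The invariance point above can be illustrated with cepstral mean normalization (CMN), a standard technique not described in this FAQ entry: a fixed linear channel multiplies the short-time spectrum, so in the log-cepstral domain it adds a constant to every frame, and subtracting the per-utterance mean removes it. A minimal sketch, assuming features shaped (frames, coefficients):

```python
import numpy as np

def cmn(feats):
    """Cepstral mean normalization: subtract each coefficient's
    per-utterance mean, removing any constant (channel) offset."""
    return feats - feats.mean(axis=0, keepdims=True)

# Model a fixed channel as a constant offset added to every frame
rng = np.random.default_rng(1)
clean = rng.standard_normal((61, 13))   # hypothetical cepstral features
channel = rng.standard_normal(13)       # one fixed channel vector
corrupted = clean + channel             # same offset on every frame

# After CMN the two utterances yield identical features
assert np.allclose(cmn(clean), cmn(corrupted))
```

This is exactly the kind of representation-level invariance the second bullet asks for: the classifier never has to model the channel at all.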