EARS: Novel Approaches Objectives

Problems:

As noted in the Rich Transcription pages, current speech-to-text technologies are still quite poor for some applications, particularly for those requiring the transcription of conversational speech. This may at least partially be explained by the following observations:

From a distance, all current ASR systems are the same; they (1) compute the local spectral envelope, (2) determine the likelihoods of speech sounds having generated this envelope, and then (3) search for the most likely sequence of HMMs. It seems likely that substantial improvements will require breaking out of this mold in at least some respect.
The spectral envelope (or a simple function of it, such as the mel cepstrum) is distorted by many things, e.g., speaking style, head movement, channel, noise, and even small amounts of room reverberation. However, radically different alternatives may be a poor fit to the assumptions of the commonly used statistical models.
Speech recognition systems can be thought of as "half-deaf", given that phonetic recognition is quite poor, particularly for conversational and/or degraded speech. Their success for some applications is due to extensive constraints, such as a restricted domain, a specific speaker, or noise-cancelling microphones. These constraints can sometimes mask the underlying fundamental weakness of the technology.
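
The three-stage structure described in the first observation can be illustrated with a toy sketch. This is not the code of any actual ASR system: the function names (log_spectrum, gaussian_loglik, viterbi), the two-state model, the unit variances, and the synthetic sine-tone "speech sounds" are all assumptions chosen only to make the three steps concrete.

```python
import cmath
import math

def log_spectrum(frame):
    """Step 1: log-power spectrum of one frame (naive DFT, Hamming window)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(math.log(abs(s) ** 2 + 1e-10))
    return spectrum

def gaussian_loglik(feature, mean, var):
    """Step 2: log-likelihood of a feature vector under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(feature, mean, var))

def viterbi(loglik, log_trans, log_init):
    """Step 3: most likely HMM state sequence given per-frame log-likelihoods."""
    n_states = len(log_init)
    score = [log_init[s] + loglik[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(loglik)):
        prev = score
        score, ptr = [], []
        for s in range(n_states):
            # Best predecessor state for arriving in state s at time t.
            best = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + loglik[t][s])
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def tone(bin_index, n=16):
    """Hypothetical 'speech sound': a sine wave centred on one DFT bin."""
    return [math.sin(2 * math.pi * bin_index * i / n) for i in range(n)]

# Two toy sound classes: a low tone (state 0) and a high tone (state 1).
low, high = log_spectrum(tone(2)), log_spectrum(tone(6))
means = [low, high]
variances = [[1.0] * len(low), [1.0] * len(high)]  # assumed unit variances

features = [low, low, high, high]
loglik = [[gaussian_loglik(f, means[s], variances[s]) for s in range(2)]
          for f in features]
stay, switch = math.log(0.9), math.log(0.1)
path = viterbi(loglik, [[stay, switch], [switch, stay]], [math.log(0.5)] * 2)
print(path)
```

Every stage here depends on the log spectrum computed in step 1, so any distortion of the spectral envelope propagates directly into the likelihoods and the search. That dependence is what the second observation points at.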

Solutions:

The EARS Novel Approaches program in general (and the ICSI/SRI/UW/Columbia/IDIAP/OGI project in particular) is directed at overcoming these limitations, using several fundamental principles:

Escape the dependence on the spectral envelope - replace (or augment) it with alternative features, especially probabilistic ones (as opposed to simple transformations of local spectral energy).
Use multiple front ends across time/frequency - in particular, representing local spectral information across longer stretches of time, and other more general functions of the time/frequency plane.
Design optimal combination schemes for these alternative feature streams.
Modify the statistical models to accommodate the new front ends. This may be particularly important for features computed over temporal windows that are much longer than those used in current practice.
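
Two of the principles above, long temporal windows and stream combination, can be sketched in a few lines. The page does not specify a particular feature or combination scheme, so everything here is an assumption for illustration: stack_context collects the trajectory of a single frequency band over many frames, and combine_posteriors uses inverse-entropy weighting, one simple heuristic among many possible combination rules.

```python
import math

def stack_context(frames, band, half_width):
    """Temporal trajectory of one frequency band around each frame.

    frames is a list of per-frame spectral vectors. Edges are padded by
    repeating the first or last frame.
    """
    n = len(frames)
    out = []
    for t in range(n):
        out.append([frames[min(max(t + d, 0), n - 1)][band]
                    for d in range(-half_width, half_width + 1)])
    return out

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def combine_posteriors(streams):
    """Inverse-entropy weighted average of per-stream class posteriors.

    streams holds one posterior vector per feature stream for a single
    frame. Confident (low-entropy) streams receive larger weights.
    """
    weights = [1.0 / (entropy(p) + 1e-6) for p in streams]
    total = sum(weights)
    n_classes = len(streams[0])
    return [sum(w * p[c] for w, p in zip(weights, streams)) / total
            for c in range(n_classes)]

# Toy data: 5 frames, 2 frequency bands (values are made up).
frames = [[float(t), 10.0 * t] for t in range(5)]
trajectories = stack_context(frames, band=0, half_width=2)

# One confident stream and one uninformative stream for the same frame.
combined = combine_posteriors([[0.9, 0.1], [0.5, 0.5]])
print(trajectories[2], combined)
```

With a half width of 2 the window spans only 5 frames. Windows that cover much longer stretches of time would simply use a larger half_width, at the cost of longer, more strongly correlated feature vectors, which is one reason the statistical models may need to change.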
