Research in the Speech Group at ICSI

The major application area researched in the Speech Group at ICSI is speech recognition, although some of this work has led to basic research in auditory processing. Overall, the work has had three major foci:

Auditory-inspired signal processing: In this work we focus on signal processing transformations that provide increased robustness (particularly for speech recognition) in the presence of acoustic interference such as noise, reverberation, or modified channel frequency response. We tend toward an interest in functional models of human hearing, but maintain an engineering perspective (that is, we are willing to use decidedly non-human approaches if they work). In this category, we have developed RASTA approaches, most notably RASTA-PLP. In this style of processing, simple models of forward temporal masking are used to accentuate the components of spectral trajectories that have speech-like temporal character, and to discriminate against sources of variance that change at a different rate (typically slower). The more recent form of this approach, J-RASTA, provides robustness to both channel spectral mismatch (between training and test) and additive noise. Current work in this area focuses on using other auditory concepts to provide greater robustness to the time smearing that occurs with reverberation, and more generally on building better models of temporal processing as it relates to speech intelligibility under realistic conditions, for both human listeners and machines.
Statistical modeling: Since 1988 we have worked in the area of hybrid connectionist/HMM systems. Recently we have been focusing on transition-based models of speech. Some of this work centers on constraining the statistical models to have some of the temporal properties suggested by work in the previous research category, particularly for the model we have called SPAM (Stochastic Perceptual Auditory-event-based Models). In other work we have developed a recursive algorithm for training transition-based systems that in theory maximizes the posterior probability of the correct model sequence; this approach is called REMAP, which stands for Recursive Estimation and Maximization of A Posteriori Probabilities.
Models of phonology and language for speech understanding: We have also been working on basic questions in language modeling for speech recognition at several levels, including automatic learning of stochastic pronunciation models, and the incorporation of both structural grammars and simple statistical N-gram grammars in speech understanding systems. To this end we have built an interactive query system called BeRP (the Berkeley Restaurant Project).
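
As a rough illustration of the RASTA idea described under auditory-inspired signal processing above, the sketch below band-pass filters the time trajectory of each log-spectral band. It is a minimal sketch rather than our RASTA-PLP implementation: the function name rasta_filter, the 100 frames-per-second rate in the demo, and the filter coefficients (a commonly cited form of the RASTA filter, with a 5-point slope numerator and a single pole at 0.98) are assumptions not stated in this page. Because a fixed channel response becomes an additive constant in the log-spectral domain, a filter with zero gain at DC suppresses it while passing modulations at speech-like rates.

```python
import numpy as np


def rasta_filter(log_spec, pole=0.98):
    """Band-pass filter each band's log-energy trajectory over time.

    log_spec: array of shape (n_frames, n_bands) holding log spectral energies.
    pole: assumed pole location; 0.98 is a commonly cited value, not taken
          from this page.
    """
    # FIR part (assumed coefficients): a 5-point slope estimate. The taps
    # sum to zero, so any constant offset in a band is eventually removed.
    numer = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])

    x = np.asarray(log_spec, dtype=float)
    n_frames, n_bands = x.shape

    # Zero-pad the start so the causal FIR window is defined from frame 0.
    padded = np.vstack([np.zeros((len(numer) - 1, n_bands)), x])

    out = np.zeros_like(x)
    prev = np.zeros(n_bands)
    for t in range(n_frames):
        # Rows are x[t], x[t-1], ..., x[t-4] after reversing the window.
        window = padded[t:t + len(numer)][::-1]
        fir = numer @ window
        # One-pole recursive part: y[t] = fir[t] + pole * y[t-1].
        prev = fir + pole * prev
        out[t] = prev
    return out


if __name__ == "__main__":
    # Hypothetical demo: a 4 Hz modulation at an assumed 100 frames/second,
    # with and without a constant log-domain offset standing in for a
    # fixed channel mismatch.
    t = np.arange(1000) / 100.0
    speech_like = np.sin(2 * np.pi * 4.0 * t)[:, None]
    clean = rasta_filter(speech_like)
    shifted = rasta_filter(speech_like + 3.0)
    print("late-frame residual:", np.max(np.abs(clean[-100:] - shifted[-100:])))
```

Once the initial transient has decayed, the two filtered trajectories in the demo should agree closely, since the constant offset lies at zero modulation frequency where the filter has no gain. This sketch covers only the filtering step; it does not include the PLP analysis or the J-RASTA handling of additive noise.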

There are a number of other areas of research that do not fit neatly into any of these categories. For instance, we have been looking at the sources of degradation in the machine recognition of rapid speech. We are interested in detecting accent and incorporating information about it into our recognition systems. We are researching the interaction between the statistical learning of acoustic and language models. In the last year or two, we have also increased our focus on the task of recognizing natural and impromptu meetings, with subtasks of online and offline transcription, far-field microphone recognition (for impromptu placement as one might have with a PDA or digital recorder), and modeling of the conversations at many levels.

Nelson Morgan - May 25, 2001