This project is funded through the Learning and Intelligent Systems (LIS) initiative, beginning in October of 1997. It brings together scientists of highly diverse intellectual backgrounds (brain imaging, auditory physiology, computer science and linguistics) to focus on a common issue, namely how the brain proceeds from acoustic signal processing to understanding spoken language. Because of the highly complex nature of the general problem, the project focuses on the processing of syllabic elements within the speech stream, on the assumption that this linguistic representation is likely to serve as a major interface between sound and meaning.
The project has five components:
(1) Computational Modeling and Learning,
(2) Human Brain Imaging of Speech Processing,
(3) Processing of Speech in the Auditory Cortex,
(4) Human Speech Perception, and
(5) Statistical Models
Steven Greenberg (ICSI - Principal Investigator)
Timothy Roberts (UC-San Francisco)
Christoph Schreiner (UC-San Francisco)
Lokendra Shastri (ICSI - co-Principal Investigator)
CONSULTANTS AND COLLABORATORS
Takayuki Arai (Sophia University, Tokyo)
David Poeppel (University of Maryland)
Beverly Wright (Northwestern University)
Richard Eyraud (UC-San Francisco)
Mark Kvale (UC-San Francisco)
Rosaria Silipo (ICSI)
Shuangyu Chang (ICSI)
Steven Cheung (UC-San Francisco)
Joy Hollenback (ICSI)
COMPUTATIONAL MODELING AND LEARNING (Lokendra
The syllable is playing an increasingly important role in the design of automatic speech recognition systems, providing an intermediate representation capable of binding the phonetic and phonological tiers with the lexical and grammatical levels of spoken language. The syllable's significance makes it imperative that speech recognition systems be capable of reliably detecting and segmenting syllabic entities in the acoustic stream.
L. Shastri and graduate student S. Chang are currently focusing on developing an automatic method for segmenting the acoustic speech stream into syllabic units using the biologically motivated Temporal Flow Model (TFM) of Watrous and Shastri. TFM admits feedforward, lateral, as well as recurrent connections, and supports variable propagation delays along links. Variable delay links and recurrent connections provide a means for smoothing and differentiating the input signal, measuring feature durations in the signal and detecting their onsets. Multiple variable delay links between nodes allow the system to retain context over a specifiable window of time and thus enable the network to perform the spatio-temporal feature detection and pattern matching required for speech processing.
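The role of delay links described above can be illustrated with a minimal sketch. It is not the TFM implementation: it only shows, for a single unit fed by several delayed copies of one input, how the choice of weights yields smoothing or differentiation, and how onsets and durations can then be read off. The envelope, the delays and the weights are hypothetical values chosen for illustration.

```python
import numpy as np

def delayed(x, d):
    """Return x delayed by d samples, zero-padded at the start."""
    if d == 0:
        return x.copy()
    out = np.zeros_like(x)
    out[d:] = x[:-d]
    return out

def delay_unit(x, delays, weights):
    """Weighted sum of delayed copies of x (one unit fed by several delay links)."""
    return sum(w * delayed(x, d) for d, w in zip(delays, weights))

# Hypothetical envelope: 5 silent frames, a 10-frame "syllable", 5 silent frames.
envelope = np.concatenate([np.zeros(5), np.ones(10), np.zeros(5)])

# Equal positive weights over delays 0..2 act as a moving average (smoothing).
smoothed = delay_unit(envelope, delays=[0, 1, 2], weights=[1 / 3, 1 / 3, 1 / 3])

# Opposite-signed weights on delays 0 and 1 act as a first difference,
# which peaks at the onset and dips at the offset.
difference = delay_unit(envelope, delays=[0, 1], weights=[1.0, -1.0])

onset_frame = int(np.argmax(difference))
offset_frame = int(np.argmin(difference))
duration = offset_frame - onset_frame  # feature duration in frames
```

In a trained network the weights and delays would be learned rather than set by hand; the sketch only shows why such links make these operations available.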
Network models have been trained using biologically motivated representations of the acoustic signal such as modulation-spectral features [Greenberg and Kingsbury, 1997] and waveform envelope acceleration and velocity features recently developed by R. Silipo and S. Greenberg.
Two distinct TFM network configurations are being investigated, one with tonotopic connectivity, the other with global connectivity. With respect to modulation-spectral features, both networks receive inputs corresponding to a single critical-band-like (1/4 octave) channel and both contain two hidden layers (H1, H2). However, the networks differ with respect to (a) how the input layer is connected to H1 and (b) the lateral connections within H1. In the global configuration all input nodes are connected to all H1 nodes and all H1 nodes are densely connected to each other via lateral links. In the tonotopic configuration H1 nodes are divided into five distinct groups, each receiving activation from three adjacent input nodes (i.e., channels). Nodes within a group are densely connected but nodes across groups have sparse connections. In both configurations, H1 nodes are fully connected to H2 nodes which, in turn, project to the output node.
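The difference between the two configurations can be sketched as boolean connectivity masks. Several details below are assumptions, since the text does not state them: the channel triples are taken to be non-overlapping (giving 15 input channels), each H1 group is given two nodes, self-links are omitted, and the sparse cross-group connections are represented by a single link between neighbouring groups.

```python
import numpy as np

N_GROUPS = 5       # from the text: five H1 groups
CH_PER_GROUP = 3   # from the text: three adjacent input channels per group
H1_PER_GROUP = 2   # assumption: group size is not given in the text

n_in = N_GROUPS * CH_PER_GROUP   # 15, assuming non-overlapping channel triples
n_h1 = N_GROUPS * H1_PER_GROUP

def global_masks():
    """All inputs reach all H1 nodes; H1 nodes are laterally all-to-all."""
    in_to_h1 = np.ones((n_h1, n_in), dtype=bool)
    lateral = ~np.eye(n_h1, dtype=bool)  # no self-links in this sketch
    return in_to_h1, lateral

def tonotopic_masks():
    """Each H1 group sees three adjacent channels; lateral links are dense
    within a group and sparse across groups."""
    in_to_h1 = np.zeros((n_h1, n_in), dtype=bool)
    lateral = np.zeros((n_h1, n_h1), dtype=bool)
    group_of = np.arange(n_h1) // H1_PER_GROUP
    for g in range(N_GROUPS):
        rows = group_of == g
        in_to_h1[rows, g * CH_PER_GROUP:(g + 1) * CH_PER_GROUP] = True
        lateral[np.ix_(rows, rows)] = True
    np.fill_diagonal(lateral, False)
    # Assumed form of sparsity: one bidirectional link between adjacent groups.
    for g in range(N_GROUPS - 1):
        a = (g + 1) * H1_PER_GROUP - 1   # last node of group g
        b = (g + 1) * H1_PER_GROUP       # first node of group g + 1
        lateral[a, b] = lateral[b, a] = True
    return in_to_h1, lateral

g_in, g_lat = global_masks()
t_in, t_lat = tonotopic_masks()
```

Multiplying a weight matrix element-wise by such a mask is one common way to restrict a layer to a chosen connectivity pattern; whether the project's networks were built this way is not stated.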
Both networks have been trained on a phonetically transcribed corpus (OGI Numbers95) consisting of telephone and address numbers spoken over the telephone by several hundred individuals (of variable dialect, age and gender). The target for each syllabic segment was a Gaussian function spanning the segment. A two-level, dynamic thresholding method was used for interpreting the network outputs. The computational role of hidden nodes, recurrent links, and multiple delayed links in extracting syllabic features was also investigated and the resulting observations confirmed the networks' ability to integrate across time and perform hierarchical feature extraction. Several TFM networks were also trained to recognize individual digits from the acoustic stream, using RASTA features (Hermansky and Morgan, 1994) as the spectral input representation (Greenberg and Kingsbury, 1997).
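A training target of the kind described above, a Gaussian spanning each syllabic segment, might be constructed as follows. This is a sketch under stated assumptions: the segment boundaries are hypothetical frame indices, and the width of each Gaussian (segment length divided by six, so that roughly three standard deviations fall on either side of the centre) is an illustrative choice, not a value taken from the project.

```python
import numpy as np

def gaussian_targets(n_frames, segments, widths_per_segment=6.0):
    """Build a target track with one Gaussian bump per (start, end) segment.

    `end` is exclusive. Each bump peaks at 1.0 in the middle of its segment.
    """
    target = np.zeros(n_frames)
    t = np.arange(n_frames)
    for start, end in segments:
        center = (start + end - 1) / 2.0
        sigma = (end - start) / widths_per_segment  # assumed width rule
        bump = np.exp(-0.5 * ((t - center) / sigma) ** 2)
        in_segment = (t >= start) & (t < end)
        target[in_segment] = bump[in_segment]
    return target

# Hypothetical two-syllable utterance: frames 0-20 and 21-51.
segments = [(0, 21), (21, 52)]
target = gaussian_targets(52, segments)
```

With a target of this shape, the network output is expected to rise towards the middle of a syllable and fall towards its edges, which is what makes a thresholding step on the output usable for locating boundaries.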
Significant progress has been made over the last year in the development of Shruti, a model of rapid symbolic processing and reflexive reasoning based on notions of temporal synchrony. The extended model can (i) dynamically instantiate entities during reflexive processing, (ii) dynamically unify multiple entities by synchronizing the firing of nodes corresponding to these entities, if the context suggests that these entities might be the same, and (iii) allow the simultaneous propagation of activation in the forward as well as the backward direction.
Progress has also been made in work on Smriti, a model of one-shot learning which accounts for how long-term potentiation can lead to the rapid formation of circuits responsive to bindings and binding errors. The model demonstrates how such circuits can be formed rapidly upon the presentation of a rhythmic pattern of activity wherein bindings are expressed via synchronous activity of cell-clusters. Over the past year, the signal-to-noise ratio of regions responsive to bindings and binding errors has been examined.
HUMAN BRAIN IMAGING OF SPEECH PROCESSING (David Poeppel, Tim Roberts, Steven Greenberg)
This component focuses on the investigation of the temporal encoding of information as a potentially new dimension for functional (behavioral) assessment of human listeners. The research has focused on the observation that the precise latency of the auditory evoked neuromagnetic field component, the M100, varies with stimulus attributes, particularly frequency (Roberts and Poeppel, 1996). Research over the past year has investigated the influence of spectral characteristics of complex signals (such as amplitude-modulated sinusoids, pulse trains and synthetic vowels) on the latency of the M100 component of the MEG.
Auditory theory has traditionally pitted 'place' (the tonotopically organized spatial pattern of excitation) versus 'time' (the temporal pattern of discharge) with respect to the neural representation underlying specific attributes of acoustic sensation. A potential resolution of this historical opposition is proposed, in which place and time are viewed as flip sides of a complex representational matrix of neural activity, bound together through the mechanics of the cochlear traveling wave and its expansion at the level of the auditory cortex (Greenberg et al., 1998).
The apical component of the cochlear traveling wave serves to format the peripheral spatio-temporal response pattern in a manner germane to frequency analysis and the perception of pitch. The cochlear delay (dc) conforms to the equation:
dc = p + tc (1)
where tc is a transmission time constant of 2 ms and p is the period (in ms) of the resonant or fundamental frequency. Thus, the cochlear latency for a 1-kHz signal is 3 ms and that of a 100-Hz signal 12 ms, representing a latency differential of 9 ms. This spatial-latency representation is enhanced at the level of the auditory cortex to provide a set of robust cues for both pitch and timbre that effectively segregates neural activity into discrete loci on the basis of both place and time.
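The arithmetic in equation (1) can be checked with a few lines of code. The function name is illustrative; the only inputs are the 2-ms transmission constant and the period of the signal, as defined above.

```python
def cochlear_delay_ms(frequency_hz, tc_ms=2.0):
    """Cochlear delay dc = p + tc, with p the period in ms (equation 1)."""
    period_ms = 1000.0 / frequency_hz
    return period_ms + tc_ms

delay_1khz = cochlear_delay_ms(1000.0)    # 1 ms period + 2 ms = 3 ms
delay_100hz = cochlear_delay_ms(100.0)    # 10 ms period + 2 ms = 12 ms
differential = delay_100hz - delay_1khz   # 9 ms
```

Because tc is constant, the latency differential between any two frequencies equals the difference between their periods.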
This latency-based representation can account for many perceptual properties of pitch and timbre that cannot easily be accommodated within the traditional theoretical framework of frequency analysis and pitch.
PROCESSING OF SPEECH IN THE AUDITORY CORTEX (Christoph
The responses of neurons in the primary (AI) and posterior (P) auditory cortical fields of the anesthetized squirrel monkey are being investigated with respect to sequences of four distinct stop-consonant syllables ([pa],[ti],[ku],[pi]). Each CV sequence was naturally spoken at a constant rate. Sequences of faster and slower speaking rates were created by changing the interval between adjacent syllables and by shortening or expanding the duration of the vocalic segment (accomplished by deleting or inserting an integral number of glottal periods in the mid-section of the vowel nucleus). Using this method it has been possible to ensure that CV transitions and the endpoint of each vowel remain the same for all conditions. Six speaking rates have been used, ranging between 3.5 and 8.2 Hz. Signals were monotically presented at a sound pressure level of 50 dB.
The responses of auditory cortical neurons have been studied in three animals to date. Preliminary analyses indicate that responses in primary auditory cortex (AI) are evoked primarily by consonantal segments while responses in the posterior field are dominated by the responses to the vocalic portions of the syllable. The results also show strong effects of speaker rate on response strength and response latencies for the second, third and fourth syllables in the stimulus sequence. In AI the strongest responses were observed for the slowest speaking rate and the weakest responses for the fastest rate of speaking. In contrast, the posterior field manifests about equally strong responses across all six speaking rates. The data suggest a robust coding of vocalic segments in the posterior field, largely independent of speaking rate (at least over the natural range). In contrast, AI responses to consonantal segments were strongly reduced for the faster speaking rates.
A preliminary conclusion of the studies conducted so far is that different segmental components of syllables may be optimally coded in different cortical fields. To test this hypothesis, we intend to conduct comparable experiments in the anterior auditory field of squirrel monkeys and to test the effect of speaking rate as a function of stimulus intensity and signal-to-noise ratio.
HUMAN SPEECH PERCEPTION (Steven
Two separate perceptual studies have recently been completed and others are in the process of being conducted.
STATISTICAL PROPERTIES OF PRONUNCIATION VARIATION (Steven
Current-generation automatic speech recognition (ASR) systems model spoken discourse as a linear sequence of words and phones. Because it is unusual for every phone within a word to be pronounced in a standard ('canonical') way, ASR systems often depend on a multi-pronunciation lexicon to match an acoustic sequence with a lexical unit. Since there are, in practice, many different ways for a word to be pronounced, this standard approach adds a layer of complexity and ambiguity to the decoding process which, if modified, could potentially improve recognition performance. Analysis of pronunciation variation in a corpus of spontaneous English discourse (Switchboard) demonstrates that the variation observed is systematic at the level of the syllable (Greenberg, 1997; 1998).
A NATO Advanced Study Institute (held in July 1998 at Il Ciocco in Italy) was organized on the topic of 'Computational Hearing.' This ASI brought together graduate students, post-docs, and junior and senior researchers to examine auditory processing of speech and other complex signals from a computational perspective. The ASI was attended by 103 individuals (about 40 of whom were from the U.S., the remainder from Europe, Canada, Japan, Turkey and Israel). Distinguished senior researchers lectured on various topics germane to auditory processing, physiology and anatomy. The students participated by presenting posters and engaging the faculty in intensive interactions and discussion. A proceedings volume and a course reader were produced. Two additional books based on the ASI are due to be published within the next year ('Computational Models of Auditory Function,' IOS Press, which will be published in June 1999). http://www.icsi.berkeley.edu/real/comhear98.
PUBLICATIONS GERMANE TO THIS PROJECT
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal Constraints on Speech Intelligibility as Deduced from Exceedingly Sparse Spectral Representations, Proceedings of Eurospeech 1999, Budapest, in press.
Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating Contextual Phonetics Into Automatic Speech Recognition. Invited paper for Plenary Session "The Phonetics of Spontaneous Speech," ICPhS-99, San Francisco, CA, August 1999. To appear.
Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable Detection and Segmentation Using Temporal Flow Neural Networks. Proceedings of the Fourteenth International Congress of Phonetic Sciences, San Francisco, August 1999.
Silipo, R. and Greenberg, S. (1999) Automatic Transcription of Prosodic Stress for Spontaneous English Discourse. "The Phonetics of Spontaneous Speech," ICPhS-99, San Francisco, CA, August 1999.
Arai, T. and Greenberg, S. (1997) The temporal properties of spoken Japanese are similar to those of English, Proceedings of Eurospeech, Rhodes, Greece, pp. 1011-1014.
Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
Greenberg, S. (1997) Auditory function, in Encyclopedia of Acoustics, M. Crocker, editor. New York: John Wiley, pp. 1301-1323.
Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
Greenberg, S. (1997) The significance of the cochlear traveling wave for theories of frequency analysis and pitch, in Diversity of Auditory Mechanics, E. R. Lewis and C. Steele, eds. Singapore: World Scientific Press, in press.
Greenberg, S. (1997) The Switchboard Transcription Project in Research Report #24, 1996 Large Vocabulary Continuous Speech Recognition Summer Research Workshop Technical Report Series. Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD (56 pp.).
Greenberg, S. (1998) A syllable-centric framework for the evolution of spoken language. Commentary on MacNeilage, P. The frame/content theory of evolution of speech production. Behavioral and Brain Sciences, 21, 518.
Greenberg, S. (1998) In search of the Unicorn: Where is the invariance in speech? Commentary on Sussman, H., Fruchter, D., Hilbert, J. and Sirosh, J. Linear correlates in the speech signal: the orderly output constraint. Behavioral and Brain Sciences, 21, 267-268.
Greenberg, S. (1998) Recognition in a new key - Towards a science of spoken language, in ICASSP98, International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 1041-1045.
Greenberg, S. (1998) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation, Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Kerkrade (Netherlands), pp. 47-56.
Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
Greenberg, S. and Kingsbury, B. (1997) The modulation spectrogram: In pursuit of an invariant representation of speech, in ICASSP-97, IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, pp. 1647-1650.
Greenberg, S. and Shire, M. (1997) Temporal factors in speech perception, in CSRE-based Teaching Modules for Courses in Speech and Hearing Sciences. London, Ontario: AVAAZ Innovations, pp. 91-106.
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Greenberg, S., Poeppel, D. and Roberts, T. (1998) A space-time theory of pitch and timbre based on cortical expansion of the cochlear traveling wave delay, in Psychophysical and Physiological Advances in Hearing, A. Palmer, Q. Summerfield, A. Rees, R. Meddis (eds.) London: Whurr Publishers, pp. 293-300.
Greenberg, S. and Slaney, M. (editors) (1998) Proceedings of the NATO Advanced Study Institute on Computational Hearing, Il Ciocco (Italy).
Hermansky, H. and Morgan, N. (1994) RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2, 578-589.
Kingsbury, B., Morgan, N. and Greenberg, S. (1997) Improving ASR performance for reverberant speech, in Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 87-90.
Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Robust speech recognition using the modulation spectrogram, Speech Communication, 25, 117-132.
Mandziuk, J. and Shastri, L. (1998) Incremental class learning approach and its application to handwritten digit recognition, Proceedings of the Fifth International Conference on Neural Information Processing (ICONIP'98), Kitakyushu, Japan, October 1998.
Roberts, T. P. and Poeppel, D. (1996) Latency of auditory evoked M100 as a function of tone frequency. Neuroreport, 7, 1138-1140.
Roberts, T.P.L., Poeppel, D. and Rowley, H. A. (1998) A quantitative comparison of FMRI and MEG during phonetic and pitch discrimination with speech sounds, Society for Neuroscience,
Roberts, T.P.L., Ferrari, P. and Poeppel, D. (1998) Latency of evoked neuromagnetic M100 reflects perceptual and acoustic stimulus attributes, Neuroreport, 9, 3265-3269.
Roberts, T.P.L., Poeppel, D. and Rowley, H. A. (1998) Magnetoencephalography and functional MRI: A quantitative study of speech perception, Proc. IEEE Eng. Med. & Biol. Soc., 20(4), 2120-2123.
Roberts, T.P.L., Stufflebeam, S.M., Rowley, H. A. and Poeppel, D. (1998) Auditory evoked neuro-magnetic fields: what modulates the M100 latency?, Proc. Internat. Meeting on Biomagnetism.
Schreiner, C. E. and Wong, S. W. (1998) Context-dependent excitation bandwidth changes in cat auditory cortex, Proceedings of the 5th International Conference on Neural Information Processing, vol. 1.
Shastri, L. (1998a) Advances in Shruti - A neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony, Applied Intelligence. In Press.
Shastri, L. (1998b) Recruitment of binding and binding-error detector circuits via long-term potentiation. Neurocomputing. In Press.
Shastri, L. and Wendelken, C. (1999) Soft Computing in Shruti - A neurally plausible model of reflexive reasoning and relational information processing. Proceedings of the Third International Symposium on Soft Computing, SOCO'99, Genova, Italy, June 1999.
Wu, S-L., Shire, M., Greenberg, S. and Morgan, N. (1997) Integrating syllable boundary information into speech recognition, in ICASSP-97, IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, pp. 987-990.
Wu, S.-L., Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Incorporating information from syllable-length time scales into automatic speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 721-724.
Wu, S.-L., Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Performance improvements through combining phone- and syllable-length information in automatic speech recognition, Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 854-857.
1. Discusses the significance of spatio-temporal neural networks.
2. Describes the "Temporal flow model" used in project related research.
3. Describes how target functions are specified for training temporal flow models.
Describes a connectionist model that uses temporal synchrony for encoding and propagating dynamic bindings within structured networks. The model supports high-level cognitive behavior such as reasoning and planning.
Describes work that models the rapid formation of binding and binding-error detector circuits.
Contains a bibliography of recent publications of Dr. Greenberg. Most of the articles are downloadable in electronic form.
Provides a detailed description of the phonetic properties (transcription) of spontaneous speech (Switchboard corpus).
Contributions of this project:
CONTRIBUTIONS WITHIN DISCIPLINE:
Very little is currently known about the specific auditory and higher cortical mechanisms underlying the perception of spoken language. The current project is beginning to fill in some of the gaps in our knowledge pertaining to the specific neural processes associated with listening to speech by using a multi-disciplinary approach (brain imaging, neurophysiology, computational modeling, perception and statistics) to better characterize the means by which listeners decode speech. Most current theories focus on specific sound segments (called phones) as the basic building block for speech understanding. In contrast, the current project focuses on syllables, which appear to have a more direct relation to neurological processing than phonetic segments.
The computational models being developed by Dr. Shastri are unique in the complexity and sophistication that they bring to the analysis of spoken language. It is hoped that these neural network models will eventually enable segmentation and phonetic labeling of speech to be done automatically. This will help in developing more sophisticated tools for automatic speech recognition and in helping children learn to read (particularly those who currently experience difficulty in so doing). The computational models being developed by Dr. Shastri are also expected to shed light on how temporal synchrony - at varying temporal resolutions - might be utilized by distributed brain circuits to solve the binding problem across multiple representational tiers.
The MEG studies directed by Dr. Roberts promise to yield exciting new information about the time course and locus of brain processing germane to speech perception. Currently, there is relatively little data germane to MEG correlates of speech.
Dr. Schreiner's study on single-unit correlates of spoken language is also likely to yield valuable new insights since there are virtually no physiological data, as of yet, on this important topic. His data are likely to complement those of Dr. Roberts since they are investigating comparable anatomical regions (albeit in different species), but at different levels of spatial and temporal granularity.
Dr. Greenberg's investigations of the perceptual processes underlying spoken language are likely to contribute new information regarding the importance of the modulation spectrum to speech understanding. His studies have already demonstrated that classical models of speech processing, based on detailed spectro-temporal characterization of the acoustic signal, cannot account for the intelligibility of sentential material. The studies raise the possibility that a model based on syllabic units is more likely to account for how listeners actually decode the speech signal.
Dr. Greenberg's statistical studies of spoken language are also helping to provide an empirical foundation to what has previously been largely a speculative field. His phonetic transcription of the Switchboard corpus has already been used by many speech recognition researchers and is likely to have a significant theoretical impact on linguistic and speech research as a whole.
CONTRIBUTIONS TO OTHER DISCIPLINES:
Automatic speech recognition researchers have traditionally used knowledge from linguistics to develop computational models for spoken language. Although the models work reasonably well for carefully spoken materials, they fail when confronted with spontaneous speech typical of everyday discourse. The statistical properties of spoken language provided by the current project have already begun to have an impact on how speech recognition systems are trained and developed. It is anticipated that this impact will grow over the next several years.
The studies conducted by Drs. Schreiner and Roberts are likely to help clinicians deal with children and adults experiencing problems understanding spoken language. Their studies are also likely to contribute to a deeper understanding of the time course of cognitive processes in the brain germane to speech understanding.
CONTRIBUTIONS TO EDUCATION AND HUMAN RESOURCES:
Understanding the cognitive and neural processes underlying the perception of spoken language will inevitably help to foster a more effective educational environment than currently exists. It will help in teaching young children to read faster and more effectively since it will enable teachers to provide a more systematic and accurate description of the relation between the written word and spoken language.
CONTRIBUTIONS TO RESOURCES FOR SCIENCE AND TECHNOLOGY:
This project has already made a significant impact on the design and development of automatic speech recognition systems. The statistical properties of spoken language, garnered through analysis of the phonetic transcription of the Switchboard corpus, have enabled speech engineers to develop more accurate and reliable models of spoken language. Some of this material has been posted on the World Wide Web (http://www.icsi.berkeley.edu/~stp) and has been used by many research sites around the world.
It is anticipated that the project will continue to make significant contributions to speech technology through development of new algorithms for automatic analysis of spoken language as well as potentially providing diagnostic tools for assessing the ability to comprehend spoken language through neural imaging techniques.
CONTRIBUTIONS BEYOND SCIENCE AND ENGINEERING:
The future of computing is 'speech.' This often-cited quotation by a major industry figure is likely to be realized within the next decade. The current project is likely to contribute to the public welfare by making the ability to communicate with digital devices via spoken language both practical and reliable. Current speech recognition programs (e.g., Naturally Speaking, ViaVoice) work well only under highly constrained speaking conditions, with little or no background noise. For digital dialogues to become commonplace, a far better understanding of spoken language will be required, particularly for speech spoken under noisy and reverberant conditions and under highly informal circumstances. Some of the studies in the current project will contribute to this understanding by providing information about the sort of neural and linguistic representations important for understanding spoken language. This knowledge will be useful for developing algorithms that enable machines to interact with humans using voice.