# Computational Hearing

### July 3 Anatomy and Physiology of the Auditory Pathway

N.G. Bibikov - The modelling of discharge patterns in auditory nerve fibers and second-order auditory units.

Simulation of discharge patterns in nerve cells can open a new perspective on the understanding of auditory processing. A PC-based system for modelling first- and second-order auditory units has been created. The peripheral layer of our system is represented by a bank of linear filters whose impulse responses simulate the impulse response of the basilar membrane. The hair-cell models include a rectification (depolarization/hyperpolarization ratio of 3:1) and an integration with a time constant of 0.2 ms. Gaussian noise simulates random fluctuations of the neuron's threshold and membrane potential. We omitted modelling the unproven phenomenon of hair-cell transmitter depletion. Special attention was paid to modelling postspike changes in spiral ganglion cells. The hazard function of spontaneous activity in a typical auditory fiber could be properly reproduced in our model only when two components of refractoriness, with time constants of 1 ms and 20 ms, were employed. If both of these components depended only on the time interval since the single preceding spike, the poststimulus time histogram of the response to a tone burst did not show any realistic adaptation behavior. However, if the long-term refractoriness accumulated over successive spikes, a realistic PSTH shape could be obtained. Models with different parameters, as well as with an added inhibitory input, were used to reproduce the discharge patterns of second-order auditory units. The dynamic properties of each model were described by PST histograms, phase histograms and cross-covariance functions (CCF) between the recorded discharge sequence and the modulation waveform. Using the same methods of acquisition and processing, we compared discharge patterns for different variants of the model with real spike discharge sequences recorded in second-order auditory units of the frog.
The model demonstrates the important role of delayed lateral inhibition for effective amplitude modulation encoding in cochlear nucleus neurons.
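As an illustration (not the authors' implementation), the two-component refractoriness mechanism can be sketched as a threshold-crossing model in which a fast component is reset by the last spike only, while a slow component accumulates over successive spikes. All parameter values below are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_unit(drive, dt=1e-4, tau_fast=1e-3, tau_slow=20e-3,
                  theta0=1.0, a_fast=5.0, a_slow=0.5, noise_sd=0.3):
    """Threshold-crossing model with two refractory components.

    The fast component is reset by each spike (it depends only on the
    previous spike), while the slow component accumulates over
    successive spikes -- the feature that produces realistic PSTH
    adaptation.  Parameter values are illustrative, not fitted ones.
    """
    r_fast, r_slow = 0.0, 0.0
    spikes = []
    for i, v in enumerate(drive):
        theta = theta0 + a_fast * r_fast + a_slow * r_slow
        if v + noise_sd * rng.standard_normal() > theta:
            spikes.append(i * dt)
            r_fast = 1.0        # reset: depends on one previous spike only
            r_slow += 1.0       # accumulates, giving long-term adaptation
        r_fast *= np.exp(-dt / tau_fast)
        r_slow *= np.exp(-dt / tau_slow)
    return spikes

# Response to a 200-ms "tone burst" of constant suprathreshold drive:
spikes = simulate_unit(np.full(2000, 1.5))
```

Because the slow component accumulates, the simulated discharge rate to a sustained drive falls from an initial peak to a lower steady-state value, the adaptation behavior described above.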

K. Davis and E. Young - Interneuronal circuitry in an auditory nucleus: New insights from pharmacological manipulations.

Immunocytochemical, anatomical, and physiological evidence suggests that dorsal cochlear nucleus (DCN) principal cells, fusiform and giant cells, receive significant inhibitory input. Often, these cells display highly nonmonotonic input-output functions; in type IV units, as stimulus level at best frequency (BF) increases, discharge rate first increases above then decreases below spontaneous rate (SR). Pharmacological studies in anesthetized preparations have shown that most of the stimulus-evoked inhibition in principal cell responses is blocked by strychnine (Caspary et al. 1987; Evans and Zhao 1993), suggesting a predominance of glycinergic input; blocking GABAA-mediated inhibition with bicuculline typically only increased a unit's SR. Recently, however, shock-evoked GABAergic post-synaptic potentials have been described in fusiform and cartwheel cells (Golding and Oertel 1996). Moreover, cartwheel cells are now known to respond to sound (Parham and Kim 1995; Davis and Young 1997) and contact both principal cell types (Golding and Oertel 1997); previously, cartwheel cells were not thought to affect DCN principal cell responses to sound. The goal of this study is to reassess the roles of glycinergic and GABAergic inhibition in the DCN in light of the current understanding of DCN circuitry.

Here, we report on the effects of iontophoretic application of strychnine and bicuculline on the responses of principal cells and interneurons in decerebrate cat. Consistent with previous studies, strychnine eliminates the central inhibitory area in type IV units (n=22) resulting in monotonic BF rate-level curves. Unexpectedly, bicuculline primarily lowers the threshold of on-BF inhibition and thereby enhances inhibition in type IV units (n=12); SR is typically unaffected. Similar contrasting effects of strychnine and bicuculline are observed in other principal cell response types (type IV-T and type III units); the effect of bicuculline is particularly pronounced in these units, converting even monotonic type III responses into nonmonotonic (type IV) rate-level curves (n=6). The enhancement of on-BF inhibition by bicuculline suggests a disinhibitory process involving GABAA action (perhaps from stellate cells) on a non-GABAA-ergic inhibitory pathway. This pathway could be glycinergic: it could involve deep-layer type II units (vertical cells) or it could involve superficial complex-spiking neurons (cartwheel cells), as both of these cell types are disinhibited by bicuculline. Taken together, the results suggest that glycine directly, and GABAA indirectly, mediates the on-BF inhibition of DCN principal cells. [Work supported by NIDCD grants DC00979 and DC00023]

E.F. Evans - Modelling Characteristics of Onset-I Cells in Guinea Pig Cochlear Nucleus

This model, implemented in LabVIEW, was designed to test two rival mechanisms proposed to account for the nature of onset-I cells: coincidence detection versus depolarisation block. It was found necessary to base the model primarily on coincidence detection, utilizing short-duration EPSPs (1-2 ms) and high-threshold detection of converging inputs from 10 cochlear nerve afferent model fibres spanning a wide range of CFs, plus a small degree of "depolarisation block". This could then account satisfactorily for a wide range of onset-I characteristics, including greater sensitivity to noise than to CF tone stimuli, and the strong contrast between periodicity detection of the fundamentals of cosine-phase and random-phase mixed harmonics.
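A minimal sketch of the coincidence-detection mechanism (the EPSP decay, threshold, and input statistics below are illustrative assumptions, not the fitted LabVIEW model):

```python
import numpy as np

rng = np.random.default_rng(1)

def onset_coincidence(fiber_spikes, dt=1e-4, tau_epsp=1e-3, threshold=6.0):
    """Sum brief (~1 ms) EPSPs from all input fibres and fire only when
    enough inputs coincide.  Parameter values are illustrative."""
    decay = np.exp(-dt / tau_epsp)
    v, out = 0.0, []
    for t in range(fiber_spikes.shape[1]):
        v = v * decay + fiber_spikes[:, t].sum()
        if v > threshold:
            out.append(t)
            v = 0.0             # crude reset after an output spike
    return out

# 10 model fibres: perfectly synchronized at stimulus onset (t = 0),
# then firing sparsely and independently thereafter.
fibers = np.zeros((10, 500))
fibers[:, 0] = 1.0
fibers[:, 1:] = rng.random((10, 499)) < 0.02
out = onset_coincidence(fibers)
```

With a high threshold relative to a single EPSP, the unit responds to the synchronized onset but almost never to the desynchronized sustained activity.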

I.C. Gebeshuber, F. Rattay - Modeling the Human Hearing Threshold Curve for Pure Tones: The Effects of Stereociliary Brownian Motion, Endogenous Transduction Channel Noise, Stochasticity in Neurotransmitter Release and Innervation Density in Various Frequency Bands

See page 7 of the ASI Proceedings.

Lisa C. Gresham and Leslie M. Collins - Signal detection theory analysis of two computational auditory models

Traditional methods of analyzing human auditory processing and predicting psychophysical data have primarily concentrated on either correctly modeling the neural representations of acoustic signals or incorporating known physiological limitations of the auditory system. Typical approaches include the use of signal detection theory to fit Receiver Operating Characteristic curves to neural data and the use of computational auditory models to predict neural firing patterns. Although valuable, neither of these approaches addresses an issue critical to the design of improved remediation devices, namely signal "detectability". Basic signal processing models, such as energy or envelope detectors, have been formulated and used to predict detection performance; however, these methods do not incorporate physiological detail.

The theoretical performance predicted by such standard techniques often exceeds experimental performance on a task such as the detection of a tone in noise. The observed discrepancy is commonly attributed to additive "internal noise", the variance of which is adjusted until matching results are obtained. Our hypothesis is that a method which incorporates both the physiological details and the stochastic nature of the auditory system can be used to obtain more accurate predictions of detection performance without the addition of unexplained "internal noise". Computational models of human auditory processing provide a platform on which to apply signal detection theory directly to the "output" of various peripheral processes. The work presented here illustrates how signal detection theory can be integrated with two different computational auditory models [L.H. Carney, J. Acoust. Soc. Am., 93(1), 1993; Patterson et al., J. Acoust. Soc. Am., 98, 1995] and used to predict detection performance and study the effects of signal uncertainty. The results demonstrate that an integrated approach both supports and improves upon traditional methods of analysis.
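The general approach can be illustrated with a toy decision statistic (a Poisson count standing in for a model's output; this is a generic sketch, not the Carney or Patterson model):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for a peripheral model's "output": the decision statistic is
# the total discharge count over the observation interval.  The Poisson
# counts here are purely illustrative; a real study would use the
# computational model's actual outputs for each trial.
noise_alone = rng.poisson(lam=50.0, size=1000)       # N trials
tone_plus_noise = rng.poisson(lam=60.0, size=1000)   # SN trials

# Empirical ROC area = P(correct) in 2IFC, by pairwise comparison
# of all SN-trial statistics against all N-trial statistics.
diffs = np.subtract.outer(tone_plus_noise, noise_alone)
auc = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

# Equal-variance-Gaussian d' computed from the same trial statistics.
dprime = (tone_plus_noise.mean() - noise_alone.mean()) / np.sqrt(
    0.5 * (tone_plus_noise.var() + noise_alone.var()))
```

The point of the integrated approach is that the variability limiting `auc` and `dprime` comes from the modeled physiology itself rather than from a fitted "internal noise" term.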

P.M. Hofman, A.J. van Opstal, H.H.L.M. Goossens - Analysis of Monkey Inferior Colliculus Activity

The inferior colliculus (IC) in the mammalian midbrain is generally believed to play an important role in auditory processing. Research in several species has shown that the IC is involved in sound localization and that its neural activity encodes spatial information about the sound source. In this study, we investigated the monkey IC, about which little is known so far. Our objective is to relate IC activity to behavioral, spatial and acoustic parameters.

So far, one rhesus monkey (head-fixed) was trained to fixate visual LEDs as well as sound stimuli at random locations within the 2D oculomotor field. Sound stimuli consisted of broad-band noise, pure tones and frequency-modulated (FM) sweeps (typical duration 500 ms and frequency range [0.2, 16] kHz). Eye position, spiking activity (single- and multi-unit) and acoustic stimuli were recorded.

IC tuning to static as well as dynamic parameters of the stimulus was found. Neurons were often found to be sharply tuned to a particular frequency, the value of which systematically varied with electrode depth. Sometimes, neurons were sensitive to the sweep-velocity of the stimulus. Response latency was always found to be extremely short (about 5 ms).

Dependence of the activity on the sound source position was also observed. Modulation of the mean firing rate sometimes related monotonically to sound azimuth. Occasionally, modulation was found to correlate with saccadic eye movements. These findings support the hypothesis of IC involvement in acoustic orientation.

Finally, the spiking patterns of single units, and also multiple units, often proved to be unique for a specific stimulus. For example, neurons yielded different patterns for different noise bursts, but the patterns were highly reproducible upon repetition of individual (frozen) noise bursts. In order to relate the acoustic input to the neural output, we applied linear and non-linear systems analysis. The applicability of a feed-forward neural network in this analysis was investigated.

Sridhar Kalluri and Bertrand Delgutte - A model of ventral cochlear nucleus onset units

Research on speech perception, sound localization, and auditory scene analysis has shown that rapid acoustic transients are perceptually important. Thus, neurons in the ventral cochlear nucleus (VCN) that respond to rapid acoustic events (onset responders) may have an important role in auditory perception. We are studying the role of VCN onset neurons in auditory signal processing via a mathematical model.

Our model of an onset-responding cell is an integrate-to-threshold point neuron with two components: a) membrane dynamics characterized by two time constants and an absolute refractory period, b) excitatory synaptic inputs that derive from model auditory nerve (AN) fibers with a range of characteristic frequencies (CF) (Carney, JASA 93: 401-417). We separately fit the two components of the model. The parameters for the membrane dynamics are fit using intracellular recordings of octopus cell voltage responses to current steps (Golding, personal communication). The parameters that specify the AN synaptic input connection pattern are fit using onset unit responses to tones and noise.

In order to fit the data, one time constant of the model membrane dynamics, which represents the voltage integration process, must be between $0.2$ and $0.5$ milliseconds. The second time constant, which represents a threshold accommodation process, is two to three times slower. These membrane dynamics explain the absence of onset unit responses to high-frequency continuous tones ($> 2$ kHz) and the response of onset units to every cycle of low-frequency tones ($< 1$ kHz).

The peri-stimulus time histograms of onset unit responses to high-frequency tone bursts have a large initial response peak followed by a small steady-state response. In order to obtain this response property, the effect of a single AN synaptic input on membrane potential must be far less than threshold. Onset units phase-lock and entrain (spike on every stimulus cycle) to low-frequency tones better than AN fibers and other VCN units. In order to obtain these characteristics, the number of independent AN inputs to the model neuron must be large ($> 40$). The threshold of onset units to noise bursts exceeds their threshold to CF tone bursts by a smaller amount than in AN fibers or other CN units. In order to fit this property, the model onset neuron must have AN inputs spanning a broad range of CFs. These constraints on the model inputs are consistent with connectivity patterns derived from anatomical studies.

In order to predict onset unit responses for a variety of stimuli, several constraints must be met simultaneously: coincident inputs, fast membrane dynamics, and a "high-pass" filtering process in the membrane such as threshold accommodation. We are now in a position to use this well-constrained model for evaluating hypotheses concerning speech coding and correlates of psychophysical phenomena in VCN onset responders.
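An integrate-to-threshold neuron with threshold accommodation can be sketched as follows (parameter values are illustrative, chosen only to respect the fast-integration/slower-accommodation ratio described above, and are not the fitted values):

```python
import numpy as np

def onset_unit(events, dt=2e-5, tau_v=0.3e-3, tau_theta=0.9e-3,
               w=0.03, theta0=0.2, beta=1.5):
    """Integrate-to-threshold point neuron with a fast membrane time
    constant and a ~3x slower threshold-accommodation process.

    events[t] is the number of AN input spikes arriving in time step t;
    each input's effect (w) is far below threshold, so firing requires
    coincident inputs.  Parameter values are illustrative.
    """
    v, theta, spikes = 0.0, theta0, []
    for t, n in enumerate(events):
        v = v * np.exp(-dt / tau_v) + w * n     # fast voltage integration
        # Accommodation: the threshold slowly tracks the membrane
        # potential, so only rapid depolarizations (onsets) can cross it.
        theta += (max(theta0, beta * v) - theta) * (dt / tau_theta)
        if v > theta:
            spikes.append(t)
            v = 0.0
    return spikes

# Tone burst: strong synchronized AN input for the first 1 ms, then
# weaker sustained input -- an onset response with no sustained firing.
events = np.concatenate([np.full(50, 2.5), np.full(1000, 0.2)])
spikes = onset_unit(events)
```

The accommodating threshold acts as the "high-pass" membrane process named above: sustained depolarization raises the threshold, leaving only the initial response peak.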

Katuhiro Maki, Kaoru Hirota and Masato Akagi - A Functional Model of the Auditory Peripheral System: Responses to Simple and Complex Stimuli

See page 13 of the ASI Proceedings.

L.M. Miller, M. Escabi and C.E. Schreiner - Synchrony in the Thalamocortical System of the Ketamine-Anesthetized Cat and the Effects of Dynamic Ripple Stimulation

In the ketamine-anesthetized cat, widespread and non-functional synchrony occurs in the lemniscal thalamocortical system under both spontaneous and tone-driven conditions. These correlations are observed in the spike trains of simultaneously recorded single cortical (A1) and thalamic (MGBv) units. The synchronous activity is usually weakly and sometimes strongly oscillatory, nearly always in the frequency range corresponding to spindles (7-14 Hz). It is non-functional in the sense that its presence and strength are largely uncorrelated with the cells' spectral and temporal receptive field properties and with the known gross anatomical connections between thalamus and cortex. Such patterns of activity strongly suggest that the thalamocortical system of the ketamine-anesthetized cat is commonly in the firing state known in the thalamic literature as "burst mode". We also investigate the effects of noise-like dynamic ripple stimulation on the thalamocortical correlations. Spectrotemporal receptive fields derived with ripple stimulation provide information about many of the complex receptive field properties that have traditionally been studied, including characteristic frequency, FM sweep preference, modulation transfer properties, and binaural interaction class. Preliminary data indicate that when the thalamocortical system is in a highly synchronized state, stimulation with dynamic ripple stimuli tends to suppress the degree of non-functional synchrony, apparently preserving those correlations that are functionally relevant. These observations have profound consequences not only for studies whose measures of functional activity depend on correlated firing between thalamic and/or cortical neurons, but for all studies in the ketamine- (and probably pentobarbital-) anesthetized cat.

Israel Nelken and Omer Bar Yosef - Processing of Complex Sounds in Cat Primary Auditory Cortex

Primary auditory cortex (AI) is often considered to represent sounds in relatively simple terms. Neurons in AI are usually identified by their best frequency (BF), tuning curve bandwidth, binaural interaction class at BF, etc. This view gained support from recent studies of the linearity of spectro-temporal integration in AI by Shamma and his coworkers. Here, we present data which suggest that such a simple view is far from capturing the full complexity of the spectro-temporal integration mechanisms of AI. We used two different paradigms for studying AI of halothane-anesthetized cats. In the first, natural sounds containing frequency-modulated sweeps (chirps) were used to stimulate AI neurons. The natural sound segments contained echoes and background noise, which were separated from the main chirp by signal processing techniques. We found that the responses to the main chirp were modified in essential ways by the echoes and backgrounds. Interactions included occlusion, facilitation and suppression of the response to one component by the other components. These results cannot be explained by a linear spectro-temporal integration mechanism or by simple modifications of such a linear mechanism. The second paradigm used stimuli typically employed in psychophysical tests of Comodulation Masking Release (CMR). These stimuli included amplitude-modulated noise bands of various widths, with and without a pure tone added to them. In many cases, adding a low-level tone to amplitude-modulated bands caused dramatic changes in the responses evoked in AI neurons. Low-level tones suppressed the envelope-following response typical of such stimuli; sometimes this suppression occurred at levels far below the threshold to the pure tone in silence. Such strong interactions, which may be a physiological basis for CMR, are again highly non-linear in nature.
We conclude that strong, qualitative non-linearities can be demonstrated in AI neurons provided naturalistic sounds are used, and that some of these non-linearities can be considered as specializations for the performance of important computational tasks, such as CMR.

Abdullah Ruhi Soylu, S. Yagcioglu and Pekcan Ungan - Searching for the Differences between the Cortical Sites Processing Different Auditory Stimuli: Dipole Source Localization by Means of Guided Random Techniques.

Localizing the generators of human evoked scalp potentials is one of the main goals of present-day electroencephalography. In general, there is more than one solution to the source identification problem. When head geometry is modeled by three concentric spheres of different conductivities, representing brain, skull and scalp, and the sources by current dipoles, the parameters of the model (locations and orientations of the dipoles) can be estimated using a suitable optimization procedure that minimizes the difference between measured and calculated scalp potentials.

The most frequently used optimization techniques for this purpose may be classified into two main groups: gradient-based and enumerative schemes. Gradient-based search algorithms converge to the global minimum only if the initial guess is in its local quadratic neighborhood. Enumerative techniques such as the simplex method are in general more adaptive than gradient-based ones, but both are frequently trapped in local minima of ill-defined or multimodal objective functions, which is the case when more than one or two dipoles are assumed in the above-mentioned model.

More robust optimization procedures, which may be called guided random search algorithms, are increasingly used to deal with complicated cost functions. Simulated annealing and genetic algorithms are two of these approaches. Simulated annealing is a cooling procedure in which the stochastic nature of the search is gradually limited as the global solution is approached. Genetic algorithms use the machinery of biological evolution to transform a randomly generated population of candidate solutions towards a highly evolved one.

In order to localize the generators of the human long-latency auditory evoked potentials recorded from the scalp using 124 electrodes, simulated annealing and a genetic algorithm are used in this study, along with singular value decomposition, which is much more efficient for estimating the linear parameters, with the final goal of discriminating the cortical processes activated by different auditory stimuli. Some preliminary results are presented.
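A generic simulated-annealing sketch on a stand-in multimodal cost function (in the dipole problem the cost would be the misfit between measured and forward-modelled scalp potentials; the cooling schedule and step sizes below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def cost(x):
    """Stand-in multimodal objective with many local minima; the real
    cost would be the residual between measured and calculated scalp
    potentials as a function of dipole parameters."""
    return float(np.sum(x**2) + 2.0 * np.sum(1.0 - np.cos(3.0 * x)))

def simulated_annealing(x0, n_iter=30000, T0=5.0):
    """Metropolis acceptance with a linear cooling schedule: uphill
    moves are accepted with probability exp(-dE/T), so the search can
    escape local minima while the temperature T is still high."""
    x = np.asarray(x0, dtype=float)
    fx = cost(x)
    best, fbest = x.copy(), fx
    for k in range(n_iter):
        T = T0 * (1.0 - k / n_iter) + 1e-3        # gradually cool
        cand = x + rng.normal(scale=0.1 * np.sqrt(T), size=x.size)
        fc = cost(cand)
        if fc < fx or rng.random() < np.exp(-(fc - fx) / T):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x.copy(), fx
    return best, fbest

# Start far from the global minimum at the origin, in a multimodal
# landscape where gradient descent would be trapped nearby.
best, fbest = simulated_annealing([3.0, -3.0, 3.0])
```

The same cost function could equally be handed to a genetic algorithm; the point is that both methods tolerate the multimodal objectives that trap gradient-based and simplex searches.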

Konstantina M. Stankovic - A Method for Evaluation of Multiparameter Nonlinear Models Illustrated on a Computational Model of Auditory-Nerve Rate-Level Curves

Multi-parameter nonlinear models are commonly used in the field of computational neuroscience. To critically evaluate such models, it is necessary to perform a sensitivity analysis of model parameters for various data sets. A small value of the criterion function at a point in parameter space does not necessarily correspond to a good parameter selection. For example, certain parameters in a model may be superfluous, so that numerical convergence of the optimization procedure does not render reliable (or even meaningful) estimates of these parameters. Thus, identification of the parameters that can be reliably estimated in a given model becomes critically important.

Traditionally, sensitivity analyses have been performed using iterative simulations, which are computationally intensive -- especially for models with a large number of parameters -- and offer little insight. In 1992, Vélez-Reyes developed a systematic and efficient technique for sensitivity analysis of model parameters using component-wise condition numbers for nonlinear least-squares problems. We have recently extended the Vélez-Reyes method to include estimation of the standard deviation of the well-conditioned parameter values through calculation of the covariance matrix. Thus far, the Vélez-Reyes method has been used in electrical engineering (for applications in electrical machines and in remote sensing), but it is also applicable in computational neuroscience.

Here, we illustrate the usefulness of this technique using a multi-parameter nonlinear model commonly employed in the field of computational hearing: the model of auditory-nerve fiber (ANF) rate-level (RL) curves proposed by Sachs and Abbas (1974, 1989). The model assumes that the dependence of firing rate, $R$, on sound-pressure level at the tympanic membrane, $P$, is characterized by four parameters [not displayed here].

We applied the extended Vélez-Reyes method to estimate model parameters for 99 RL curves from cat ANFs in response to tones at the fiber's characteristic frequency. We show that the four parameters of the model were always fit reliably. Standard deviations in the estimated parameters are typically small, except for $\theta_I$, for which the standard deviation depends strongly on the type of saturation in the RL curve (being small for "sloping saturation" and large for "flat saturation").

When compared with common practice, the extended Vélez-Reyes procedure offers two distinct advantages in model-parameter identification based on nonlinear least squares: (1) it determines the subset of parameters that can be reliably estimated from the data, thus improving the numerical conditioning of the optimization problem, and (2) it predicts the likely ranges of parameter variation, with direct implications for convergence criteria.
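The flavor of such a conditioning analysis can be shown on a toy nonlinear least-squares model with a nearly unidentifiable parameter pair (a generic stand-in, not the Sachs-Abbas model or the full component-wise Vélez-Reyes procedure):

```python
import numpy as np

# Toy model y = a*exp(-b*x) + c*exp(-d*x) with b close to d, so the
# Jacobian columns for a and c are nearly collinear and the pair (a, c)
# is poorly identified -- the situation the analysis is meant to detect.
x = np.linspace(0.0, 5.0, 100)

def jacobian(p, x):
    a, b, c, d = p
    return np.column_stack([np.exp(-b * x), -a * x * np.exp(-b * x),
                            np.exp(-d * x), -c * x * np.exp(-d * x)])

J = jacobian(np.array([1.0, 1.0, 0.5, 1.05]), x)

# Conditioning of the linearized estimation problem at this point:
sv = np.linalg.svd(J, compute_uv=False)
cond = sv[0] / sv[-1]

# Linearized parameter covariance (unit noise variance) and standard
# deviations of the estimates, as in the covariance extension:
cov = np.linalg.inv(J.T @ J)
stderr = np.sqrt(np.diag(cov))
```

A large condition number and large standard deviations flag the parameters that a numerically "converged" fit has not actually pinned down.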

F. Tennigkeit, E. Puil and D. Schwarz - Mechanisms responsible for the variability of output signals in neurons of the medial geniculate body

Neuronal responses to sound stimuli are commonly used to interpret the function of the auditory system in computer models. A discrete transfer function is usually an inadequate representation of neuronal signal transformation because membrane properties are highly variable in the central nervous system. As an example, we present a survey of factors affecting signal generation in neurons of the ventral division of the medial geniculate body (MGBv), studied in a slice preparation with the whole-cell recording technique.

MGBv neurons represent a current input faithfully in the rate of tonic firing if held at relatively depolarized membrane potentials, similar to the resting potential during alertness. In this tonic firing mode, the transfer function is subject to regulation by transient and sustained K+ currents and a persistent Na+ current. At the hyperpolarized potentials that characterize states of sleep, however, the neurons respond to depolarizing inputs with phasic onset bursts, consisting of a low-threshold Ca2+ spike (LTS) crowned by action potentials and high-threshold Ca2+ spikes. The burst magnitude is regulated by K+, persistent Na+ and inwardly rectifying currents; bursts also occur as a rebound from hyperpolarizing stimuli. The LTS in this burst mode supports oscillatory behaviour, which is evident as a resonance in the frequency response representing the impedance (Z) amplitude profile (ZAP) of the neuron. The low resonance frequency of MGBv neurons (1-2 Hz) matches the free oscillations in thalamo-cortical circuits observed as sleep or absence spindles. At depolarized potentials, the ZAP function of MGBv neurons shows low-pass filter characteristics.

Under isoflurane anesthesia, voltage responses are shunted in both firing modes by a leak conductance carried mainly by K+. Thus, signal transformation differs between anesthesia and normal states such as alertness and sleep.

It is well known that the intrinsic behaviour of thalamo-cortical neurons is subject to change by neuromodulation, e.g. through cholinergic and noradrenergic projections. We studied the influence of metabotropic glutamate receptors that probably mediate a cortico-thalamic modulation. Activation of these receptors with 1S,2R-ACPD led to a G-protein mediated, TTX-resistant Na+ current, complemented by an outward current in the depolarized range and inhibition of an inward rectifier in the hyperpolarized range.

We also studied the activation of GABAB receptors which may mediate a modulatory influence from the inferior colliculus and thalamic reticular nucleus. Application of the GABAB-receptor agonist baclofen hyperpolarized MGBv neurons by a G-protein mediated activation of a K+ current. The current shunted voltage responses and eliminated the resonance and, therefore, the intrinsic frequency selectivity of MGBv neurons.

### Cochlear Implants

I.C. Bruce, M.W. White, L.S. Irlicht, S. J. O'Leary and G.M. Clark - Advances in Computational Modeling of Cochlear Implant Physiology and Perception

Historically, computational models of electrical stimulation of the cochlea have been unable to explain or predict a number of basic perceptual phenomena found in cochlear implant users. This could be due to either (i) an imperfect understanding of the auditory nerve activity resulting from electrical stimulation of the cochlea, and/or (ii) an incomplete understanding of the relationship between auditory nerve activity and perception. In this study we investigate the former of these two possibilities. In particular, we hypothesize that stochastic (random) auditory nerve activity, which is present in responses to electrical stimulation but has been ignored in most previous models of cochlear implant physiology and perception, may largely account for many of these discrepancies. We have developed two different computationally efficient models of electrical stimulation, one incorporating stochastic activity in single auditory nerve fibers (stochastic model) and the other ignoring it (deterministic model). Using a standard model for relating auditory nerve activity to loudness perception, predictions were made of threshold, intensity difference limen and dynamic range. The results indicate that the stochastic model consistently gives better predictions of the perceptual data than the deterministic model. In conclusion, understanding of the functional significance of auditory nerve responses to electrical stimulation of the cochlea is improved by consideration of stochastic activity.
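A minimal sketch of why stochastic threshold fluctuations matter for such perceptual predictions (the integrated-Gaussian form and the relative-spread value are illustrative assumptions, not the authors' model):

```python
import numpy as np
from math import erf, sqrt

def firing_prob(current, theta=1.0, rs=0.1):
    """Probability that a fiber fires on a single electrical pulse,
    with Gaussian threshold fluctuations; rs is the relative spread
    (noise s.d. over mean threshold).  Values are illustrative."""
    return 0.5 * (1.0 + erf((current - theta) / (rs * theta * sqrt(2.0))))

currents = np.linspace(0.5, 1.5, 201)
p_stoch = np.array([firing_prob(i) for i in currents])
p_det = (currents >= 1.0).astype(float)      # deterministic: all-or-none

def dynamic_range(p, currents, lo=0.05, hi=0.95):
    """Current range over which firing probability grows from lo to hi."""
    idx = np.where((p >= lo) & (p <= hi))[0]
    return float(currents[idx[-1]] - currents[idx[0]]) if idx.size else 0.0
```

The deterministic fiber switches from never firing to always firing at one current, so a population of identical deterministic fibers predicts an implausibly narrow dynamic range; the stochastic fiber's graded growth function is what allows realistic threshold, difference-limen and dynamic-range predictions.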

John Hewitt - Modeling Cochlear Implant Stimulation of the Auditory Nerve

Cochlear prostheses are designed to restore hearing in patients with profound sensorineural deafness by stimulating auditory neurons with electrodes placed on or within the cochlea. Stimulation strategies that rely on digitization of the stimulus amplitude envelope in several bandpassed channels do not properly account for the refractory period and adaptation of the auditory neurons. A Neural Sum Modulation (NSM) strategy has been proposed (Parkins et al., 1983) in which the number of neurons responding to electrical stimulation is made proportional to the stimulus amplitude envelope by applying to the stimulation pulse train an adaptation filter that probabilistically accounts for adaptation and the refractory period. This method requires accurate knowledge of the electric potential generated by the implant at the excitable region of the auditory nerve, as well as knowledge of how the nerve responds to this potential. To this end, a three-dimensional electric field model of an implanted helical cochlea was built using the boundary element method, a method related to the finite element technique. The electric field values at the approximate locations of the nodes of Ranvier in the 8th nerve were then used in an auditory nerve model to evaluate the neural responsiveness to various patterns of stimulation.

### July 6 Bat Audition

Rolf Müller and Hans-Ulrich Schnitzler - Using Computational Insights for Devising Experimental Research Strategies on Acoustic Flow Perception in CF-Bats

Please see page 43 of the ASI Proceedings.

Mark I. Sanderson and James A. Simmons - Single and Multiunit Encoding of FM Sweeps in the Big Brown Bat (Eptesicus fuscus) Inferior Colliculus

Please see page 439 of the ASI Proceedings.

Janine M. Wotton, Michael J. Ferragamo, Rick L. Jenison and James A. Simmons - Time and frequency information used in the computation of elevation by an echolocating bat

The big brown bat, Eptesicus fuscus, emits broadband, frequency-modulated (FM) echolocation signals and uses information contained within the echoes to find and catch insects in flight. The information echolocating bats acquire is a combination of the properties of the sound they emit and the sound they receive at the eardrum. Pinnae act as spatially dependent filters, such that changes in sound source position result in systematic spectral changes in the acoustical transfer functions measured at the eardrum. The potential localization cues available to bats are contained in the combination spectra produced by convolving the magnitude spectra of the emission and the external ear. Localization cues appear to be enhanced in the combination: the peaks are sharpened and there is a greater contrast in intensity between peaks and notches. A backpropagation network model localizes more accurately when provided with combination spectra than with either emission or ear information alone. The introduction of a notch in a phantom target "echo" influences bats' discrimination ability for echoes from different elevations. These acoustic effects can just as easily be thought of as events in the time waveform that the bat receives at the eardrum. Bats use temporal information to determine the range and shape of targets and have shown a remarkable acuity for measuring time delay. Changes in the elevation of phantom targets influence the bats' performance in the psychophysical jittered-echo paradigm used to measure time delay. We explore the possibility of temporal coding in the bat's perception of space.
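The combination-spectrum idea can be sketched with hypothetical smooth spectra: convolving the emission with the ear's impulse response multiplies their magnitude spectra, so the dB spectra simply add and the elevation-dependent notch survives in the combination (all spectral shapes and frequencies below are illustrative assumptions):

```python
import numpy as np

f = np.linspace(20e3, 100e3, 400)              # frequency axis (Hz)

# Hypothetical smooth magnitude spectra in dB: the emission rolls off
# with frequency, and the pinna adds a broad peak plus an
# elevation-dependent spectral notch.
emission_db = -20.0 * (f - 20e3) / 80e3
notch_freq = 45e3
ear_db = (10.0 * np.exp(-((f - 70e3) / 15e3) ** 2)
          - 25.0 * np.exp(-((f - notch_freq) / 3e3) ** 2))

# Convolution of the time waveforms multiplies the magnitude spectra,
# i.e. the dB spectra add:
combination_db = emission_db + ear_db
```

In this toy version, the notch location, the cue a localization network could read out, is preserved (and the notch deepened in absolute terms) in the combination spectrum.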

### Sound Localization and Binaural Mechanisms

Michael A. Akeroyd and A. Quentin Summerfield - A Computational Model of the Lateralization of Dichotic Pitches

A "dichotic" pitch is the sensation of pitch generated through binaural interaction. The existence of dichotic pitches was first described by Cramer and Huggins (1958, JASA, 30, 413-7), who introduced an interaural phase transition of 360° over a narrow range of frequencies in a diotic white noise. The sound of the noise was always heard, but when, and only when, both ears were stimulated, listeners also heard a tone whose pitch corresponded to the center frequency of the interaural phase transition. The tone has an ambiguous lateralization: some listeners hear it to both the left and right sides of the head, while others hear it to only one side. An understanding of dichotic pitches may help identify the processes through which the frequency and position of weak narrowband sound sources are judged in the context of more intense broadband sources.

The interaural phase transition in the Cramer-Huggins stimulus disrupts the interaural correlation of the noise over a narrow frequency band, giving a peak in a profile of interaural de-correlation across frequency. The pitch is predicted accurately by the frequency of this peak. Consequently, it is possible to create a dichotic pitch simply by setting to zero the interaural correlation of a narrow band of frequencies in an otherwise correlated (diotic) noise. This poster reports measurements of, and computational predictions of, the lateralizations of such dichotic pitches. We interaurally decorrelated an 80-Hz-wide band of a 1200-Hz lowpass-filtered, diotic noise, and then interaurally time-delayed the ensemble by 833 μs, so that the diotic noise was lateralized on the right side of the head. In isolation, the interaurally-decorrelated narrow band has a diffuse lateralization, but in the context of the noise it is heard as a tone with a definite pitch and a precise lateralization. Moreover, its lateralization depends upon its center frequency: it is heard on the left at 400 Hz, on the midline at 600 Hz, and on the right at 800 Hz. If, however, the time-delayed noise is replaced by an anti-phasic noise (the noise waveform is inverted at one ear), then the lateralization of the dichotic pitch is fixed and does not depend on center frequency.

We are exploring the ability of three computational models of binaural processing to account for these results. The peripheral processing simulated by the models is the same: matched left and right gammatone filterbanks, followed by halfwave rectifiers, log compressors, and a binaural cross-correlator. The lateralization of the dichotic pitch is predicted from the across-time-delay pattern of correlation in the frequency channel centered upon the decorrelated narrow band. Initial tests have shown that equating the lateralization to the peak of the correlation pattern (Shackleton et al., 1992, JASA, 91, 2276-79) or the centroid of the pattern (Stern et al., 1988, JASA, 84, 156-65) does not predict the full range of lateralizations. A third model, derived from Raatgever and Bilsen (1986, JASA, 80, 429-41), yields more accurate predictions. The correlation pattern for the dichotic pitch is compared to the pattern for the noise alone, and the point of maximum difference is equated to the lateralization. The lateralization thus traces valleys in the correlation pattern of the noise and, since the location of the valleys changes with frequency in a time-delayed noise but is fixed in an anti-phasic noise, the model can predict the observed lateralizations. At the meeting we shall report further tests of the generality of this account of the lateralization of dichotic pitches.
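
The competing read-outs of the cross-correlation pattern can be illustrated on a toy binaural signal. The sketch below is a minimal numpy illustration, not the authors' implementation: it omits the gammatone filterbank, rectification and compression stages, and simply computes a normalized interaural cross-correlation over lag for one wideband channel, then compares the peak and centroid read-outs.

```python
import numpy as np

def interaural_crosscorr(left, right, fs, max_lag_ms=2.0):
    """Normalized interaural cross-correlation as a function of lag (ms)."""
    max_lag = int(fs * max_lag_ms / 1000)
    lags = np.arange(-max_lag, max_lag + 1)
    mid = slice(max_lag, len(left) - max_lag)          # trim edges
    denom = np.sqrt(np.dot(left[mid], left[mid]) *
                    np.dot(right[mid], right[mid]))
    cc = np.array([np.dot(left[mid], np.roll(right, -k)[mid]) / denom
                   for k in lags])
    return lags / fs * 1e3, cc

fs = 20000
rng = np.random.default_rng(0)
left = rng.standard_normal(fs)                         # 1 s of noise
right = np.roll(left, 10)                              # right ear delayed 0.5 ms

lag_ms, cc = interaural_crosscorr(left, right, fs)
peak = lag_ms[np.argmax(cc)]                           # peak read-out
pos = np.clip(cc, 0, None)
centroid = np.sum(lag_ms * pos) / np.sum(pos)          # centroid read-out
```

For this simple delayed noise both read-outs agree near 0.5 ms; the abstract's point is that for dichotic-pitch stimuli they diverge, and a difference-from-noise-alone read-out does better.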

Douglas S. Brungart - Preliminary Model of Auditory Distance Perception for Nearby Sources

Recent measurements of the Head-Related Transfer Function (HRTF) at source distances less than 1 m have verified that the interaural intensity difference (IID) increases substantially as a nearby source approaches the head, while the interaural time delay (ITD) remains roughly independent of distance. Therefore it is theoretically possible for a listener to determine the distance of a nearby sound source by first determining its lateral position from the ITD and then estimating its distance from the magnitude of the IID. This paper describes a simple model of this localization process based on previously measured values of the just-noticeable differences in ITD and IID for a 500 Hz tone. The general predictions of this model are consistent with the data from a preliminary near-field localization experiment, which showed that auditory distance perception was significantly more accurate for lateral sources than for sources in the median plane. The model also predicts that the threshold percentage change in distance will decrease as distance decreases, although no psychoacoustic data are currently available to confirm this prediction.
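
The two-stage read-out can be sketched in a few lines. This is a toy stand-in, not the paper's fitted model: the IID-versus-distance curve below is a hypothetical functional form chosen only to be monotone in distance and to grow with lateral angle, and the angle step uses the standard Woodworth spherical-head ITD formula.

```python
import numpy as np

HEAD_RADIUS, C = 0.0875, 343.0          # head radius (m), speed of sound (m/s)

def woodworth_itd(theta):
    """Spherical-head ITD (s) for lateral angle theta (rad); distance-independent."""
    return HEAD_RADIUS / C * (theta + np.sin(theta))

def iid_db(distance, theta):
    """Hypothetical near-field IID (dB): grows as the source approaches the
    head and with lateral angle. Illustrative, not measured data."""
    return 6.0 * np.sin(theta) / distance

def estimate_position(itd, iid):
    # Step 1: lateral angle from the ITD (bisection on the monotone formula).
    lo, hi = 0.0, np.pi / 2
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if woodworth_itd(mid) < itd:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    # Step 2: distance from the IID magnitude at that angle.
    d = np.linspace(0.1, 1.0, 901)
    dist = d[np.argmin(np.abs(iid_db(d, theta) - iid))]
    return theta, dist

theta_hat, dist_hat = estimate_position(woodworth_itd(0.5), iid_db(0.4, 0.5))
```

Because the IID-distance slope in this sketch scales with sin(theta), distance estimates degrade toward the median plane, mirroring the experimental finding reported above.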

Klaus Hartung and Susanne J. Sterbing - A physiology-related model for the localization of sound sources

Electrophysiological investigations in the midbrain of the guinea pig (central nucleus of the inferior colliculus, ICc) revealed that more than 90% of the neurons were spatially tuned when stimulated with broadband virtual sound sources (VSS). Most of the neurons preferred lateral positions, but tuning to frontal or rear directions and to different elevations was also observed. The majority of neurons did not change their best position over a wide dynamic range. Stimulation with VSS signals of different bandwidth and center frequency showed that the spatial tuning characteristics of ICc neurons changed in comparison to broadband stimulation: the size of the receptive fields differed and/or front/back ambiguities occurred frequently. Based on the individual head-related transfer functions of each animal, the interaural level differences (ILD), interaural time differences (ITD) and the monaural directivity were calculated in 1/3-octave bands for the upper hemisphere. It was assumed that the neurons received input from ILD and ITD processors and from monaural pathways. The relative weights of each input were estimated by a least-squares approximation of the neuronal response. These weights were different for each of the tested neurons. High weights were found for ILD cues at or close to the characteristic frequency (CF) of the neuron. Based on this single-neuron model, a localization model using a population of neurons tuned to different directions was tested in a localization task. The model allowed a robust estimation of the direction of the sound source.
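
The weight-estimation step is an ordinary least-squares fit of a neuron's response to the band-wise cue values. The sketch below uses synthetic numbers purely for illustration; in the study the cue matrix comes from each animal's own HRTFs in 1/3-octave bands.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic cue matrix: one row per sound direction, columns = ILD, ITD,
# and monaural directivity in one frequency band (arbitrary units).
cues = rng.standard_normal((40, 3))
true_w = np.array([1.8, 0.3, 0.6])          # e.g. ILD dominant near CF
response = cues @ true_w + 0.05 * rng.standard_normal(40)

# Relative weight of each input, by least-squares approximation of the
# neuronal response, as in the single-neuron model described above.
w_hat, *_ = np.linalg.lstsq(cues, response, rcond=None)
```

Repeating this fit per neuron yields a different weight vector for each unit; high ILD weights near CF then fall out of the data rather than being imposed.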

Kazuhito Ito and Masato Akagi - A Computational Model of Auditory Sound Localization

A functional model of auditory sound localization based on the interaural time difference (ITD) is presented. In this model, signal patterns in the organism, such as nerve impulses or synaptic transmissions, are represented computationally according to biological knowledge, and these patterns are applied to the coincidence-detector circuits for ITD. The results of simulations using only the coincidence-detector circuits show that the firings of one coincidence detection spread over the circuits, even in response to just one pair of stimuli from the ears. Thus, it is difficult to determine the actual ITD using the coincidence-detector circuits alone. To determine ITD more accurately, the existence of inhibitory neurons in higher-order nuclei receiving projections from the circuits is assumed. Since the coincidence detector indicating the actual ITD tends to fire earlier than the others, the first firing event at the actual ITD excites its own postsynaptic neurons and inhibits the others. Consequently, the model with inhibition can improve the accuracy of detecting ITD.
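
Both observations, that several delay-tuned detectors fire even for one pair of phase-locked inputs and that first-spike inhibition can disambiguate them, can be reproduced with a minimal delay-line sketch. Binary spike trains and all parameter values below are illustrative, not taken from the abstract's model.

```python
import numpy as np

period, itd, n = 80, 20, 2000          # samples (e.g. 10 us per sample)
left = np.zeros(n)
left[::period] = 1                     # phase-locked spikes, one per cycle
right = np.zeros(n)
right[itd:] = left[:n - itd]           # right ear lags by the true ITD

delays = np.arange(0, 121)
first_fire = []
for d in delays:
    delayed = np.zeros(n)
    delayed[d:] = left[:n - d]         # internal delay line on the left input
    coinc = np.where((delayed > 0) & (right > 0))[0]
    first_fire.append(coinc[0] if coinc.size else np.inf)
first_fire = np.array(first_fire)

active = delays[np.isfinite(first_fire)]   # detectors at 20 AND 100 both fire
winner = delays[np.argmin(first_fire)]     # earliest firing -> inhibits the rest
```

Because the input is periodic, the detector at the true ITD (20 samples) and a detector one period away (100 samples) both fire, but the true one fires first, so a winner-take-all inhibitory stage selects it.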

Brian L. Karlsen - A Probabilistic Localization Model

This poster will present the framework for a new kind of horizontal localization model. The primary idea behind the model is that humans do not localize sound sources categorically; rather, several alternatives exist simultaneously. This would especially be the case when more than one source is present, but it could also be argued that several competing alternatives are present in the case of front-back confusion. The model takes its input from the hair-cell firing probabilities of a nonlinear model of the auditory periphery. On this basis, three probabilistic maps are computed: interaural time difference (ITD), interaural intensity difference (IID) and spectral estimation. These maps are 3D, with two dimensions being time and azimuth angle and the third being the probability of a source being present at that particular angle at that particular time. The maps are then multiplied to generate a combined probabilistic map. This combined map can then be integrated over time to yield a final probability distribution of the sources with respect to azimuth angle.
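
The map-combination step can be sketched directly. The Gaussian-profile maps below are synthetic stand-ins for the ITD, IID and spectral maps; in the model these come from the hair-cell firing probabilities of the periphery model.

```python
import numpy as np

n_t, n_az = 100, 73                    # time frames x 5-degree azimuth bins
rng = np.random.default_rng(3)

def cue_map(true_bin, sharpness):
    """Hypothetical per-cue probability map, normalized per time frame."""
    az = np.arange(n_az)
    profile = np.exp(-0.5 * ((az - true_bin) / sharpness) ** 2)
    m = profile + 0.2 * rng.random((n_t, n_az))     # broadcast + noise
    return m / m.sum(axis=1, keepdims=True)

itd_map, iid_map, spec_map = cue_map(30, 4), cue_map(30, 6), cue_map(30, 10)

combined = itd_map * iid_map * spec_map             # multiply the three maps
combined /= combined.sum(axis=1, keepdims=True)     # renormalize per frame
posterior = combined.mean(axis=0)                   # integrate over time
```

Because the maps are multiplied rather than thresholded, competing alternatives (e.g. a front-back pair) survive in the combined map until the evidence resolves them.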

Daniel J. Tollin - Computational Model of the Lateralisation of Clicks and their Echoes

A computational model was developed to describe the results of psychophysical experiments that investigated the precedence effect with clicks. The model includes several processing steps: First, the signals to the two ears are passed through an array of bandpass filters that simulate the frequency selectivity of the peripheral auditory system. The center frequencies of the specific array of filters were determined by an algorithm that considered both the spectral characteristics of the signals at each ear and a "dominance region" around 750 Hz where the interaural characteristics of the signals are particularly effective in lateralisation. The filtered signals at each ear are then half-wave rectified, and the resultant patterns are used to determine the post-stimulus-time histograms of an auditory nerve fiber's response to a click. It is assumed that the auditory filters are linear and time-invariant within the small relevant operating range. The simulated temporal responses for each ear's signal provide the direct input to an interaural cross-correlation device and a device that extracts information about interaural differences in amplitude. A correlate of the subjective intracranial lateral position produced by the acoustical stimuli is estimated from a linearly weighted sum of the information from the interaural delay and interaural amplitude processors taken across the array of filters. Lateralisation discrimination "thresholds" are computed in the model by varying the interaural parameters of the stimuli until the lateral position estimate reaches a predetermined threshold lateral position. The predictions of the model are consistent with many of the major trends in the psychophysical data. The model provides insights into mechanisms potentially responsible for the precedence effect with clicks. Predictions from other, non-computational models will be discussed along with their limitations.

S. Yagcioglu, P. Ungan - A Computational Model for Neural Coding of Interaural Time Disparities Based on the IE Units in the Lateral Superior Olive

A computational model based on the inhibitory-excitatory (IE) units in the lateral superior olive (LSO) is proposed to simulate the interaural time difference (ITD) dependent attenuation and latency shifts observed in the wave DN1, the earliest and most prominent component of the binaural difference potential (BDP) in cat auditory brainstem responses (ABRs). The BDP, which is computed by subtracting the sum of the two monaural ABRs from the binaural one, is considered to be an indicator of binaural interaction (BI) in the brainstem nuclei responsible for sound lateralization. The model stands as an alternative to the delay-line coincidence-detector models based on excitatory-excitatory (EE) units in the medial superior olive (MSO), whose predictions are not compatible with the way BDP latency depends on ITD.

The model assumes that the afferent impulses elicited by a click arrive at each unit with a delay that is a random variable with a normal (Gaussian) distribution. An IE unit can fire only if the excitatory impulse reaches the unit either before the inhibitory impulse due to contralateral stimulation or after the inhibition already caused by an earlier impulse has worn off. The probability distribution of the latency of this firing is expected to determine the waveform of component P4 of the ABR, which represents the far-field-recorded synchronized action potentials (APs) ascending via the fibers of the bilateral lateral lemnisci (LL) and has a latency congruent with DN1. By estimating the contributions of the two sides of the brainstem to P4, we were able to investigate the changes in DN1 amplitude and latency due to ITD.

Parameters of the model are the relative timing of the ipsi- and contralateral inputs to LSO neurons, the variability of the conduction times of the fibers carrying excitatory and inhibitory impulses, and the duration of inhibition in a typical LSO cell caused by a contralateral click. Representative values for these parameters were obtained from the literature and adjusted so as to yield a better fit to the ITD-dependent latency and magnitude changes observed in the BDP. The model was quite successful in predicting the experimental data, supporting the hypothesis that the LSO can also act as an ITD encoder for transients, besides its recognized function of encoding interaural level differences for steady sounds.
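
The firing rule lends itself to a small Monte Carlo sketch. The parameter values below are illustrative placeholders, not the literature-derived values used in the study: excitatory and inhibitory arrival latencies are Gaussian, and the unit fires only if excitation arrives before inhibition or after the inhibition has worn off.

```python
import numpy as np

def ie_unit(itd_ms, n=20000, mu_e=1.0, mu_i=1.2, sd=0.3,
            inhib_dur=2.0, seed=0):
    """Firing probability and mean spike latency of a model IE unit
    for one click, given the ITD applied to the contralateral side."""
    rng = np.random.default_rng(seed)
    t_exc = rng.normal(mu_e, sd, n)                 # ipsilateral, excitatory
    t_inh = rng.normal(mu_i + itd_ms, sd, n)        # contralateral, inhibitory
    fires = (t_exc < t_inh) | (t_exc > t_inh + inhib_dur)
    return fires.mean(), t_exc[fires].mean()

# Delaying the contralateral (inhibitory) input raises firing probability,
# mirroring the ITD-dependent amplitude changes of the DN1 component.
p_lead, lat_lead = ie_unit(-0.5)
p_lag, lat_lag = ie_unit(+0.5)
```

Sweeping `itd_ms` and histogramming the surviving latencies gives the ITD-dependent amplitude and latency profile that is then compared to the BDP data.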

### July 8 Timing and Periodicity

Stefan Bleeck and Gerald Langner - Temporal Processing in the Auditory System: Latency as a Possible Carrier of Information

Please see page 83 of the ASI Proceedings.

E. Blumschein, J. Blumschein and V. Meyer - A Model of Hearing Based on Gradual Forgetting for Coincidence Detection

Please see page 89 of the ASI Proceedings.

Peter Cariani - Neural Timing Nets for Auditory Computation

In many lower auditory stations, the time structure of the acoustic stimulus is impressed upon the timing of neural discharges. To the extent that this occurs, the all-order interval distributions of individual auditory neurons (auditory nerve, cochlear nuclei, subpopulations of inferior colliculus) resemble the autocorrelation function of the filtered stimulus. When these intervals are summed over populations including units with many best-frequencies, patterns of major peaks in the population-interval distributions show strong correspondences with a wide variety of human pitch judgments.

Current models of the perception of pitch generally assume that stimulus-driven time patterns are converted to spatial excitation patterns via the differential activation of periodicity-tuned units in the ascending auditory pathway (e.g. the inferior colliculus). In effect, an explicit measurement of periodicity is carried out. For pitch-matching tasks, such spatial patterns are assumed to be stored in memory. Such explanations run into potential problems, however, when the characteristics of periodicity-tuned units are considered. These include the broadness of modulation tuning, the loss of such tuning at higher sound pressure levels and in noise, the systematic shift to lower and lower best modulation frequencies as one ascends the pathway, and the incongruence between modulation-based representations and human pitch judgments, which instead follow the stimulus autocorrelation.

Alternatively, to the extent that timing information is preserved in central auditory stations (e.g. perhaps sparsely distributed over greater and greater numbers of neurons), purely temporal processing strategies are possible. At the level of the cochlear nucleus and inferior colliculus, population-interval representations do not appear to be degraded at higher levels, and continue to correspond well with stimulus autocorrelation functions (and to pitch judgments). It remains to be seen how much timing information is available at thalamocortical stations (I will bring whatever evidence and data I have by then that bears on this question).

We will discuss how neural timing networks consisting of coincidence detectors and tapped delay lines can carry out temporal analyses (cross-correlations, auto-correlations, convolutions, anticorrelation/cancellation) on time structured inputs to extract similarities and differences. Networks with many recurrent, reverberant delay paths can store incoming patterns as temporal memory traces that are continuously cross-correlated with subsequent incoming ones, thereby building up sets of temporal expectations that do not require explicit measurement of time intervals (i.e. no internal clocks, periodicity detectors or highly tuned delay elements). In such reverberant networks the input and output signals are themselves time patterns (e.g. ensemble all-order interval statistics), such that the high precision of the representation continues to reside in spike timings, rather than in which neurons respond how much.
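
One primitive of such a network, a bank of tapped delay lines feeding coincidence detectors, computes in effect the all-order interspike-interval distribution of a spike train. The sketch below (a direct pairwise computation standing in for literal delay lines; spike times and bin sizes are illustrative) shows how a phase-locked train yields an interval distribution that mirrors the stimulus autocorrelation.

```python
import numpy as np

def all_order_intervals(spike_times, max_lag, bin_width):
    """All-order interspike-interval histogram: the count a coincidence
    detector behind a delay line of each lag would accumulate."""
    t = np.asarray(spike_times, dtype=float)
    d = (t[None, :] - t[:, None]).ravel()
    d = d[(d > 0) & (d <= max_lag)]
    edges = np.arange(0, max_lag + bin_width, bin_width)
    hist, _ = np.histogram(d, edges)
    return edges[:-1], hist

# Spikes phase-locked to a 200-Hz stimulus (5-ms period): the interval
# distribution peaks at the period, mirroring the stimulus autocorrelation.
spikes = np.arange(0, 200, 5.0)        # spike times in ms
lags, hist = all_order_intervals(spikes, max_lag=20.0, bin_width=0.5)
peak_lag = lags[np.argmax(hist)]
```

In a timing net the histogram itself need never be read out as a spatial pattern: the coincidence outputs are again spike trains, so the precision stays in spike timing.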

[Supported by NIH NIDCD Grant DC03054]

P. Heil - Some New Ideas on Envelope Coding in the Auditory System

Please see page 101 of the ASI Proceedings.

Hynek Hermansky and Misha Pavel - RASTA Model and Forward Masking

Please see page 107 of the ASI Proceedings.

S.L. McCabe - Cortical Synaptic Depression and Auditory Perception

The identification of the order of individual signals within a sequence of signals is surprisingly difficult if the signals are presented rapidly (Warren, 1982). Although it is possible to recognise and discriminate between the sounds of different orderings, so-called 'temporal compounds', it is generally not possible to name the order of the individual signals if signal durations fall below about 200 ms. It has been shown that when pure tone signals are used, performance in this task improves as the frequency difference between them increases and preliminary experimental work suggests that spectral overlap may similarly affect performance when complex stimuli are used. We propose the hypothesis that synaptic depression may give rise to the experimentally observed spectral interference effects in Temporal Order Identification (TOI) tasks and use simulations of a recently developed model of synaptic dynamics to explore and illustrate these ideas.
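
As one concrete possibility, synaptic depression of the kind invoked above can be sketched with a Tsodyks-Markram-style model. This is a stand-in illustration, not necessarily the abstract's specific model of synaptic dynamics: each presynaptic spike consumes a fraction U of the available resources, which then recover exponentially, so rapid tone sequences driving overlapping spectral channels arrive on depressed synapses.

```python
import numpy as np

def depressing_synapse(spike_times, U=0.5, tau_rec=0.8):
    """EPSP amplitudes at a depressing synapse (Tsodyks-Markram style).

    x is the fraction of available resources; each spike releases U*x,
    and x recovers toward 1 with time constant tau_rec (seconds)."""
    x, last, amps = 1.0, None, []
    for t in spike_times:
        if last is not None:
            x = 1.0 - (1.0 - x) * np.exp(-(t - last) / tau_rec)
        amps.append(U * x)
        x -= amps[-1]
        last = t
    return np.array(amps)

fast = depressing_synapse(np.arange(8) * 0.01)   # 100/s train: strong depression
slow = depressing_synapse(np.arange(8) * 1.0)    # 1/s train: mostly recovered
```

On this picture, signals shorter than about 200 ms fall within the recovery time, so spectrally overlapping items interfere, which is the proposed source of the TOI effects.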

Brian P. Strope and Abeer A. Alwan - Modeling the Perception of Pitch-Rate Amplitude Modulation in Noise

Currently, most automatic speech recognition systems integrate spectral estimates over multiple pitch periods and remove explicit pitch and voicing information. However, amplitude modulation cues in voiced speech provide a robust and salient pitch perception which may be instrumental for recognizing speech in noise. In this study, three psychoacoustic models are used to predict the temporal modulation transfer function (TMTF) and the detection of voicing for high-pass filtered natural fricatives in noise. Models using an envelope statistic and modulation filtering can predict the TMTF data, while predictions from a model using a summary autocorrelogram approximate both data sets.

R.W. Ward Tomlinson and Gerald Langner - Temporal Processing in the Auditory System: The Functional Significance of Neural Noise

The temporal behavior of computer simulations of neurons was investigated at different levels of noise. The model neuron consisted of three passive compartments and one active set of Na and K channels modeled after cortical pyramidal neurons. The only variables were the level of background noise and the stimulus period; all other parameters were fixed at known physiological values. The unit firing rate and phase-locking to different stimulus periods were measured. Three different regimes were defined that depended on noise level, stimulus intensity and threshold.
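
A much-reduced illustration of the noise dependence, using a one-compartment leaky integrate-and-fire unit with a subthreshold sinusoidal drive in place of the multi-compartment Na/K model, shows the qualitative effect: without noise the unit is silent, while moderate noise produces firing that is phase-locked to the stimulus period. All parameter values are illustrative.

```python
import numpy as np

def vector_strength(spike_times, period):
    """Phase-locking index: 1 = perfect locking, 0 = none."""
    ph = 2 * np.pi * (np.asarray(spike_times) % period) / period
    return np.hypot(np.cos(ph).sum(), np.sin(ph).sum()) / max(len(spike_times), 1)

def lif_spikes(noise_sd, amp=0.9, period=10.0, tau=5.0, thresh=1.0,
               dt=0.1, t_end=5000.0, seed=4):
    """Leaky integrate-and-fire unit driven by a subthreshold sinusoid
    plus Gaussian current noise (Euler integration, times in ms)."""
    rng = np.random.default_rng(seed)
    v, spikes = 0.0, []
    for i in range(int(t_end / dt)):
        t = i * dt
        drive = amp * np.sin(2 * np.pi * t / period)
        v += dt / tau * (drive - v) + noise_sd * np.sqrt(dt) * rng.standard_normal()
        if v >= thresh:
            spikes.append(t)
            v = 0.0
    return spikes

quiet = lif_spikes(0.0)        # subthreshold drive alone: no spikes
noisy = lif_spikes(0.3)        # moderate noise: phase-locked firing
vs = vector_strength(noisy, 10.0)
```

Sweeping `noise_sd` maps out regimes analogous to those in the abstract: silence, noise-assisted phase-locked firing, and noise-dominated firing where locking degrades.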

Masashi Unoki and Masato Akagi - A Computational Model of Co-Modulation Masking Release

Please see page 129 of the ASI Proceedings.

### Pitch

Julyan H.E. Cartwright, Diego L. Gonzalez, and Oreste Piro - Pitch Perception of Complex Sounds: Nonlinearity Revisited

Please see page 141 of the ASI Proceedings.

Yidao Cai, JoAnn McGee and Edward J. Walsh - Processing of Pitch Information in Complex Stimuli by a Model of the Octopus Cell in the Cochlear Nucleus

A model of the octopus cell in the posteroventral cochlear nucleus was used to study the processing of pitch information in complex harmonic stimuli. As described previously (Cai et al., J. Neurophysiol. 78, 872-883, 1997), the model has a soma and an axon, both containing active channels, and four identical, passive dendrites. The axon and soma are each represented by a single compartment and each dendrite is represented by 20 compartments. The axon compartments contain Hodgkin-Huxley-like Na and K channels, which are responsible for initiating action potentials. The soma compartment contains two additional active mechanisms: a low-threshold, 4-AP-sensitive K channel and a Cs-sensitive, hyperpolarization-activated inward rectifier. The inputs to the model were auditory-nerve fiber spike trains recorded from anesthetized cats. In addition to tone bursts of different frequencies, harmonic or inharmonic complexes, similar to those used in psychophysical studies, were used to collect the auditory-nerve data. The inputs, all excitatory and with the dynamics of an alpha function, were applied to different locations on the model. With a tone burst of 500 Hz, the output of the model exhibits a strongly phase-locked response; with tones of higher frequencies, the model exhibits On-I or On-L PSTH patterns, depending upon model parameters. In response to three-component (800, 1000 and 1200 Hz) or six-component (1000-2000 Hz with 200 Hz spacing) harmonic complexes, the model produces sharply defined peaks in the PSTHs at every cycle of the fundamental, regardless of the presence of the fundamental component (200 Hz). These results are expected, since psychophysical studies have shown that the fundamental component is not essential for pitch perception. However, when the fundamental component is not present in the stimuli, there are higher peaks in the steady-state portion of the PSTHs and more tightly distributed peaks in the ISIHs.
With either a frequency-shifted version of the three-component complex (850, 1050 and 1250 Hz) or a six-component harmonic but random-phase complex as stimuli, the model produces less synchronized and weaker responses during the steady-state portion of the PSTHs, and wider peaks in the ISIHs. In the case of the frequency-shifted version of the three-component complex, the average interspike interval also decreases slightly. Psychophysically, an amplitude-modulated tone (a 1000-Hz tone modulated by a 200-Hz tone) and its inverted version produce the same pitch. The model responses to these two stimuli are also similar, although the inverted version results in a slightly wider peak in the ISIH. Overall, response patterns obtained using complex stimuli are less sensitive to changes of model parameters than responses obtained using pure tones. In summary, the model results are basically consistent with the hypothesis that the interspike interval is a correlate of pitch, with the exception of the results produced by random-phase stimuli.

Work supported by NIDCD DC01007. Yidao Cai is supported, in part, by NIDCD P60 DC00982-06.

Martin F. McKinney and Bertrand Delgutte - Correlates of the Subjective Octave in Auditory-nerve Fiber Responses: Effect of Phase-locking and Refractoriness

The primary aim of our research is to understand the neural representation of musical pitch. Our general strategy is to correlate frequency estimates based on neural responses with human pitch judgments. We have previously shown that a model operating on auditory-nerve (AN) interspike intervals (ISIs) quantitatively predicts the octave enlargement effect, i.e., listeners' preference for octave ratios slightly greater than 2:1 (McKinney and Delgutte, ARO abstracts, 1995). These predictions result from biases in frequency estimates caused by small, but systematic deviations in the ISIs from multiples of the stimulus period. Two types of ISI deviations exist: For stimulus frequencies less than 400 Hz, ISIs are slightly smaller than multiples of the stimulus period; for stimulus frequencies greater than 400 Hz, ISIs are slightly greater than multiples of the stimulus period. The goal of the present research is to elucidate the causes of these ISI deviations. Using both data and a model, we show that the two types of deviations are caused by different phenomena.

All analyses were performed using both physiological and simulated AN responses. Physiological single-unit responses were obtained from the AN of Dial-anesthetized cats for pure-tone stimuli. Simulated AN responses were synthesized using a multiplicative point-process model for AN excitation which predicts a wide variety of physiological behavior (Johnson and Swami, J. Acoust. Soc. Am., 74:493-501, 1983). The model instantaneous discharge rate is the product of two components, one representing the stimulus drive and the other representing the refractory properties of the fiber. ISIs were characterized for both model and physiological data, and deviations from multiples of the stimulus period were measured.
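
The multiplicative structure can be sketched as a discrete-time point process whose instantaneous rate is the product of a stimulus-drive term and a recovery function of the time since the last spike. The recovery function below is an illustrative placeholder, not Johnson and Swami's fitted one.

```python
import numpy as np

def simulate_fiber(drive, dt, abs_ref=8e-4, tau_rel=1e-3, seed=5):
    """Multiplicative point-process AN model: rate(t) = drive(t) * r(u),
    where r is the refractory recovery at age u since the last spike."""
    rng = np.random.default_rng(seed)

    def recovery(u):
        # hypothetical recovery: absolute then exponential relative phase
        return 0.0 if u < abs_ref else 1.0 - np.exp(-(u - abs_ref) / tau_rel)

    spikes, last = [], -np.inf
    for i, s in enumerate(drive):
        t = i * dt
        if rng.random() < s * recovery(t - last) * dt:   # Bernoulli step
            spikes.append(t)
            last = t
    return np.array(spikes)

dt = 1e-5                                    # 10-us steps, 1 s of stimulus
t = np.arange(int(1.0 / dt)) * dt
drive = 200.0 * (1.0 + np.cos(2 * np.pi * 500.0 * t))   # 500-Hz tone drive
spikes = simulate_fiber(drive, dt)
isis = np.diff(spikes)
```

Histogramming `isis` against multiples of the 2-ms stimulus period is exactly the deviation measurement described above; making `recovery` richer (e.g. phase-dependent conduction delays) is what the authors argue is needed to match the mid-frequency data.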

In response to low-frequency (<400 Hz) pure-tone stimuli, AN ISIs tend to be shorter than multiples of the stimulus period. This is a direct result of the response being phase-locked to the stimulus and is caused by the fiber discharging twice in the same half-period of the stimulus. ISIs preceding and following the spikes within the same half-period are shorter, on average, than the stimulus period. This results in biases in the modes of the ISI distributions for low-frequency stimuli. Conditioning a spike sequence to eliminate all ISIs preceding and following the sets of two spikes within the same half-period results in an ISI distribution with no biases in the modes. The multiplicative model quantitatively predicts this effect.

An opposite deviation exists for AN ISIs in response to mid-frequency pure-tone stimuli. ISIs in response to tones between 400 Hz and 3000 Hz are slightly larger than multiples of the stimulus period. The physiological data show that the AN fiber fires at a later phase (re stimulus) when closely preceded by a spike. This phase delay for short ISIs is consistent with the idea that conduction velocity is slower during the relative refractory period and leads directly to biases in the modes of ISI distributions. The multiplicative model predicts small phase delays and small ISI deviations but the predicted deviations are much smaller than those in the physiological data. Thus the multiplicative model does not accurately represent the refractory properties of AN fibers. Of course, there may also be some other cause besides refractoriness, such as synaptic events, for the ISI deviation.

These results suggest that computational models of pitch based on ISIs may have to simulate detailed statistical properties of AN fibers in order to correctly predict pitch effects such as the octave enlargement.

Supported by Grants DC02258 and DC00038 from the NIDCD, National Institutes of Health

Mark I. Sanderson and Andrea M. Simmons - Neural Coding of the Pitch of Narrow-Band Signals

Responses of bullfrog eighth-nerve fibers to 2- and 3-component narrow-band stimuli were examined. Harmonic structure and relative amplitude of stimulus components varied. Fibers with best frequencies above about 500 Hz (high AP and BP fibers) systematically extracted the stimulus period of harmonic stimuli in their patterns of phase-locked responses. When stimuli were inharmonic, peaks in interval histograms of fiber responses were near or around the period of the frequency spacing (Δf) of the sidebands. Phase locking was captured primarily by the Δf of the sideband closest to fiber BF. Low AP fibers showed a greater diversity of responses to both harmonic and inharmonic stimuli, and phase-locked to stimulus components, the Δf of sidebands, and distortion products. Data are discussed in relation to a sandwich model of periodicity extraction, and to a model based on an envelope-weighted average of stimulus instantaneous frequency.

W.A. Yost and D. Mapes-Riordan - Computing Summary Correlograms

Several computational models make use of autocorrelation or similar mechanisms to extract information about temporal regularity in a sound waveform. In many cases the autocorrelation information is summed across spectrally-tuned channels to produce a summary correlogram for use as a primary decision statistic. In most cases the summary correlogram is computed across all tuned channels in the model or across those channels that represent the bandwidth of the stimulus. Such computations are often done without regard to any weighting of the information from the various channels. For instance, the summary correlogram for a harmonic series is computed based on all channels in the model that are within the spectrum of the sound, and the peaks in the summary correlogram are used to describe the pitch of the harmonic sound. Meddis and Hewitt (JASA 89, 1862-1882, 1991) have shown that this approach of using summary correlograms can do an excellent job of accounting for a large range of pitch phenomena. The pitch and pitch strength of Regular Interval Stimuli such as Iterated Ripple Noise have been modeled with the use of computational models and summary correlograms (see Yost, Patterson, and Sheft, JASA 99, 1066-1078, 1996; or Patterson, Handel, Yost, and Datta, JASA 100, 3286-3294, 1996). Several versions of iterated ripple noise produce pitches that cannot be accounted for based on unweighted, wideband summary correlogram computations. These pitches and their pitch strengths can be accounted for if a narrowband computation is used. This poster will describe the stimulus conditions that produce these pitches. We will further describe the type of narrowband computations that appear to be required in order to use summary correlograms to predict the pitches and their pitch strengths. We will also suggest mechanisms (e.g., lateral inhibitory processes) that might be used to weight the information before a summary correlogram is computed.
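
The weighted-versus-unweighted computation can be made concrete: sum normalized per-channel autocorrelations, optionally with a weight vector. This is a bare numpy sketch with sinusoids standing in for channel outputs; the published models put a full auditory periphery in front of this step.

```python
import numpy as np

def summary_correlogram(channels, max_lag, weights=None):
    """Sum (optionally weighted) normalized autocorrelations across
    spectrally tuned channels."""
    if weights is None:
        weights = np.ones(len(channels))            # unweighted = wideband
    acs = []
    for x in channels:
        x = np.asarray(x, dtype=float)
        ac = np.array([np.dot(x[:len(x) - k], x[k:])
                       for k in range(max_lag + 1)])
        acs.append(ac / ac[0])                      # normalize each channel
    return np.tensordot(weights, np.array(acs), axes=1)

fs = 20000
t = np.arange(4000) / fs
channels = [np.sin(2 * np.pi * f * t) for f in (200.0, 400.0, 600.0)]

sc = summary_correlogram(channels, max_lag=150)
search = np.arange(40, 151)                 # restrict to a plausible pitch range
pitch_lag = search[np.argmax(sc[search])]   # 100 samples -> 5 ms -> 200 Hz
```

Passing a non-uniform `weights` vector (e.g. one that emphasizes a narrow group of channels) is the narrowband computation the abstract argues is needed for some iterated-ripple-noise pitches.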

### July 10 Speech

M. Brucke, W. Nebel, A. Schwarz, B. Mertsching, M. Hansen, B. Kollmeier - Digital VLSI-Implementation of a Psychoacoustically and Physiologically Motivated Speech Preprocessor

I will present some of our work on the VLSI implementation of a psychoacoustical preprocessing model. The model consists of several stages motivated by the signal processing in the human ear and has been successfully applied to a wide range of psychoacoustical experiments by the medical physics group at the University of Oldenburg. The first stage of the model is a 30-channel gammatone filterbank with center frequencies from about 70 Hz to 6.7 kHz, simulating the filtering by the basilar membrane. Each channel is followed by halfwave rectification and a 1-kHz lowpass to model the hair-cell transduction. After that, the signals are fed into a chain of five nonlinear adaptation loops with different time constants: stationary input is compressed nearly logarithmically, while fast changes are transformed more linearly. The resulting signals are then lowpass filtered with a time constant of 20 ms to account for effects of temporal integration.
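
The adaptation-loop stage has a simple structure: divide the signal by a low-pass-filtered copy of its own output, five times in cascade. The sketch below uses illustrative time constants rather than the model's exact values; it demonstrates the near-logarithmic stationary compression, since a constant input v settles to v^(1/32) after five loops.

```python
import numpy as np

def adaptation_loops(x, fs, taus=(0.005, 0.05, 0.129, 0.253, 0.5)):
    """Cascade of five divisive adaptation loops: each stage divides the
    signal by a low-pass-filtered copy of its own output."""
    y = np.maximum(np.asarray(x, dtype=float), 1e-5)   # avoid division by zero
    for tau in taus:
        a = np.exp(-1.0 / (fs * tau))                  # one-pole lowpass coeff
        state = y[0]
        out = np.empty_like(y)
        for i, v in enumerate(y):
            out[i] = v / state
            state = a * state + (1.0 - a) * out[i]
        y = out
    return y

fs = 16000
x = np.full(5 * fs, 0.25)              # 5 s of stationary input
y = adaptation_loops(x, fs)
# stationary value compresses toward 0.25 ** (1/32), i.e. near-log behavior
```

The nonlinearity of exactly this divide-by-state recursion is what makes the fixed-point precision analysis on the chip hard, as discussed below.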

I will show some of the problems (and our solutions) which arise when implementing a "well-known" piece of software on a chip. One problem, for example, is the necessary precision of the calculation, which determines the width of the internal registers on the chip. For LTI systems like the filterbank it is easy to guarantee numerical stability, but the model has nonlinear stages. We developed a method to find the necessary internal precision by observing the distortions of transmitted speech caused by quantization errors, using an objective speech-quality measure. The disadvantage of this method is the enormous amount of data (several megabytes) needed for simulation, which is not practicable for daily work. In my work I try to develop a method based on a psychoacoustical model and well-known "psychoacoustical" test sounds (notched noise, sine waves, etc.) which needs less data and is more practical for the daily work of a hardware designer.

Piero Cosi - Δ, ΔΔ, ΔΔΔ, ΔΔΔΔ... Evidence Against Frame-Based Analysis Techniques

The need for Δ, ΔΔ, ΔΔΔ, ΔΔΔΔ... measures is a clear sign of the loss in representation capability of classical frame-based analysis techniques. In fact, almost every acceptable ASR system is forced to introduce this kind of post-processing technique in order to compensate for that loss. Following previous work on Auditory Modeling (AM) techniques as a speech-analysis front end for automatic speech segmentation (ASS) [1-3] and automatic speech recognition (ASR) [4-6], evidence against frame-based analysis techniques, and thus against the need for Δ, ΔΔ..., will be given and exploited in this paper. Various examples, mostly on plosives or other non-stationary consonants, will be illustrated, with the aim of underlining the superiority of "sampling after processing" relative to "framing before processing" in speech segmentation and speech recognition tasks.

References
[1] Cosi P., "SLAM: Segmentation and Labelling Automatic Module", in Proceedings of EUROSPEECH-93, 3rd European Conference on Speech Technology, Berlin, Germany, 21-23 September, Vol. 1, 1993, pp. 88-91.
[2] Cosi P., "SLAM: a PC-Based Multi-Level Segmentation Tool", in Speech Recognition and Coding. New Advances and Trends, A.J. Rubio Ayuso and J.M. Lopez Soler (eds.), NATO ASI Series, Computer and Systems Sciences, Springer Verlag, Vol. F 147, 1995, pp. 124-127.
[3] Cosi P., "Ear Modelling for Speech Analysis and Recognition" (1992), Proceedings of "Comparing Speech Signal Representations", ESCA Tutorial and Research Workshop, Sheffield, England, 8-9 April 1992, ISSN 1018-4554 (to be published in a J. Wiley & Sons Ltd. book).

[4] Cosi P., Magno Caldognetto E., Vagges K., Mian G.A. and Contolini M. (1994), "Bimodal Recognition Experiments with Recurrent Neural Networks", Proceedings of IEEE ICASSP-94, Adelaide, Australia, 19-22 April, 1994, Vol. 2, Session 20.8, pp. 553-556.

[5] Cosi P., Dugatto M., Ferrero F., Magno Caldognetto E., and Vagges K. (1995), "Bimodal Recognition of Italian Plosives", Proc. 13th International Congress of Phonetic Sciences, ICPhS95, Stockholm, Sweden, 1995, Vol. 4, pp. 260-263.

[6] Cosi P., Magno Caldognetto E., Ferrero F.E., Dugatto M. and Vagges K., "Speaker Independent Bimodal Phonetic Recognition Experiments", Proceedings of ICSLP-1996, Philadelphia, PA, USA, October 3-6, 1996, Vol. 1, pp. 54-57.

Stuart Cunningham and Martin Cooke - Evidence and counter-evidence in human speech perception and automatic speech recognition: a perceptual/modelling study of auditory spectral induction.

Human speech perception is robust even in adverse conditions in which extraneous sounds mask large spectro-temporal regions of the speech signal. Auditory induction, the ability to 'fill in' those portions of the speech signal which have been obliterated, may aid communication in noisy environments, yet few attempts have been made to incorporate induction strategies into Automatic Speech Recognition (ASR) systems. This poster will report on a joint perceptual/modelling study which explores the recent discovery by Warren and colleagues [1] of auditory spectral induction. In the present study, listeners were presented with sequences of digits which had been filtered into two narrow bands centred at 370 Hz and 6000 Hz. Identification performance was measured as a function of the level of a noise band placed in the spectral gap between these two cue bands. The results are used to inform the design of a computational model which employs missing data techniques [2,3] to recognise from the cue bands alone, with and without counter-evidence provided by the noise band.

[1] Warren, R.M., Hainsworth, K.R., Brubaker, B.S., Bashford, J.A., & Healy, E.W. (1997) 'Spectral restoration of speech: Intelligibility is increased by inserting noise in spectral gaps', Perception & Psychophysics, 59(2), 275-283.

[2] Cooke, M.P., Morris, A.C., & Green, P.G. (1996) 'Recognition of occluded speech', ESCA Workshop on the Auditory Basis of Speech Perception.

[3] Morris, A.C., Cooke, M.P. & Green, P.G. (1998) 'Some solutions to the missing feature problem in data classification, with application to noise robust ASR', ICASSP'98.

Werner A. Deutsch and Bernhard Laback - Spectral Layers in Speech and Music Perception

A modification of the phase vocoder has been applied in order to split acoustic signals into spectral layers according to a masking and overmasking paradigm. Spectral components below the irrelevance threshold have been made audible. Overmasking has been introduced by progressively flattening the spreading function of the masking process. This results in two audible signal parts, one containing only the weaker components, the second composed mainly of spectral peaks. In a certain range both parts of a speech signal are fully intelligible. In music, leading voices can be extracted and separated from the orchestral sound. Recent research indicates that this type of figure-background discrimination is preferred by listeners with sensorineural hearing impairment. The paper discusses whether or not the procedure described is suitable to compensate for degraded frequency selectivity in cochlear impairment. Sound samples and spectrograms are presented.

Hideki Kawahara - Straight: An Extremely High-Quality Vocoder for Auditory and Speech Perception Research

A new set of simple procedures called the STRAIGHT suite (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrogram) has been developed to enable the real-time manipulation of speech parameters. The proposed method uses pitch-adaptive spectral analysis combined with a surface reconstruction method in the time-frequency region, and an excitation source design based on phase manipulation. It also includes a pitch extraction method using instantaneous frequency calculation based on a new concept called 'fundamentalness'. The proposed procedures preserve the details of time-frequency surfaces while almost perfectly removing fine structures due to signal periodicity. This close-to-perfect separation allows for over 600% manipulation of such speech parameters as pitch, vocal tract length, and speaking rate, without further degradation due to the parameter manipulation. Consequently, the proposed method is an ideal tool for investigating the perceptual correlates of acoustic speech parameters.

http://www.hip.atr.co.jp/~kawahara/STRAIGHT.html

Valerir Lloyd - The roles of working memory span, contextual and acoustic cues in spoken language comprehension.

The purpose of this research is to investigate the relative effects of working memory span (wms), sentence complexity, background noise, and contextual vs acoustic cues on how young, normal-hearing adult listeners recognize target words in spoken sentences.

Dillon (1995) found that, when normal-hearing young adults listened to sentences, their comprehension declined as the level of competing background noise increased, especially for more complex sentence types. It has been suggested that similar poor comprehension of complex sentences in aphasics (Caplan, 1985) may be a result of limitations in working memory resources rather than a loss of competence (Carpenter et al., 1995).

In order to test the hypothesis that poor comprehension is attributable to reduced wms, young normal-hearing subjects with high and low reading wms (Daneman & Carpenter, 1980) were administered a sentence comprehension task. Subjects were also required to complete an on-line memory task.

Preliminary results indicate that high wms listeners make fewer word recognition errors than low wms listeners when sentences are presented in high background noise. More specifically, it appears that contextual cues (word-frequency) are weighted more heavily in high noise and high sentence complexity conditions for high compared to low wms listeners, and that acoustic cues (voicing) also carry more weight for high wms listeners as background noise increases.

S.J.Makin and Guy J. Brown - Patterns of Confusions Made by Models of Double Vowel Identification: A Comparison with Human Data

It is well established that the identification of a pair of simultaneous, steady-state vowels improves with a difference in fundamental frequency (dF0), increasing rapidly up to about 1 semitone and then asymptoting (Scheffers (1983); Assmann and Summerfield (1990)). A number of physiologically motivated computational models of this experimental finding have been proposed (Assmann and Summerfield (1990); Meddis and Hewitt (1992); Culling and Darwin (1994)). For example, Meddis and Hewitt's (1992) scheme employs a bank of band-pass filters and inner hair cell models to simulate the auditory periphery. Autocorrelation functions (ACFs) are then computed for each channel and pooled by summing across channels. This pooled ACF is used to derive a dominant pitch estimate, and individual channel ACFs showing a pitch peak at this period are segregated from those which do not. The short time lag (timbre) regions of the two separate pooled ACFs are then used to identify the two vowels using a template matching procedure. Although this model under-predicts listeners' performance for zero dF0, the overall match is impressive.
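The pool-then-segregate step of such a scheme can be sketched in a few lines. This is only a skeletal illustration of the channel-grouping idea, not the Meddis and Hewitt implementation: the lag limits and the 0.8 grouping threshold are arbitrary choices:

```python
import numpy as np

def segregate_by_pitch(channels, fs, max_lag_ms=20):
    # channels: per-channel simulated auditory-nerve signals (same length).
    max_lag = int(fs * max_lag_ms / 1000)
    acfs = []
    for ch in channels:
        n = len(ch)
        acfs.append(np.correlate(ch, ch, mode='full')[n - 1:n - 1 + max_lag])
    acfs = np.array(acfs)
    pooled = acfs.sum(axis=0)                 # pooled ACF across channels
    min_lag = int(fs / 500)                   # ignore pitches above 500 Hz (assumption)
    p = min_lag + np.argmax(pooled[min_lag:]) # dominant pitch period (samples)
    # Channels whose own ACF peaks near the dominant period form group 1.
    group1 = [i for i, ac in enumerate(acfs)
              if ac[p] > 0.8 * ac[min_lag:].max()]   # heuristic threshold
    group2 = [i for i in range(len(channels)) if i not in group1]
    return p, group1, group2
```

With two channels dominated by a 100 Hz tone and a weaker 125 Hz channel, the pooled peak lands near the 10 ms period and the 125 Hz channel is correctly segregated into the second group.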

In fact, many other models of double vowel perception show this pattern of increasing identification up to 1 semitone, and most employ quite different schemes to do so (e.g. Culling and Darwin's (1994) approach based on low-frequency beating; De Cheveigne's (1997) cancellation scheme; Brown and Wang's (1997) neural oscillators). A more discriminative measure of the models' performance would therefore be useful.

In the study reported here, we have investigated the pattern of confusions made by human listeners in a double vowel identification task. The poster reports our experimental findings, together with a modelling study which assesses the ability of the Meddis and Hewitt model to reproduce our confusion data.

References

[1] Assmann, P. F., and Summerfield, Q. (1990). "Modelling the perception of concurrent vowels: Vowels with different fundamental frequencies," J. Acoust. Soc. Am. 88, 680-697.

[2] Brown, G., and Wang, D. (1997). "Modelling the Perceptual Segregation of Double Vowels with a Network of Neural Oscillators," Neural Networks, Vol. 10, No. 9, 1547-1558.

[3] Culling, J. F., and Darwin, C. J. (1994). "Perceptual and computational separation of simultaneous vowels: cues from low-frequency beating," J. Acoust. Soc. Am. 95, 1559-1569.

[4] De Cheveigne, A. (1997). "Concurrent vowel segregation III: A neural model of harmonic interference cancellation," J. Acoust. Soc. Am. 101, 2857-2865.

[5] Meddis, R., and Hewitt, M. (1992). "Modelling the identification of concurrent vowels with different fundamental frequencies," J. Acoust. Soc. Am. 91, 233-245.

[6] Scheffers, M.T.M. (1983). "Sifting Vowels: Auditory Pitch Analysis and Sound Segregation," Ph.D. thesis, Rijksuniversiteit te Groningen, The Netherlands.

Eduardo Sá Marta, Luis V. Sá - Auditory Cells with Frequency Resolution Sharper than Critical Bands Play a Role in Stop Consonant Perception: Evidence from Cross-Language Recognition Experiments

Please see page 173 of the ASI Proceedings.

Fernando S. Perdigao and Luis V. Sá - Auditory Models as Front-Ends for Speech Recognition

Please see page 179 of the ASI Proceedings.

C.J. Sumner and D.F. Gillies - Shunting Lateral Inhibitory Networks for Processing Auditory Nerve Signals

We describe the use of shunting (multiplicative) lateral inhibitory networks (LINs) of the type described by Grossberg (1982) for processing auditory nerve signals as produced by a typical model of the ear. Lateral inhibitory networks have been proposed for improving the representation of auditory spectra, principally through spectral sharpening. Our approach contrasts with previous approaches (Shamma 1989, 1992), which have utilised additive networks. Compared with an additive LIN model, a shunting network offers improved normalization and a more consistent representation of the same stimulus presented at different gain levels.
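The normalization property can be seen directly from the steady state of Grossberg's shunting on-center/off-surround equation. The sketch below solves that steady state for a spectral profile; the flat surround kernel and parameter values are illustrative assumptions, not the authors' settings:

```python
import numpy as np

def shunting_lin(e, A=1.0, B=1.0, C=0.0, kernel=None):
    # Steady state of the shunting network
    #   dx_i/dt = -A*x_i + (B - x_i)*e_i - (x_i + C)*s_i,
    # where s_i is the weighted sum of surround (inhibitory) inputs.
    # Setting dx_i/dt = 0 gives x_i = (B*e_i - C*s_i) / (A + e_i + s_i).
    e = np.asarray(e, float)
    if kernel is None:
        kernel = np.ones(5)
        kernel[2] = 0.0                      # flat surround, no self-inhibition
    s = np.convolve(e, kernel, mode='same')
    return (B * e - C * s) / (A + e + s)
```

Because the input appears in the denominator, the response is bounded by B and depends on input *ratios* rather than absolute levels: scaling the whole input by 10 leaves the peak position intact and barely changes the output, whereas an additive network would scale its output tenfold.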

Jilei Tian, Kari Laurila, Ramalingam Hariharan, Imre Kiss - Front-End Design by Using Auditory Modeling in Speech Recognition

The human auditory system is the most capable speech recognizer known. If a computer-based speech recognition system could be designed that sufficiently reflects the processing of the auditory system, the resulting representations should be superior to representations based on the non-biological criteria commonly used in computer speech recognition algorithms. The potential advantage of using auditory modeling for speech recognition depends on how accurately the models mimic the human auditory system. Building such accurate models relies on the amount of knowledge we have about the auditory system, acquired by combining psychophysical and physiological data on auditory phenomena.

The purpose of front-end speech processing is to transform the original speech signal into a representation more suitable for recognition. In this paper, auditory modeling is used to design the front end of a speech recognition system. The basic idea is: (1) the original speech is preprocessed by a pre-emphasis filter and then segmented into frames; (2) a spectrum is obtained using a windowed FFT for each frame; (3) intensity is converted into loudness, which is directly linked to the ear's perception of speech; (4) because human hearing is not equally sensitive at all frequencies, an equal-loudness weighting is applied; (5) reflecting cochlear properties, the mel scale is used to divide the spectrum into subbands; (6) short-term adaptation is modelled and a firing rate is calculated for each subband; and (7) finally a DCT is applied to decorrelate the resulting features.
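Steps (1), (2), (3), (5) and (7) of this pipeline can be sketched as follows. This is a simplified stand-in, not the authors' front end: the loudness exponent (Stevens' law), the linearly spaced (rather than mel-spaced) band edges, and the omission of steps (4) and (6) are all simplifying assumptions:

```python
import numpy as np

def auditory_frontend(x, fs, frame_len=400, hop=160, n_sub=20, n_cep=13):
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])          # (1) pre-emphasis
    win = np.hamming(frame_len)
    n_bins = frame_len // 2 + 1
    edges = np.linspace(0, n_bins, n_sub + 2).astype(int)  # placeholder band edges
    # DCT-II matrix for the decorrelating transform, step (7)
    D = np.cos(np.pi / n_sub * (np.arange(n_sub) + 0.5)[None, :]
               * np.arange(n_cep)[:, None])
    feats = []
    for i in range(0, len(x) - frame_len + 1, hop):
        spec = np.abs(np.fft.rfft(x[i:i + frame_len] * win)) ** 2  # (2) FFT
        sub = np.array([spec[edges[k]:edges[k + 2]].sum()
                        for k in range(n_sub)])          # (5) subband energies
        loud = (sub + 1e-10) ** 0.33                     # (3) intensity -> loudness
        feats.append(D @ np.log(loud))                   # (7) decorrelating DCT
    return np.array(feats)
```

Each frame thus yields a 13-element feature vector, matching the dimensionality compared against MFCC in the experiments below.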

Preliminary results on TIMIT and Names97 (a Finnish name database collected by Nokia Research Center) are given. They show that the auditory-modeling-based front end with a 13-element feature vector has a slightly worse recognition rate than the standard front end (MFCC) in clean conditions, but performs better than MFCC with a 13-element feature vector in noisy conditions (SNRs of 5, 0, -5 and -10 dB). Based on these preliminary results, the auditory modeling front end is more robust to noise.

### Audio Coding

T. Engin Tuncer - Audio Coding by Using a Perceptual Model

Please see page 189 of the ASI Proceedings.

Ye Wang - An Assessment System of Psychoacoustic Models

Please see page 195 of the ASI Proceedings.

### Frequency Analysis

C.R. Day, W.A. Ainsworth and G.F. Meyer - A Comparative Study of the FFT and Reassigned Fourier Transform

Please see page 199 of the ASI Proceedings.

Toshio Irino and Masashi Unoki - A Time-Varying, Analysis/Synthesis Auditory Filterbank Based on an IIR Gammachirp Filter

Originally, the gammachirp was derived as a function satisfying minimal uncertainty in a time-scale representation. The gammachirp filter has been shown to be a good candidate for a level-dependent auditory filterbank, but it was conventionally implemented as an FIR filter, which precluded time-varying filtering. I would like to present (1) an efficient filterbank based on an IIR gammachirp filter design, (2) a control mechanism including signal level estimation, and (3) interesting and advanced features of this filterbank.
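For readers unfamiliar with the filter, the FIR (impulse-response) form of the gammachirp can be generated directly from its analytic definition. The parameter values below (order, bandwidth and chirp coefficients) are typical literature choices, not the ones used in this talk:

```python
import numpy as np

def gammachirp(fs, f0=2000.0, n=4, b=1.019, c=-2.0, dur=0.025):
    # Gammachirp impulse response (Irino & Patterson form):
    #   g(t) = t**(n-1) * exp(-2*pi*b*ERB(f0)*t) * cos(2*pi*f0*t + c*ln(t))
    # i.e. a gammatone envelope whose carrier chirps via the c*ln(t) term.
    t = np.arange(1, int(dur * fs)) / fs          # start at one sample to keep ln(t) finite
    erb = 24.7 * (4.37 * f0 / 1000 + 1)           # Glasberg & Moore ERB of f0
    g = (t ** (n - 1) * np.exp(-2 * np.pi * b * erb * t)
         * np.cos(2 * np.pi * f0 * t + c * np.log(t)))
    return g / np.abs(g).max()                    # peak-normalized
```

Setting c = 0 recovers the ordinary gammatone; the IIR design discussed in the talk approximates this response with a cascade of low-order recursive sections so the coefficients can be updated as the estimated signal level changes.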

### July 11 Auditory Scene Analysis and Adverse Acoustic Conditions

Uwe Baumann - A Procedure for Identification and Segregation of Multiple Auditory Objects

Many theories have been proposed about the ability of the human auditory system to group different musical voice sources into the perception of separate melodic lines. Analogies have been drawn with the figure-background task performed by the visual system. Psychologists have used Gestalt principles such as proximity, similarity and good continuation to investigate grouping strategies in auditory perception.

The intention of the paper is to present a procedure, implemented on a computer, to separate polyphonic music into its original voices. A hierarchical combination of ear-related spectral analysis, psychoacoustical weighting functions, and psychological elements and findings of Gestalt theory serves as the basis for this process. Several independent stages contribute to the task of abstraction and the selection of meaningful contours of spectral components. The aim is the grouping of related components into auditory objects. The ongoing sequence of auditory objects forms an auditory object pattern.

For the determination of thresholds above which listeners can hear out auditory objects separately, a new test paradigm was developed. Two tunes consisting of six harmonic complex tones were used to form a foreground melodic figure. According to the experimental paradigm, one or more harmonics in each tone were altered in a way that the modified components, depending on the amount of the adjustment, could elicit the perception of a second (background) melody. The task of the subjects in the listening experiments was to indicate the contour of the background melody in comparison with a preceding, unmodified melody. As a result of the experiments, very distinct adjustments are necessary to enhance the perception of the background melody. If the onset delay of the modified spectral components exceeds 3.8 ms, recognition of the background melody was possible. Detuning by more than 0.9%, or an increase of the component's level by about 2.4 dB, also enabled auditory separation of the second melody. A decrease in level leads to a change of sound quality, but the observers were not able to indicate accurately the contour of the background melody. Some experiments addressed the question of how frequency modulation (FM) acts upon the recognition of the background melody. The results showed a discrimination threshold at 1% depth of FM. This result was very similar to the data obtained from the static mistuning experiment. Hence it can be assumed that there is no dependency on the type of FM (static or periodic). There was no evidence for an enhancement of recognition due to coherent FM: the coherent condition showed no improvement of threshold compared to the incoherent condition.

The results of the experiments were incorporated into a procedure for identification and segregation of auditory objects. The stages of the model are discussed in detail, starting with the description of a new ear related spectral transformation (EESA, Excitation equivalent spectral analysis). The following steps include processes for obtaining spectral contour, common onsets, spectral pitch tracks and virtual pitch. Coincident onsets are put together to form a pattern of auditory objects. After the detection of weaker part tones that might have been masked, the last task of the model tries to chain acoustical objects in a way that the resulting melody is similar to the perception a human listener can voluntarily attend to.

An evaluation of the procedure with several examples of polyphonic music and speech utterances was carried out. The quality of the segmentation depends on the complexity of the material. Simple two-voiced polyphonic music with little reverberation is segregated with fair quality.

G. Clifford Carter and G. Betancourt - Emerging Architectures for Cognitive Neuroscience (CNS) Underwater Systems

Please see page 217 of the ASI Proceedings.

Karsten H. Lehn - Modelling the Cocktail-Party Effect using Multiple Cues of Auditory Scene Analysis

Research on understanding and modelling the human auditory system helps to improve technical acoustical systems, e.g. hearing aids or speech enhancement systems. One promising psychoacoustical modelling approach, introduced by Bregman (1990), is the field of auditory scene analysis. The starting point of primitive auditory scene analysis is an internal time-frequency representation of the incoming signal mixture. From this representation different features, e.g. amplitude modulation, frequency modulation, common signal onset and offset, periodicity and spatial position features, can be derived and used to perform an auditory grouping of the time-frequency elements. The resulting groups are likely to correspond to the acoustical events of the sound-emitting sources.

In previous work (Lehn 1997a, 1997b) it has been shown that this mechanism can be modelled by a grouping approach called "temporal fuzzy cluster analysis".

Fundamental psychoacoustical streaming experiments, originally performed by Bregman, are reproduced using this model. In addition, the most important features for concurrent speech segregation, i.e. spatial position and fundamental frequency features, are used by the model in a combined manner for performing a robust segmentation of auditory scenes.

In this paper the structure of the computational binaural auditory scene analysis model is reviewed briefly. After this, its relevance for modelling the cocktail-party effect is pointed out and it is shown how the model can be used as a speech enhancement system. A corpus of five English vowels at different fundamental frequencies and a set of head-related transfer functions are used to set up different virtual acoustical scenes providing a test environment for the system. The model is tested in scenes that supply optimal information only for the spatial position and fundamental frequency feature extractors respectively, and in scenes that contain both cues. The performance in enhancing the target speaker is assessed for these scenarios, using only spatial position features, only fundamental frequency features, and a combination of both types of features. As expected, it turns out that the hybrid system, which relies on binaural and monaural features, performs well in all conditions.

S.L. McCabe and M.J. Denham - A Thalamocortical Model of Auditory Streaming

The auditory system segregates incoming acoustic signals into perceptual representations of sound sources within the environment, but how and where this is done is not yet clear. Psychophysical experiments on auditory streaming provide many clues about processing within the auditory system and offer a good basis for the development of models of auditory processing within a simplified paradigm. In this paper we propose a physiological basis for the process of primitive streaming, exploring our ideas by means of a computational model of the system. The model provides a novel explanation of the formation of auditory streams, and a basis for the integration of other grouping cues and attention.

Keith D. Martin - Toward Automatic Sound Source Recognition: Identifying Musical Instruments

Please see page 227 of the ASI Proceedings.

Georg Meyer, William Barry and Jacques Koreman - Vowel Pre-Nasalization as a Cue for Auditory Scene Analysis

Please see page 233 of the ASI Proceedings.

E. Mousset - Information-Theoretic Criteria for the Integration of Auditory and Visual Spatial Information

Please see page 239 of the ASI Proceedings.

### Auditory Scene Analysis and Adverse Acoustic Conditions

Torben Poulsen and Carsten Daugaard - Equivalent Threshold Sound Pressure Levels for Acoustic Test Signals of Short Duration

Hearing thresholds were measured for broadband clicks of 100 µs duration and for brief tones (0.5, 1, 2, 4, and 8 kHz). The signal shape of the tone pulses was of type 2-3-2, i.e. each pulse consisted of two periods of linear rise and fall and three periods with constant amplitude in between. The measurements were performed with two types of headphones, Telephonics TDH-39 and Sennheiser HDA-200. The sound pressure levels were measured in an IEC 318 ear simulator with a Type 1 adapter (a flat plate) and a conical ring. The audiometric methods used in the experiments were the ascending method (ISO 8253-1) and a transformed up/down procedure. Twenty-eight normal-hearing test subjects (13 females and 15 males) aged 18 to 26 years participated in the experiments. The results of this investigation will be used in standardisation (ISO/TC43/WG1).
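The 2-3-2 pulse shape (two periods of linear rise, three at constant amplitude, two of linear fall) is straightforward to synthesize; a minimal sketch, with sample rate and frequency chosen only for illustration:

```python
import numpy as np

def tone_pulse_232(f, fs):
    # 2-3-2 tone pulse: 2 periods linear rise, 3 periods at full
    # amplitude, 2 periods linear fall -> 7 periods total.
    n = int(round(7 * fs / f))               # total length in samples
    t = np.arange(n) / fs
    rise = t * f / 2                         # 0 -> 1 over the first 2 periods
    fall = (7 / f - t) * f / 2               # 1 -> 0 over the last 2 periods
    env = np.minimum(1.0, np.minimum(rise, fall))
    return env * np.sin(2 * np.pi * f * t)
```

The trapezoidal envelope keeps the pulse's spectral splatter low while the three constant-amplitude periods define its nominal level.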

Davide Rocchesso, Francesco Scalcon and Gianpaolo Borin - Subjective Evaluation of the Inharmonicity of Synthetic Piano Tones

Please see page 251 of the ASI Proceedings.

Dekun Yang, Georg F. Meyer and William A. Ainsworth - Segregation and Recognition of Concurrent Vowels for Real Speech

For almost all perceptual audio coding algorithms, it is necessary to have a built-in psychoacoustic model in order to determine the maximum allowed quantisation noise. However, the psychoacoustic models suggested in the MPEG-1 and MPEG-2 standards can only perform a rough estimation of the masking threshold of the human auditory system. To improve an audio codec's performance, one of the key issues is to improve the accuracy of the applied psychoacoustic model. The system discussed in this paper provides a tool for the assessment of psychoacoustic models under investigation. It performs the time-to-frequency transformation of the input audio signal whilst running the psychoacoustic model in parallel; the output of the model is a masking threshold. This is followed by a noise generator whose spectrum matches the masking threshold. Noise injection is performed by summing the signal and the noise in the frequency domain, after which a frequency-to-time transformation is performed. The criterion for judging a model is the maximum allowable noise injection level without audible distortion.
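The noise-injection step described above can be sketched per frame as follows. This is a minimal illustration of the idea, assuming the model under test supplies a per-bin masking threshold as a linear magnitude:

```python
import numpy as np

def inject_masked_noise(frame, mask_thresh):
    # frame: one time-domain audio frame.
    # mask_thresh: per-bin masking threshold (linear magnitude) produced
    # by the psychoacoustic model under assessment.
    spec = np.fft.rfft(frame)                         # time-to-frequency
    phase = np.exp(1j * 2 * np.pi * np.random.rand(len(spec)))
    noise = mask_thresh * phase                       # noise shaped to the threshold
    return np.fft.irfft(spec + noise, n=len(frame))   # frequency-to-time
```

Scaling `mask_thresh` up until listeners first report a difference gives the maximum inaudible injection level, which is the figure of merit for comparing candidate models.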