Dan's notes on ICASSP-2000, Istanbul

This is a quick list of the papers that I saw and/or liked at the recently completed ICASSP, along with some brief descriptions and links to the online papers (only available from within ICSI).

Session MMSP-P1

USING THE FISHER KERNEL METHOD FOR WEB AUDIO CLASSIFICATION

Pedro J. Moreno and Ryan Rifkin, Compaq Research Labs (IV-2417)

Basic task is to classify audio clips gathered from the web as speech/music/other. They used the AltaVista multimedia crawler logs to gather 13,000 web audio clips totaling 173h, which were labeled by hand according to speech, music, music type, background conditions etc. They point out that this is by far the largest set of its kind, and that it represents an unbiased snapshot of available web content. These labels were folded into a three-way (speech/music/other) tag for the whole clip. Their classification used the 'Fisher score', which maps variable-length data sets to a fixed-length feature vector: the partial derivatives of the overall sample log-likelihood with respect to each of the parameters of a generative model. In their case, they trained 68-component GMMs for each of the three classes, then used, for example, derivatives with respect to the pooled mixture priors as their fixed-size Fisher space. They then used some variants of Support Vector Machines to classify (discriminatively) based on these scores. They get 68% classification accuracy from a likelihood test on the base GMMs, which improves to 82% for their best SVM-based result.
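
To make the fixed-length mapping concrete, here is a minimal sketch of a Fisher score taken with respect to the mixture priors only. It assumes a single 1-D GMM with made-up parameters, ignores the sum-to-one constraint on the priors, and skips any Fisher-information normalization; the function names are mine, not the authors', and the paper's actual setup (68 components per class, pooled priors, SVM on top) is not reproduced.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fisher_score_wrt_priors(frames, priors, means, variances):
    """Gradient of sum_t log p(x_t) with respect to each mixture prior w_k.

    For p(x) = sum_k w_k N(x; mean_k, var_k), the derivative of log p(x)
    with respect to w_k is N(x; mean_k, var_k) / p(x).
    The result has one entry per component, whatever the clip length.
    """
    score = [0.0] * len(priors)
    for x in frames:
        comp = [gauss_pdf(x, m, v) for m, v in zip(means, variances)]
        total = sum(w * c for w, c in zip(priors, comp))
        for k, c in enumerate(comp):
            score[k] += c / total
    return score


# Hypothetical two-component model and two clips of different lengths.
priors, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
short_clip = [-1.2, -0.8, -1.0]
long_clip = [0.9, 1.1, 1.3, 0.7, 1.0, 1.2]

short_score = fisher_score_wrt_priors(short_clip, priors, means, variances)
long_score = fisher_score_wrt_priors(long_clip, priors, means, variances)
print(len(short_score), len(long_score))  # both equal the number of components
```

Both clips end up as vectors of the same size, which is what lets a fixed-input classifier such as an SVM be applied afterwards.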

CONTENT-BASED METHODS FOR THE MANAGEMENT OF DIGITAL MUSIC

D. Pye, AT&T Labs Cambridge (IV-2437)

For the problem of classifying music into genres, create signatures for each piece by building a decision-tree Vector Quantizer, designed to maximize the distinction of 6 genre classes, then using cosine distances between the histograms of leaf distributions (like Foote) to classify. Gets 90%, slightly worse than using full-up GMMs for each class, but much quicker. Also, a simplified version of MFCCs derived directly from the 32 subbands of MP3 (so-called MP3CEP) speeds up feature calculation by a factor of 6 (when starting with MP3 files) with a small accuracy penalty.
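
The histogram comparison step can be sketched as below. This is only an illustration of cosine similarity between leaf-count histograms with a nearest-template decision; the tree-structured quantizer itself is not shown, and the genre names and counts are invented.

```python
import math


def cosine_similarity(hist_a, hist_b):
    """Cosine of the angle between two histograms of VQ leaf counts."""
    dot = sum(a * b for a, b in zip(hist_a, hist_b))
    norm_a = math.sqrt(sum(a * a for a in hist_a))
    norm_b = math.sqrt(sum(b * b for b in hist_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty histogram matches nothing
    return dot / (norm_a * norm_b)


def classify(query_hist, genre_hists):
    """Return the genre whose template histogram is closest in angle."""
    return max(genre_hists, key=lambda g: cosine_similarity(query_hist, genre_hists[g]))


# Hypothetical 3-leaf histograms for two genres and one query piece.
templates = {"rock": [9, 2, 1], "jazz": [1, 2, 9]}
label = classify([8, 1, 1], templates)
print(label)
```

Because the cosine measure normalizes out the total count, pieces of different durations can be compared directly.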

AUDIO SCENE SEGMENTATION USING MULTIPLE FEATURES, MODELS AND TIME SCALES

H. Sundaram, S.-F. Chang, Columbia (IV-2441)

Trying to find 33 hand-located 'audio scene boundaries' in the first hour of Blade Runner. Complicated 'listener model' is intended to detect changes in the dominant sound sources. It fits parametric envelopes to individual feature dimensions within an 'attention span' of a few seconds; the modeled envelopes are then correlated against their histories within a somewhat longer 'memory'; scene changes are marked by a correlation that decays from its within-scene value. Detect 97% of boundaries for 10% false alarms on this small set. Base features are cepstra, cochlea model outputs, cepstral flux.

Session SP-P1

TIED POSTERIORS: AN APPROACH FOR EFFECTIVE INTRODUCTION OF CONTEXT DEPENDENCY IN HYBRID NN/HMM LVCSR

J. Rottland, G. Rigoll, Duisburg (III-1241)

They use a standard Quicknet-trained 1000HU MLP to get 16% WER on Wall Street Journal. They then use the net output posteriors as if they were the likelihoods from individual mixture components of a Gaussian mixture codebook, and use EM to estimate weights for these 'mixture components' for each state of various HMM topologies. This allows them to go up to 3 states per phone in the HMM, which reduces the WER to 11% using the same neural net. Within-word triphones get it down to a very respectable 9.4%. Interesting comparison to the Tandem Aurora results - maybe we're getting our gain simply from the different HMM topology, not from the GM modeling.
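
As I understand the weight estimation, it is the usual EM update for mixture weights with the component scores held fixed. A minimal sketch for a single HMM state follows, assuming the frames aligned to that state are already known; the function name, the toy posteriors and the iteration count are my own, and nothing here reproduces their alignment or training schedule.

```python
def em_mixture_weights(frame_scores, n_iter=50):
    """Estimate per-state mixture weights when the 'components' are fixed scores.

    frame_scores[t][k] is the net output for class k at frame t, used as if it
    were the likelihood of mixture component k. Only the weights are updated.
    """
    n_comp = len(frame_scores[0])
    weights = [1.0 / n_comp] * n_comp
    for _ in range(n_iter):
        acc = [0.0] * n_comp
        for scores in frame_scores:
            joint = [w * s for w, s in zip(weights, scores)]
            total = sum(joint)
            for k, j in enumerate(joint):
                acc[k] += j / total  # E-step: responsibility of component k
        weights = [a / len(frame_scores) for a in acc]  # M-step
    return weights


# Hypothetical net outputs for four frames aligned to one state.
state_frames = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.75, 0.15, 0.10],
]
state_weights = em_mixture_weights(state_frames)
print(state_weights)
```

The state's emission score for a new frame is then the weighted sum of the net outputs, so several states (or context-dependent units) can share the one network.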

HETEROGENEOUS LEXICAL UNITS FOR AUTOMATIC SPEECH RECOGNITION: PRELIMINARY INVESTIGATIONS

I. Bazzi, J. Glass, MIT (III-1257)

Recognition is cast as a two-stage process: first, a subword-unit graph is calculated, involving no dictionary knowledge; this is then filtered by the dictionary and grammar constraints to get the output words. They compare first stages generating a phone graph (61 phones) and a syllable graph, derived from the phones with a syllable 'lexicon' of 1624 plus a syllable trigram. Phone graphs have 50% more errors than one-stage word recognition. Syllable graphs reduce this to 30%, and full search on syllabic representations is only 10% worse than the standard baseline. Motivation for two-stage recognition is to have a domain-independent stage 1 (possibly on the client side of client-server), as well as to support OOV handling and truncated-word detection.

Session AE-P1

THE EFFECT OF POLARITY INVERSION OF SPEECH ON HUMAN PERCEPTION AND DATA HIDING AS AN APPLICATION

S. Sakaguchi, T. Arai, Y. Murahara, Sophia University (II-917)

Cute idea of flipping the polarity of whole syllables in a speech stream as a way of transparently encoding a few bits per second of side information. Trick is in reliably recovering syllable breaks and intended polarity. Perceptually undetectable.

MULTI-SOURCE LOCALIZATION IN REVERBERANT ENVIRONMENTS BY ROOT-MUSIC AND CLUSTERING

E. Di Claudio, R. Parisi, G. Orlandi, University of Rome (II-921)

Quality math to find best-fitting time differences for noisy multiple-microphone recordings. Could be useful for analyzing the meeting recorder data.

DENOISING OF HUMAN SPEECH USING COMBINED ACOUSTIC AND EM SENSOR SIGNAL PROCESSING

L. Ng, G. Burnett, J. Holzrichter, T. Gable, Lawrence Livermore National Laboratory (I-229)

Ways to use the radar sensor to help with acoustically noisy speech: GWIN (glottal windowing) uses the accurate glottal-pulse events from the radar detector to gate the acoustic waveform to help exclude noise; GCOR (glottal correlation) I think superimposes several adjacent pitch cycles of the acoustics, based on glottal closure alignment, to help reinforce the speech and cancel the noise. I think they showed results of combining both.

Session SP-L3

FACTORED SPARSE INVERSE COVARIANCE MATRICES

J. Bilmes, University of Washington (II-1009)

Jeff's great talk introduced me to the fact that the elements in the inverse of a Gaussian covariance matrix have interpretations: if two dimensions are conditionally independent given all the others, the corresponding off-diagonal element is zero. He shows that independence measures can be used to prune the inverse matrices, and hence speed up recognition, with little effect on recognition error.
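
A small numerical check of that interpretation, using a toy example of my own rather than anything from the paper: for a three-variable Gaussian Markov chain x1 -> x2 -> x3, x1 and x3 are correlated but conditionally independent given x2, so the (1,3) entry of the inverse covariance should vanish. The correlation value and the plain Gauss-Jordan inverse are just for illustration.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Unit-variance chain with correlation rho between neighbours (assumed value).
rho = 0.5
cov = [
    [1.0, rho, rho ** 2],
    [rho, 1.0, rho],
    [rho ** 2, rho, 1.0],
]
precision = invert(cov)

print(cov[0][2])        # non-zero: x1 and x3 are marginally correlated
print(precision[0][2])  # zero up to rounding: independent given x2
```

The covariance matrix is dense while its inverse is tridiagonal, which is the kind of sparsity the pruning exploits.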

Session SP-P4

STRATEGIES FOR AUTOMATIC SEGMENTATION OF AUDIO DATA

T. Kemp, M. Schmidt, M. Westphal, A. Waibel, University of Karlsruhe, Germany (III-1423)

Compares three segmentation strategies: model-based (i.e. decode regions according to pre-trained classes), metric-based (i.e. finding points of maximum difference between adjacent windows) and energy-based (simple threshold detection of gaps). Get best results from a combined algorithm: chop into pieces arbitrarily, cluster them into, say, 10 clusters, train GM models for each, then decode into those classes. Even though identity of classes is unknown, boundaries between different segments are successfully identified.
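
For reference, the metric-based idea can be sketched in a few lines. This toy version works on a 1-D feature track and uses the difference of window means as the distance, which is my simplification; real systems would typically compare statistical models of multidimensional features in the two windows, and the window length here is arbitrary.

```python
def boundary_scores(features, win):
    """Score each candidate boundary by the distance between adjacent windows.

    Returns (frame_index, score) pairs; the score is the absolute difference
    between the mean of the `win` frames before and after the candidate point.
    """
    scores = []
    for t in range(win, len(features) - win + 1):
        left = features[t - win:t]
        right = features[t:t + win]
        scores.append((t, abs(sum(left) / win - sum(right) / win)))
    return scores


# Hypothetical feature track with a change of acoustic condition at frame 10.
track = [0.0] * 10 + [5.0] * 10
scores = boundary_scores(track, win=4)
best_frame = max(scores, key=lambda s: s[1])[0]
print(best_frame)
```

Boundaries are then taken at local maxima of the score that exceed some threshold.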

A METHOD FOR DIRECT AUDIO SEARCH WITH APPLICATIONS TO INDEXING AND RETRIEVAL

S. Johnson, P. Woodland, University of Cambridge, United Kingdom (III-1427)

Quick technique based on covariance matrices of 6s chunks for finding segments in broadcast news that are exact repeats. Use this (in conjunction with heuristics) to remove commercials, while removing very little of 'proper' index-worthy speech. Used for TREC system.

CONVERSATIONAL SPEECH RECOGNITION USING ACOUSTIC AND ARTICULATORY INPUT

K. Kirchhoff, University of Washington, USA, G. Fink, G. Sagerer, University of Bielefeld, Germany (III-1435)

Katrin's articulatory features combined with standard MFCCs for VerbMobil. Test a variety of combination rules at several levels; product rule at state level is best, giving a 5% relative improvement over the MFCC baseline.
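
For concreteness, a product-rule combination of two streams at the state level amounts to multiplying the per-state scores and renormalizing (equivalently, adding log scores). The sketch below assumes both streams deliver posterior-like scores over the same state set; the numbers are invented and any prior or stream-weight terms the paper may use are left out.

```python
def product_rule(stream_a, stream_b):
    """Combine two per-state score vectors by elementwise product, renormalized."""
    joint = [a * b for a, b in zip(stream_a, stream_b)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical state scores from the MFCC and articulatory streams.
mfcc_scores = [0.6, 0.3, 0.1]
artic_scores = [0.5, 0.4, 0.1]
combined = product_rule(mfcc_scores, artic_scores)
print(combined)
```

A state has to score well in both streams to score well overall, so the product rule sharpens decisions on which the streams agree.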

Session AE-L1

A ROBUST PREDOMINANT-F0 ESTIMATION METHOD FOR REAL-TIME DETECTION OF MELODY AND BASS LINES IN CD RECORDINGS

M. Goto, Electrotechnical Laboratory, Japan (II-757)

Another astonishing system from Goto-san, here tracking a couple of melody lines from recorded music. Neatest part is the way he estimates a pdf for the fundamental pitches present from the full harmonic spectra by using EM to find a coherently explanatory set.

Session SP-P6

INTER-CLASS MLLR FOR SPEAKER ADAPTATION

S.-J. Doh, R. Stern, CMU (III-1543)

MLLR requires a certain amount of training data for each class to be adapted; when trying to estimate the transformation from very short segments (individual utterances), it may be helpful to 'fake' certain classes based on a previously-learned mapping to transform features from 'adjacent' classes to act like additional data from the target classes. The idea is that, even after transformation, the speaker-specific data will still be distorted away from the speaker-independent models in the same way, so it is relevant to adaptation. Using data transformed this way to bolster the MLLR adaptation set gave significant improvements.

EMPLOYING HETEROGENEOUS INFORMATION IN A MULTI-STREAM FRAMEWORK

H. Christensen, B. Lindberg, O. Andersen, Aalborg University (III-1571)

The idea here is that, given a 4-way multiband frequency decomposition, and given 3 kinds of feature extraction (j-rasta, plp, mfcc), what is the best configuration of feature extraction applied to bands? For instance, are different frequency regions better suited to different feature representations?

Session SPEC-L6 (Industrial DSP)

HIDDEN MARKOV MODELS FOR MONITORING MACHINING TOOL-WEAR

Les Atlas, Mari Ostendorf (U. Washington), Gary D. Bernard (Boeing)

Automatic detection of machine-tool wear would be valuable. Information comes from accelerometers mounted on the machines, but the signals are complex and hard to classify. This paper investigates taking ideas from speech recognition: accelerometer readings are modeled by HMMs, which show an ability to classify infrequent transient regions according to tool age, but this doesn't generalize. Other efforts to model tool aging using a 'confidence score' (normalized cross entropy) suffer from sparse training data.

ROBUST CLUSTERING OF ACOUSTIC EMISSION SIGNALS USING THE KOHONEN NETWORK

Vahid Emamian, Mostafa Kaveh, Ahmed H. Tewfik
U. Minnesota

Looking at tell-tale acoustic transients coming from cracks, e.g. in rotor blades. Approach is to take STFT of sensor data, project down to a low dimension (5) via PCA, then cluster the unlabeled data via a Kohonen net (with a 5x5 output array). Get reasonable detection of transients, but probably they were quite distinct in the PCA domain to begin with?

Papers still to look at:

SPEECH/MUSIC DISCRIMINATION FOR MULTIMEDIA APPLICATIONS

K. El-Maleh, M. Klein, G. Petrucci, P. Kabal, McGill (IV-2445)

MODULATION ENHANCEMENT OF SPEECH AS A PREPROCESSING FOR REVERBERANT CHAMBERS WITH THE HEARING-IMPAIRED

A. Kusumoto, T. Arai, T. Kitamura, M. Takahashi, Y. Murahara, Sophia University (II-853)

BINAURAL SOUND LOCALIZATION IN AN ARTIFICIAL NEURAL NETWORK

C. Schauer, T. Zahn, P. Paschke, H.-M. Gross, Ilmenau Technical University (II-865)

MULTIPLE FREQUENCY HARMONICS ANALYSIS AND SYNTHESIS OF AUDIO SIGNALS

A. Jbira, A. Kondoz, University of Surrey (II-873)

A 6KBPS TO 85KBPS SCALABLE AUDIO CODER

T. Verma, Helsinki University of Technology, Finland, T. Meng, Stanford University (II-877)

EVALUATION OF A WARPED LINEAR PREDICTIVE CODING SCHEME

A. Härmä, Helsinki University of Technology, Finland (II-897)

UNIFIED FRAME AND SEGMENT BASED MODELS FOR AUTOMATIC SPEECH RECOGNITION

H.-W. Hon, K. Wang, Microsoft Corporation (II-1017)

TOWARDS LANGUAGE INDEPENDENT ACOUSTIC MODELING

W. Byrne et al. (II-1029)

SPEECH RECONSTRUCTION FROM MEL FREQUENCY CEPSTRAL COEFFICIENTS AND PITCH FREQUENCY

D. Chazan, R. Hoory, G. Cohen, M. Zibulski, IBM Research, Israel (III-1299)

ON THE MUTUAL INFORMATION BETWEEN FREQUENCY BANDS IN SPEECH

M. Nilsson, S. Vang Andersen, W. Kleijn, Royal Institute of Technology, Sweden (III-1327)

They are interested in redundancy in the speech spectrum, for instance for resynthesizing the slope and gain of the 4-8kHz spectrum knowing only the 0-4kHz spectrum transmitted over a telephone link. Their approach is to vector-quantize the 0-4 kHz signal (based on an MFCC vector), then have a MMSE estimate of 4-8 kHz slope and gain for each codeword, then to measure the entropy of the error between the estimate and the true values. The mutual information between low band and high band slope and gain is lower-bounded by the reduction in entropy over the un-estimated slope, which gives at least 0.1 bit for slope and at least 0.45 bit for gain - which seem rather small. References Jeff's ICASSP-98 paper.
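
The lower bound follows from a standard argument, written out here in my own notation (X is the low-band codeword, Y the high-band slope or gain, and \hat{y}(X) the per-codeword MMSE estimate; none of these symbols are taken from the paper):

```latex
\begin{align}
I(X;Y) &= h(Y) - h(Y \mid X) \\
       &= h(Y) - h\bigl(Y - \hat{y}(X) \mid X\bigr) \\
       &\ge h(Y) - h\bigl(Y - \hat{y}(X)\bigr)
\end{align}
```

The second line holds because subtracting a function of X does not change the conditional differential entropy, and the third because conditioning cannot increase entropy. So the measured drop in entropy from the raw value to the estimation error can only underestimate the true mutual information.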

SPEECH/NON-SPEECH CLASSIFICATION USING MULTIPLE FEATURES FOR ROBUST ENDPOINT DETECTION

W.-H. Shin, B.-S. Lee, Y.-K. Lee, J.-S. Lee, LG Corporate Institute of Technology (III-1399)

SPEECH RECOGNITION FOR A DISTANT MOVING SPEAKER BASED ON HMM COMPOSITION AND SEPARATION

T. Takiguchi, S. Nakamura, K. Shikano, IBM Tokyo, Nara Institute of Science and Technology (III-1403)

LOCALIZATION OF MULTIPLE SOUND SOURCES BASED ON A CSP ANALYSIS WITH A MICROPHONE ARRAY

T. Nishiura, T. Yamada, S. Nakamura, K. Shikano, Nara Institute of Science and Technology (II-1053)

MULTIVARIATE-STATE HIDDEN MARKOV MODELS FOR SIMULTANEOUS TRANSCRIPTION OF PHONES AND FORMANTS

M. Hasegawa-Johnson, University of Illinois at Urbana-Champaign (III-1323)

MUSIC SUMMARIZATION USING KEY PHRASES

B. Logan, Compaq Computer Corporation, USA, S. Chu, University of Illinois at Urbana-Champaign (II-749)

MUSICAL INSTRUMENT RECOGNITION USING CEPSTRAL COEFFICIENTS AND TEMPORAL FEATURES

A. Eronen, A. Klapuri, Tampere University of Technology, Finland (II-753)