DAn's notes on ICASSP-2000, Istanbul
This is a quick list of the papers that I saw and/or liked
at the recently completed ICASSP, along with some brief descriptions and
links to the online papers (only available from within ICSI).
Session MMSP-P1
Pedro J. Moreno and Ryan Rifkin, Compaq Research Labs (IV-2417)
Basic task is to classify audio clips gathered from the web as speech/music/other.
They used the AltaVista multimedia crawler logs to gather 13,000 web audio
clips totaling 173h, which were labeled by hand according to speech, music,
music type, background conditions etc. They point out that this is
by far the largest set of its kind, and that it represents an unbiased
snapshot of available web content. These were folded into a three-way
(speech/music/other) tag for the whole clip. Their classification
used the 'Fisher score', which maps variable-length data sets to a fixed-length
feature vector: the partial derivative of the overall sample
log-likelihood with respect to each of the parameters of a generative model.
In their case, they trained 68-component GMMs for each of the three classes,
then used, for example, derivatives with respect to the pooled mixture
priors as their fixed-size Fisher space. They then used some variants
of Support Vector Machines to classify (discriminatively) based on these
scores. They get 68% classification accuracy from a likelihood test
on the base GMMs, which improves to 82% for their best SVM-based result.
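A minimal sketch of the Fisher-score idea, to show how clips of any length end up as a fixed-size vector. It assumes a toy 1-D, 2-component GMM with made-up parameters (the paper's models and features are not reproduced here), takes derivatives with respect to the mixture priors only, and ignores the sum-to-one constraint on the priors.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fisher_score_priors(frames, priors, means, variances):
    """Gradient of the clip log-likelihood w.r.t. each mixture prior.

    d/dw_k sum_t log sum_j w_j N(x_t; m_j, v_j) = sum_t N(x_t; m_k, v_k) / p(x_t)
    The result has len(priors) entries however many frames the clip has.
    """
    score = [0.0] * len(priors)
    for x in frames:
        comp = [gaussian_pdf(x, m, v) for m, v in zip(means, variances)]
        total = sum(w * c for w, c in zip(priors, comp))
        for k, c in enumerate(comp):
            score[k] += c / total
    return score

# Hypothetical model and two clips of different lengths (illustrative values only).
priors, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
short_clip = [-1.2, -0.8, -1.0]
long_clip = [0.9, 1.1, 1.3, 0.7, 1.0, 1.2]
score_short = fisher_score_priors(short_clip, priors, means, variances)
score_long = fisher_score_priors(long_clip, priors, means, variances)
```

Both clips map to a 2-element vector, which is what lets a fixed-input classifier such as an SVM work on top of it.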
D. Pye, AT&T Labs Cambridge (IV-2437)
For the problem of classifying music into genres, create signatures
for each piece by building a decision-tree Vector Quantizer, designed to
maximize the distinction of 6 genre classes, then using cosine distances
between the histograms of leaf distributions (like Foote) to classify.
Gets 90%, slightly worse than using full-up GMMs for each class, but much
quicker. Also, a simplified version of MFCCs derived directly from
the 32 subbands of MP3 (so-called MP3CEP) speeds up feature calculation
by a factor of 6 (when starting with MP3 files) with a small accuracy penalty.
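The histogram comparison step can be sketched as below. This only covers the cosine measure between leaf-count histograms; the tree-structured quantizer itself is not shown, and the genre names and counts are invented for illustration.

```python
import math

def cosine_similarity(hist_a, hist_b):
    """Cosine of the angle between two leaf-count histograms."""
    dot = sum(a * b for a, b in zip(hist_a, hist_b))
    norm_a = math.sqrt(sum(a * a for a in hist_a))
    norm_b = math.sqrt(sum(b * b for b in hist_b))
    return dot / (norm_a * norm_b)

def classify(piece_hist, genre_hists):
    """Pick the genre whose reference histogram is closest in angle."""
    return max(genre_hists, key=lambda g: cosine_similarity(piece_hist, genre_hists[g]))

# Hypothetical reference histograms over 4 quantizer leaves.
genre_hists = {"jazz": [30, 5, 2, 1], "techno": [2, 4, 25, 20]}
piece_hist = [12, 3, 1, 0]
label = classify(piece_hist, genre_hists)
```

Because the cosine ignores overall scale, a short piece and a long piece with the same leaf proportions look identical, which is convenient for clips of different durations.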
H. Sundaram, S.-F. Chang, Columbia (IV-2441)
Trying to find 33 hand-located 'audio scene boundaries' in the first
hour of Blade Runner. Complicated 'listener model' is intended to
detect changes in the dominant sound sources. It fits parametric
envelopes to individual feature dimensions within an 'attention span' of
a few seconds; the modeled envelopes are then correlated against their
histories within a somewhat longer 'memory'; scene changes are marked by
a correlation that decays from its within-scene value. Detect 97% of boundaries
for 10% false alarms on this small set. Base features are cepstra,
cochlea model outputs, cepstral flux.
Session SP-P1
J. Rottland, G. Rigoll, Duisburg (III-1241)
They use a standard Quicknet-trained 1000HU MLP to get 16% WER on Wall
Street Journal. They then use the net output posteriors as if they
were the likelihoods from individual mixture components of a Gaussian mixture
codebook, and use EM to estimate weights for these 'mixture components'
for each state of various HMM topologies. This allows them to go
up to 3 states per phone in the HMM, which reduces the WER to 11% using
the same neural net. Within-word triphones get it down to a very
respectable 9.4%. Interesting comparison to the Tandem Aurora results
- maybe we're getting our gain simply from the different HMM topology,
not from the GM modeling.
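A rough sketch of the weight re-estimation as I understand it: the net outputs are held fixed and treated as component scores, and EM only updates the per-state mixture weights. The frame values and the function name are made up, and the alignment of frames to an HMM state is assumed to be given.

```python
def em_mixture_weights(component_scores, n_iter=50):
    """EM for mixture weights when the component scores are fixed.

    component_scores: one list per frame, each holding K positive scores
    (here standing in for the net output posteriors of frames assigned to
    a single HMM state).
    """
    n_comp = len(component_scores[0])
    weights = [1.0 / n_comp] * n_comp
    for _ in range(n_iter):
        counts = [0.0] * n_comp
        for scores in component_scores:
            # E-step: responsibility of each component for this frame.
            joint = [w * s for w, s in zip(weights, scores)]
            total = sum(joint)
            for k in range(n_comp):
                counts[k] += joint[k] / total
        # M-step: new weights are the average responsibilities.
        weights = [c / len(component_scores) for c in counts]
    return weights

# Hypothetical posteriors for four frames aligned to one state.
state_frames = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]
weights = em_mixture_weights(state_frames)
```

In a full system this update would sit inside Baum-Welch, with soft state occupancies weighting each frame; the hard assignment here is only to keep the sketch short.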
I. Bazzi, J. Glass, MIT (III-1257)
Recognition is cast as a two-stage process: first, a subword-unit graph
is calculated, involving no dictionary knowledge; this is then filtered
by the dictionary and grammar constraints to get the output words.
They compare first stages generating a phone graph (61 phones) and a syllable
graph, derived from the phones with a syllable 'lexicon' of 1624 plus a
syllable trigram. Phone graphs have 50% more errors than one-stage
word recognition. Syllable graphs reduce this to 30%, and full search
on syllabic representations is only 10% worse than the standard baseline.
Motivation for two-stage recognition is to have a domain-independent stage
1 (possibly on the client side of client-server), as well as to support
OOV handling and truncated-word detection.
Session AE-P1
S. Sakaguchi, T. Arai, Y. Murahara, Sophia University (II-917)
Cute idea of flipping the polarity of whole syllables in a speech stream
as a way of transparently encoding a few bits per second of side information.
Trick is in reliably recovering syllable breaks and intended polarity.
Perceptually undetectable.
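A toy sketch of the embedding idea only. It assumes the syllable boundaries are already known and, as a stand-in polarity cue, uses the sign of each segment's third moment on a synthetic asymmetric pulse train; how the paper actually recovers boundaries and polarity is not described in these notes.

```python
def embed_bits(samples, boundaries, bits):
    """Flip the sign of every segment whose bit is 1."""
    out = list(samples)
    for (start, end), bit in zip(boundaries, bits):
        if bit:
            for n in range(start, end):
                out[n] = -out[n]
    return out

def recover_bits(samples, boundaries):
    """Read one bit per segment: 1 if the segment looks inverted.

    Assumption: the un-flipped signal has a positive third moment.
    """
    bits = []
    for start, end in boundaries:
        seg = samples[start:end]
        mean = sum(seg) / len(seg)
        third = sum((v - mean) ** 3 for v in seg)
        bits.append(1 if third < 0 else 0)
    return bits

# Synthetic asymmetric "voiced" signal: one sharp positive pulse per period.
period = [1.0, -0.2, -0.2, -0.2, -0.2, -0.2]
signal = period * 30
boundaries = [(0, 60), (60, 120), (120, 180)]  # hypothetical syllable spans
payload = [1, 0, 1]
marked = embed_bits(signal, boundaries, payload)
decoded = recover_bits(marked, boundaries)
```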
E. Di Claudio, R. Parisi, G. Orlandi, University of Rome (II-921)
Quality math to find best-fitting time differences for noisy multiple-microphone
recordings. Could be useful for analyzing the meeting recorder data.
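For reference, the plain baseline for this kind of problem is to pick the peak of the cross-correlation between two channels. The sketch below is that baseline, not the paper's estimator, and the two-channel test signal is invented.

```python
def estimate_delay(ref, other, max_lag):
    """Lag in samples by which `other` trails `ref`, from the cross-correlation peak."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(other):
                val += r * other[m]
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

# Hypothetical burst padded with silence, and a copy delayed by 3 samples.
burst = [0.3, -1.0, 0.8, 0.5, -0.7, 1.2, -0.4, 0.1]
ref = [0.0] * 8 + burst + [0.0] * 8
other = [0.0] * 3 + ref[:-3]
delay = estimate_delay(ref, other, max_lag=6)
```

With several microphones, pairwise delays like this one would then have to be reconciled into a consistent set, which is presumably where the harder math comes in.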
L. Ng, G. Burnett, J. Holzrichter, T. Gable, Lawrence Livermore National
Laboratory (I-229)
Ways to use the radar sensor to help with acoustically noisy speech:
GWIN (glottal windowing) uses the accurate glottal-pulse events from the
radar detector to gate the acoustic waveform to help exclude noise; GCOR
(glottal correlation) I think superimposes several adjacent pitch cycles
of the acoustics, based on glottal closure alignment, to help reinforce
the speech and cancel the noise. I think they showed results of combining
both.
Session SP-L3
J. Bilmes, University of Washington (II-1009)
Jeff's great talk introduced me to the fact that the elements in the
inverse of a Gaussian covariance matrix have interpretations: if two
dimensions are conditionally independent given all the others, the
corresponding element is zero. Shows
that independence measures can be used to prune the inverse matrices, and
hence speed up recognition, with little effect on recognition error.
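A small worked example of the inverse-covariance property, using a made-up 3-variable chain x1 -> x2 -> x3 (each step adds unit-variance noise). x1 and x3 are correlated, but independent once x2 is known, so the (1,3) element of the inverse covariance comes out exactly zero. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix, in exact arithmetic."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Covariance of x1 ~ N(0,1), x2 = x1 + e2, x3 = x2 + e3 (unit-variance noise).
covariance = [[1, 1, 1],
              [1, 2, 2],
              [1, 2, 3]]
precision = invert(covariance)
```

Here `covariance[0][2]` is 1 while `precision[0][2]` is 0, so zeros in the inverse matrix mark pairs whose product terms can be skipped when evaluating the Gaussian.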
Session SP-P4
T. Kemp, M. Schmidt, M. Westphal, A. Waibel, University of Karlsruhe, Germany
(III-1423)
Compare model-based (i.e. decode regions according to pre-trained classes),
metric-based (i.e. finding points of maximum difference between adjacent
windows) and energy-based (simple threshold detection of gaps) segmentation
strategies. Get best results from a combined algorithm:
chop into pieces arbitrarily, cluster them into, say, 10 clusters, train
GM models for each, then decode into those classes. Even though identity
of classes is unknown, boundaries between different segments are successfully
identified.
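A much-simplified sketch of the chop/cluster/decode recipe. It assumes a 1-D feature per frame, and swaps in k-means on chunk means plus nearest-centroid labeling for the GM training and decoding steps; the data, chunk length and cluster count are invented.

```python
def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))

def kmeans_1d(values, k, n_iter=20):
    """Tiny k-means with a deterministic, spread-out initialisation."""
    ordered = sorted(values)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        labels = [nearest(v, centroids) for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return [nearest(v, centroids) for v in values]

def segment_boundaries(frames, chunk_len, k):
    """Chop arbitrarily, cluster the chunks, report where the cluster label changes."""
    chunks = [frames[i:i + chunk_len] for i in range(0, len(frames), chunk_len)]
    chunk_means = [sum(c) / len(c) for c in chunks]
    labels = kmeans_1d(chunk_means, k)
    return [i * chunk_len for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

# Hypothetical feature track: quiet, loud, quiet again.
frames = [0.0] * 40 + [5.0] * 60 + [0.2] * 40
boundaries = segment_boundaries(frames, chunk_len=10, k=2)
```

As in the paper's observation, nothing here knows what the clusters mean; the boundaries fall out purely from label changes. Resolution is limited to the chunk length in this version.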
S. Johnson, P. Woodland, University of Cambridge, United Kingdom (III-1427)
Quick technique based on covariance matrices of 6s chunks for finding
segments in broadcast news that are exact repeats. Use this (in conjunction
with heuristics) to remove commercials, while removing very little of 'proper'
index-worthy speech. Used for TREC system.
K. Kirchhoff, University of Washington, USA, G. Fink, G. Sagerer, University
of Bielefeld, Germany (III-1435)
Katrin's articulatory features combined with standard MFCCs for VerbMobil.
Test a variety of combination rules at several levels; product rule at
state level is best, giving a 5% relative improvement over the MFCC baseline.
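The product rule itself is simple enough to sketch: multiply the per-state probabilities from each stream (i.e. add the logs) and renormalise. The numbers below are made up, and any stream weighting or prior division used in the actual system is left out.

```python
import math

def product_rule(stream_probs):
    """Combine per-state probabilities from several streams.

    stream_probs: one list of state probabilities per feature stream.
    Logs are summed for numerical safety, then renormalised over states.
    """
    n_states = len(stream_probs[0])
    log_scores = [sum(math.log(p[s]) for p in stream_probs) for s in range(n_states)]
    peak = max(log_scores)
    unnorm = [math.exp(v - peak) for v in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical state probabilities from the two streams for one frame.
mfcc_probs = [0.5, 0.3, 0.2]
artic_probs = [0.4, 0.5, 0.1]
combined = product_rule([mfcc_probs, artic_probs])
```

A state has to score reasonably in both streams to survive the product, which is the main difference from a sum (averaging) rule.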
Session AE-L1
M. Goto, Electrotechnical Laboratory, Japan (II-757)
Another astonishing system from Goto-san, here tracking a couple of
melody lines from recorded music. Neatest part is the way he estimates
a pdf for the fundamental pitches present from the full harmonic spectra
by using EM to find a coherently explanatory set.
Session SP-P6
S.-J. Doh, R. Stern, CMU (III-1543)
MLLR requires a certain amount of training data for each class to be
adapted; when trying to estimate the transformation from very short segments
(individual utterances), it may be helpful to 'fake' certain classes based
on a previously-learned mapping to transform features from 'adjacent' classes
to act like additional data from the target classes. The idea is
that, even after transformation, the speaker-specific data will still be
distorted away from the speaker-independent models in the same way, so
it is relevant to adaptation. Using data transformed this way to
bolster the MLLR adaptation set gave significant improvements.
H. Christensen, B. Lindberg, O. Andersen, Aalborg University (III-1571)
The idea here is that, given a 4-way multiband frequency decomposition,
and given 3 kinds of feature extraction (j-rasta, plp, mfcc), what is the
best configuration of feature extraction applied to bands? For instance,
are different frequency regions better suited to different feature representations?
Session SPEC-L6 (Industrial DSP)
Les Atlas, Mari Ostendorf (U. Washington) and Gary D. Bernard (Boeing)
Automatic detection of machine-tool wear would be valuable.
Information comes from accelerometers mounted on the machines. But
the signals are complex and hard to classify. This paper investigates
taking ideas from speech recognition: accelerometer readings are modeled
by HMMs that show an ability to classify infrequent transient regions
according to tool age, but this doesn't generalize. Other efforts to model
tool aging using a 'confidence score' (normalized cross entropy) suffer
from sparse training data.
Vahid Emamian, Mostafa Kaveh, Ahmed H. Tewfik
U. Minnesota
Looking at tell-tale acoustic transients coming from cracks, e.g. in
rotor blades. Approach is to take STFT of sensor data, project down to
a low dimension (5) via PCA, then cluster the unlabeled data via
a Kohonen net (with a 5x5 output array). Get reasonable detection of
transients, but probably they were quite distinct in the PCA domain to
begin with?
Papers still to look at:
SPEECH/MUSIC DISCRIMINATION
FOR MULTIMEDIA APPLICATIONS
K. El-Maleh, M. Klein, G. Petrucci, P. Kabal, McGill (IV-2445)
MODULATION ENHANCEMENT OF SPEECH
AS A PREPROCESSING FOR REVERBERANT CHAMBERS WITH THE HEARING-IMPAIRED
A. Kusumoto, T. Arai, T. Kitamura, M. Takahashi, Y. Murahara, Sophia University
(II-853)
BINAURAL SOUND LOCALIZATION IN
AN ARTIFICIAL NEURAL NETWORK
C. Schauer, T. Zahn, P. Paschke, H.-M. Gross, Ilmenau Technical University
(II-865)
MULTIPLE FREQUENCY HARMONICS ANALYSIS
AND SYNTHESIS OF AUDIO SIGNALS
A. Jbira, A. Kondoz, University of Surrey (II-873)
A 6KBPS TO 85KBPS SCALABLE AUDIO
CODER
T. Verma, Helsinki University of Technology, Finland, T. Meng, Stanford
University (II-877)
EVALUATION OF A WARPED LINEAR
PREDICTIVE CODING SCHEME
A. Härmä, Helsinki University of Technology, Finland (II-897)
UNIFIED FRAME AND SEGMENT BASED
MODELS FOR AUTOMATIC SPEECH RECOGNITION
H.-W. Hon, K. Wang, Microsoft Corporation (II-1017)
TOWARDS LANGUAGE INDEPENDENT ACOUSTIC
MODELING
W. Byrne et al. (II-1029)
SPEECH RECONSTRUCTION FROM MEL FREQUENCY
CEPSTRAL COEFFICIENTS AND PITCH FREQUENCY
D. Chazan, R. Hoory, G. Cohen, M. Zibulski, IBM Research, Israel (III-1299)
ON THE MUTUAL INFORMATION BETWEEN
FREQUENCY BANDS IN SPEECH
M. Nilsson, S. Vang Andersen, W. Kleijn, Royal Institute of Technology,
Sweden (III-1327)
They are interested in redundancy in the speech spectrum, for instance
for resynthesizing the slope and gain of the 4-8kHz spectrum knowing only
the 0-4kHz spectrum transmitted over a telephone link. Their approach
is to vector-quantize the 0-4 kHz signal (based on an MFCC vector), then
form an MMSE estimate of the 4-8 kHz slope and gain for each codeword, then
to measure the entropy of the error between the estimate and the true values.
The mutual information between low band and high band slope and gain is
lower-bounded by the reduction in entropy over the un-estimated slope,
which gives at least 0.1 bit for slope and at least 0.45 bit for gain -
which seem rather small. References Jeff's ICASSP-98 paper.
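The reasoning behind the lower bound can be written out as follows. The symbols are mine, not the paper's: X is the low-band codeword, Y the high-band slope or gain, and \hat{Y}(X) the per-codeword MMSE estimate. Subtracting a function of X does not change the entropy once X is given, and dropping the conditioning can only raise it.

```latex
\begin{align*}
I(X;Y) &= h(Y) - h(Y \mid X) \\
       &= h(Y) - h\bigl(Y - \hat{Y}(X) \mid X\bigr) \\
       &\ge h(Y) - h\bigl(Y - \hat{Y}(X)\bigr)
\end{align*}
```

So the measured drop from the entropy of the raw slope or gain to the entropy of the estimation error is a floor on the mutual information, and a better estimator can only tighten it.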
SPEECH/NON-SPEECH CLASSIFICATION
USING MULTIPLE FEATURES FOR ROBUST ENDPOINT DETECTION
W.-H. Shin, B.-S. Lee, Y.-K. Lee, J.-S. Lee, LG Corporate Institute of
Technology (III-1399)
SPEECH RECOGNITION FOR A DISTANT
MOVING SPEAKER BASED ON HMM COMPOSITION AND SEPARATION
T. Takiguchi, S. Nakamura, K. Shikano, IBM Tokyo / Nara Institute of Science
and Technology (III-1403)
LOCALIZATION OF MULTIPLE SOUND
SOURCES BASED ON A CSP ANALYSIS WITH A MICROPHONE ARRAY
T. Nishiura, T. Yamada, S. Nakamura, K. Shikano, Nara Institute of Science
and Technology (II-1053)
MULTIVARIATE-STATE HIDDEN MARKOV
MODELS FOR SIMULTANEOUS TRANSCRIPTION OF PHONES AND FORMANTS
M. Hasegawa-Johnson, University of Illinois at Urbana-Champaign (III-1323)
MUSIC SUMMARIZATION USING KEY PHRASES
B. Logan, Compaq Computer Corporation, USA, S. Chu, University of Illinois
at Urbana-Champaign (II-749)
MUSICAL INSTRUMENT RECOGNITION
USING CEPSTRAL COEFFICIENTS AND TEMPORAL FEATURES
A. Eronen, A. Klapuri, Tampere University of Technology, Finland (II-753)
SOUND ANALYSIS USING MPEG COMPRESSED
AUDIO
G. Tzanetakis, P. Cook, Princeton University, USA (II-761)
SEPARATION OF HARMONIC SOUND SOURCES
USING SINUSOIDAL MODELING
T. Virtanen, A. Klapuri, Tampere University of Technology, Finland (II-765)
MODEL-BASED SOUND SYNTHESIS OF
TANBUR, A TURKISH LONG-NECKED LUTE
C. Erkut, V. Valimaki, Helsinki University of Technology, Finland (II-769)
ACOUSTIC SOUND FROM THE ELECTRIC
GUITAR USING DSP TECHNIQUES
M. Karjalainen, H. Penttinen, V. Valimaki, Helsinki University of Technology,
Finland (II-773)
A NEW DISTANCE MEASURE FOR PROBABILITY
DISTRIBUTION FUNCTION OF MIXTURE TYPE
Z. Liu, Q. Huang, AT&T Labs (I-616)
- QBC audio similarity is used as an example
INTEGRATION OF SPEECH AND VISION
USING MUTUAL INFORMATION
D. Roy, MIT (IV-2369)
COMPARATIVE ANALYSIS OF HIDDEN
MARKOV MODELS FOR MULTI-MODAL DIALOGUE SCENE INDEXING
A. Alatan, A. Akansu, W. Wolf, NJIT/Princeton (IV-2401)
dpwe@icsi.berkeley.edu 2000-06-25