The speech signal is not uniform across time, but may be characterized as consisting of alternations between steady-state and transition regions - very roughly, vowels and consonants. Earlier work in our group (the SPAM, or stochastic perceptual auditory event model [spam94, spam95, spam96]) has focussed exclusively on the transition regions; work by FPMs has tried to model only the steady-state portions, which we may call `antispam' [antispam]. We have recently submitted a proposal to combine these approaches which we described as "Spam & Antispam" [NSAprop98].
Over the past two weeks I have invested a burst of energy in investigating an approach of this kind. The work performed so far uses a model of acoustics simpler than the segmental model of Spam, but otherwise it shares the same basic idea of separate, independent classifiers for steady-state and transition regions. In particular, I used our conventional phoneme-based processing as my expert on steady-state regions, then trained a second classifier to recognize the transitions between the phoneme states, a unit that has been called a `transeme' [ref].
Although I have not yet been able to show any recognition improvement resulting from this approach, it seems very likely that the thread is worth pursuing. This report describes the motivation and the work performed so far to act as a foundation for subsequent work on this project.
As observed above, the speech spectrogram can be divided into alternating periods of steady spectra (e.g. vowel nuclei and fricatives) and regions of abrupt spectral transition (associated with rapid articulator changes such as stop releases and nasal closures). In between these extremes is a third class of more gradually varying spectra typical of diphthongs and semivowels. The typical spectrogram below has been annotated to illustrate these regions:
The basic formulation of hidden Markov model speech recognition involves a frame-by-frame probabilistic association between acoustic features and phonetic labels. Although the time-support of specific features may vary, the job of the classifier (be it a neural network or a Gaussian model) is to estimate the likelihood that a particular narrow segment of the signal is an instance of each of the previously-specified phone models. This will be most straightforward if a particular phone has a regular, constant spectral shape - a crude but justifiable description of vowels. On the other hand, if a phone is intrinsically defined by the spectral dynamics that accompany it, steady-state spectral templates are an inefficient and cumbersome modeling domain.
Current systems almost universally mitigate this problem by using delta (difference) and double-delta features in addition to the undifferenced values, thereby converting a dynamic trend over a broader temporal context into a constant value that may be matched by a single frame's phone model. But it is natural to ask whether it would be possible to design a different type of model for these transition regions, one that neither expects nor relies upon features that are static at every frame, but that directly models the variation over a short timescale that characterizes certain speech events. There are some properties of neural net classifiers that make them peculiarly appropriate for this task: our network classifiers use a temporal context window of nine adjacent frames rather than looking at just one; Gaussian models are expensive to use with such highly-correlated feature dimensions, but the net accommodates this covariance without complaint. If the characteristic dynamics of a speech signal event can be contained within this temporal window, we might hope that the net will be able to learn to recognize them directly and effectively.
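As a minimal sketch of the delta idea described above (not the feature code used in these experiments), the Python below appends first and second differences to each frame. The symmetric two-point difference and the replication of edge frames are assumptions; this report does not state which delta formula was used, and the function names are illustrative only.

```python
def deltas(frames):
    """Symmetric first difference of a list of feature vectors.

    Assumption: delta[t] = (x[t+1] - x[t-1]) / 2, with the first and
    last frames replicated at the edges.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def with_deltas(frames):
    """Append delta and double-delta values to every frame."""
    d = deltas(frames)
    dd = deltas(d)
    return [x + dx + ddx for x, dx, ddx in zip(frames, d, dd)]


# A steadily rising one-dimensional feature: a dynamic trend.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
augmented = with_deltas(ramp)
```

Away from the edges, the rising trend shows up as a constant delta and a zero double-delta, which is exactly the conversion of a dynamic trend into a static per-frame value.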
The vision then is to have two complementary acoustic models, one specializing in the classification of steady-state spectra, the other oriented specifically toward episodes of rapid spectral change (our previous categorization nominated a third class of more continuous spectral change, and one could perhaps argue that a third channel might also be beneficial, but for now we will assume this case can be covered by one or both of the other classifiers). Once we have separate classification stages for these different signal conditions, we have the opportunity to use different signal processing and features best adapted to each case, rather than the single, compromise feature set generally employed. So for identifying transition regions it may be beneficial to exaggerate the RASTA-style enhancement of spectral changes used in our current feature processing; by contrast, classification of steady-state spectra may be best served by processing that minimizes the influence of a transition on the following vowel, and may even work better with a context window narrower than our normal 9 time frames.
Apart from the opportunity for different feature processing, there is a more theoretical motivation for keeping these two classifiers separate, which is that the two kinds of classes they are handling are not entirely mutually exclusive. When making a judgment for a particular frame, it may be desirable to classify it both as part of a transition region and as belonging to a particular steady-state vowel class, rather than trying to decide between these two choices. If a single classifier were used for both classes, these labels would be set in opposition (particularly for the discriminatively-trained neural networks, and particularly for the softmax output units we use); with two independent classifiers, such dual-class labelling is the rule. (Garbage classes in each path permit a frame to be marked as `not from this set' if a single label is in fact the most appropriate.)
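The opposition imposed by softmax outputs can be illustrated with a small numerical sketch. The class names and activation values below are invented for illustration and are not taken from any trained net; the point is only that a single softmax must share unit probability mass among all its classes, whereas two separate softmax layers do not compete with each other.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of activations."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One classifier covering both kinds of class (hypothetical classes:
# a steady vowel, a transition, and everything else). The frame looks
# equally like the vowel and the transition.
joint = softmax([4.0, 4.0, 0.0])

# Two independent classifiers, each with its own garbage class.
steady = softmax([4.0, 0.0])      # [vowel, garbage]
transition = softmax([4.0, 0.0])  # [transition, garbage]
```

With the single classifier neither label can exceed 0.5, even though both are well supported; with two streams both labels can be asserted with high probability at the same frame.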
While having several classifier paths has clear attractions, it also raises the issue of combining results into a single final answer. This is something in which we have lately become increasingly interested at ICSI [wu ref], so a number of options are available. In the experiments described below, two approaches were tried - firstly, using both sets of posterior probability estimates as inputs to a single hidden Markov decoder, where the word models alternated between steady and transition classes, and secondly an HMM combination approach that used a single, complex model where each state corresponded to a pair of classifications, one for each stream. Other approaches include more complex variants and combinations of these structures, as well as more independent stream decoding via the `two level decoding' we are currently pursuing [fosler ref].
One final point to mention here is the motivation to improve acoustic confidence measures. Recent work [gethin] has proposed a number of confidence measures at the scale of words or finer, based on the actual acoustic scores corresponding to best-path labellings. In a single-stream, mutually-exclusive classification there will always be transition regions where no single class is dominant, and hence even the best acoustic score may not be very high. This limits the best score available from acoustic confidence measures, and impairs their ability to distinguish between low acoustic scores at segment boundaries (which are expected) and elsewhere in the segment (which indicate genuine cause for reduced confidence). By having two classifiers whose transitions are staggered such that one stream is in `segment center' when the other is in transition, one can imagine a best-path acoustic confidence score that avoids the lower values at segment boundaries altogether, giving a more sensitive indication of the net acoustic-model-match confidence.
With the goal of training a classifier for transition regions, the first task was to produce labelled training data for this class. The problem is that defining the transition between each pair of phones as a separate class squares the number of classes from 56 to several thousand. This would greatly increase the number of classifier parameters, which might tax the neural network implementation, and would in any case require much larger amounts of training data. Instead, the phones on each side of the transition are clustered into a smaller number of broad classes, and transition units are defined in terms of these less numerous types. Work at FPMs on using transition regions in addition to normal phone units found that transition classes based on the place of articulation (labial, dental etc.) gave better results than either `manner' classes (stop, fricative etc.) or even a data-driven clustering they tried. I therefore adopted their nine classes, giving a new representation with 82 classes, namely the 81 transitions between each of the nine place-of-articulation classes, plus one more `garbage' class to label the speech in between transition regions (regions of non-transition silence were labelled with the silence-silence `transition', which has no other application). The mapping between the ICSI56 phoneset and the nine classes is shown below. Each place-of-articulation class is represented by a one- or two-letter, lower-case code, but later I defined another set of one-letter mixed-case codes so that I could represent the transition between two classes with just two letters. The lower-case-only codes came in useful later, though, when I wanted to use case to distinguish the two halves of a transeme, as will be explained.
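The class-count arithmetic in this paragraph can be checked with a short Python sketch. The nine one-letter codes used here are placeholders with no phonetic meaning (they are not the codes from the mapping table); only the counts 56, 9 and 82 and the garbage label "o" come from this report.

```python
from itertools import product

N_PHONES = 56                      # size of the ICSI56 phoneset
PLACE_CODES = list("ABCDEFGHI")    # placeholder codes for the nine classes
GARBAGE = "o"                      # label for speech between transitions

# Every ordered phone pair as its own class would be far too many.
n_phone_pair_classes = N_PHONES ** 2

# Broad-class transitions: one two-letter label per ordered pair of
# classes (a class paired with itself included), plus the garbage class.
transition_classes = [a + b for a, b in product(PLACE_CODES, repeat=2)]
transition_classes.append(GARBAGE)
```

Here `n_phone_pair_classes` is 3136, against 82 entries in `transition_classes`.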
Having defined the 82 transition classes, the next stage was to generate some target labels in these terms to allow a classifier to be trained. Starting from the hand labels for the NUMBERS data, each boundary between two phones became the center of a transition unit whose identity was determined by the place-of-articulation classes of the earlier and later phones. In order to focus on the transitions, the width of the transition unit was limited; first I tried 50ms (for transition regions up to 100ms wide), but later I reduced it to 20ms (for 40ms transition regions) so as to avoid labelling frames well into the steady-state region of certain phones as still `transition'. The tcl routines in ~dpwe/projects/spamnotspam/src/phn2antiphn.tcl perform this operation directly on TIMIT label files, writing new TIMIT-style label files, expressed in terms of the new classes, as output. All the `knowledge' about the transition classes and the mapping of ICSI56 is hard-coded into that script file. The figure below shows a comparison between the original phone-centered labellings and the broad-class transition labels. Each transition segment has a two-character label composed of the one-character codes for the left and right context classes, according to the table above.
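The relabelling step can be sketched in Python as follows. This is an illustration of the procedure as described, not a transcription of phn2antiphn.tcl: times are in integer milliseconds rather than TIMIT sample indices, the phone-to-class mapping is a made-up fragment, and clipping each transition at the midpoint of a short phone is an assumption about how overlapping transitions might be handled.

```python
HALF_WIDTH_MS = 20   # 20ms each side of the boundary: 40ms transition regions
GARBAGE = "o"
SILENCE = "S"        # hypothetical one-letter code for the silence class

# Hypothetical fragment of the phone -> class mapping (not the real table).
PHONE_CLASS = {"h#": "S", "f": "F", "ay": "V", "v": "F"}


def steady_label(phone, phone_class):
    """Non-transition silence gets the silence-silence label, else garbage."""
    cls = phone_class[phone]
    return SILENCE + SILENCE if cls == SILENCE else GARBAGE


def transition_labels(segments, phone_class, half_width=HALF_WIDTH_MS):
    """Convert contiguous (start_ms, end_ms, phone) segments into
    transition-class segments centered on the phone boundaries."""
    mids = [(s + e) // 2 for s, e, _ in segments]
    out = []
    cursor = segments[0][0]
    for i in range(len(segments) - 1):
        boundary = segments[i][1]
        # Assumption: a transition never extends past a phone's midpoint.
        t_start = max(boundary - half_width, mids[i])
        t_end = min(boundary + half_width, mids[i + 1])
        if t_start > cursor:
            out.append((cursor, t_start, steady_label(segments[i][2], phone_class)))
        label = phone_class[segments[i][2]] + phone_class[segments[i + 1][2]]
        out.append((t_start, t_end, label))
        cursor = t_end
    if segments[-1][1] > cursor:
        out.append((cursor, segments[-1][1], steady_label(segments[-1][2], phone_class)))
    return out


phones = [(0, 100, "h#"), (100, 200, "f"), (200, 400, "ay"),
          (400, 430, "v"), (430, 600, "h#")]
labels = transition_labels(phones, PHONE_CLASS)
```

In this example the 30ms /v/ is too short to hold two full half-widths, so its two transitions meet at its midpoint and no garbage segment is emitted for it.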
Given the labels in terms of the new phoneset, the transeme net can be trained. The transition-labelling files were converted to label pfiles in terms of the new phoneset indices with xlabel2pfile; since the number of frames in each utterance hasn't changed, the standard features pfile can be reused. I trained up a 162:400:82 unit net based on 9 frames of rasta-plp-8 plus deltas. The weights for this transition net are in ???.wts. The figure below shows typical probability outputs for the transeme classifier, alongside outputs for the phoneme classifier fed with the same features. The skew between the transitions in each probability stream is evident.
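The 162:400:82 layer sizes can be related to the nine-frame context window with a short Python sketch. The figure of 18 features per frame is inferred from 162 / 9 and is an assumption about how the rasta-plp-8 values and their deltas are counted; replicating the edge frames to pad the window is also an assumption, and `stack_context` is an illustrative name rather than a tool from our software.

```python
CONTEXT_FRAMES = 9
FEATURES_PER_FRAME = 18   # assumed: 162 inputs / 9 frames
N_HIDDEN = 400
N_OUTPUT = 82


def stack_context(frames, context=CONTEXT_FRAMES):
    """Concatenate each frame with its neighbours to form the net input.

    Frames beyond the start or end of the utterance are replaced by the
    nearest real frame (an assumed padding scheme).
    """
    half = context // 2
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - half, t + half + 1):
            window.extend(frames[min(max(k, 0), n - 1)])
        stacked.append(window)
    return stacked


n_input = CONTEXT_FRAMES * FEATURES_PER_FRAME
# Connection weights only; bias terms are not counted here.
n_weights = n_input * N_HIDDEN + N_HIDDEN * N_OUTPUT

# Dummy utterance: 20 frames whose values equal the frame index.
utterance = [[float(t)] * FEATURES_PER_FRAME for t in range(20)]
net_input = stack_context(utterance)
```

Each row of `net_input` has 162 values, one row per frame, so the frame count of the utterance is unchanged and the existing label pfile alignment is preserved.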
Having trained up a classifier net, the next thing to try was word recognition based on these features alone. In order to test this, I needed a set of pronunciations for the entire numbers lexicon in terms of transition classes. The most flexible format for modifying pronunciations is the Noway `dictionary' file (even though I was using the Y0 decoder through the dr_embed scripts), so I converted the original Y0 model file I had been using into Noway format with y0lex2nowayphones (which generates both the Noway dictionary file and the corresponding file of per-phone state models).
The original Y0 model file I had been using translated to 174 distinct pronunciations in the Noway dictionary, which I understand is the full set originally derived from the OGI hand-transcriptions by Dan Gildea. Some of these pronunciations were of low probability and doubtful form, such as variants of words ending in /n/ that had a subsequent reduced vowel appended (e.g. /f ih f t iy n ax/). I went through all the original pronunciations by hand, stripping out the ones that seemed unreasonable or had very low estimated probabilities, and rounding out a few sets like the possible combinations of first and second vowels in "seven". When I re-ran my standard phone-based baseline system with this new, 70-entry lexicon, the error rate went down a little from XX% to XX%. This new lexicon is xxxxxx.dct in Noway format, and xxxxx.y0lex after recombination with the phone models to build a Y0 lexicon.
To derive a new set of transition-based pronunciations from this lexicon, I added extra routines to the Tcl source in phn2antiphn.tcl to read and write dictionary files, and to convert pronunciations into the implied sequence of broad-class transitions. Problems arise at the starts and ends of words when defining pronunciations in terms of transitions; the correct transition class depends on the abutting phoneme in the adjacent word, and that transition is, notionally, shared between the two words. Our word-oriented decoding cannot at present accommodate such cross-word dependency in the pronunciations, so some compromise was required. I chose to attach the inter-word transitions to the beginning of each word pronunciation (where they form the transition from the end of the preceding word), and, since I couldn't enforce the single transition appropriate to the phoneme at the end of that unknown word, I made it a set of 9 alternative transitions, into the known initial phoneme class of the current pronunciation from each of the 9 possible classes for the preceding sound. I actually did this by employing a set of new symbols in the rewritten pronunciations such as xF, meaning a transition from an unknown phoneme (indicated by `x') at the end of the preceding word to a front consonant (F), or whichever class starts the pronunciation; the 9-way parallel choices were implemented by special hand-crafted state models defined for these symbols in the Noway model file.
The principle of rewriting these pronunciations is illustrated in the figure below. Each original phoneme is mapped into its broad class, and the boundaries between each pair of class labels result in a single transition label between those two classes. Since successive phonemes can actually be from the same class (i.e. the /ao/ and /r/ in the example, which are both classed as middle vowels), it's possible to have transitions from a class to itself which nonetheless correspond to a distinct spectral transition (and will have been labelled as such in the training set). Between each transeme label there may be several frames of steady-state vowel, accommodated by inserting the "o" state in between each transition state. As a further complication, the original pronunciations included minimum durations for each label (implemented as an obligatory prefix chain of states); although the transition models are intrinsically brief and hence not appropriate for extra duration enforcement, the minimum durations of the phone states are preserved in the garbage models which are centered on them, with the exception that the duration is reduced by two steps, to allow for one frame of transition at each end in the shortest realization. When the minimum duration was already 2 or less, this results in an "o0" vowel model, which is one that can be skipped altogether. However, it is probably important to retain the potential for some vowel-model states to get inserted, to allow the model to cover much slower instances of the word.
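The rewriting rule can be summarized in a few lines of Python. This is a sketch of the principle only, not the Tcl routines: the class code "M" for middle vowels, the assignment of /f/ to the front-consonant class, the minimum durations in the example, and the naming of garbage models other than "o0" as "o1", "o2" and so on are all assumptions made for illustration.

```python
# Hypothetical fragment of the phone -> one-letter class mapping.
PHONE_CLASS = {"f": "F", "ao": "M", "r": "M"}


def rewrite_pronunciation(phones, phone_class):
    """Rewrite (phone, min_duration) pairs as transition and garbage symbols.

    Each phone contributes the transition into it (from the unknown class
    'x' for the first phone of the word) followed by a garbage model whose
    minimum duration is the phone's own, reduced by two frames and floored
    at zero. No transition is emitted after the last phone, since it is
    attached to the start of the following word.
    """
    symbols = []
    prev_cls = "x"
    for phone, min_dur in phones:
        cls = phone_class[phone]
        symbols.append(prev_cls + cls)
        symbols.append("o%d" % max(min_dur - 2, 0))
        prev_cls = cls
    return symbols


# "four" with made-up minimum durations of 3, 4 and 2 frames.
four = rewrite_pronunciation([("f", 3), ("ao", 4), ("r", 2)], PHONE_CLASS)
```

The result for this example is xF o1 FM o2 MM o0: the /ao/ to /r/ boundary yields a same-class transition MM, and the short final /r/ yields a skippable o0 model.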
With pronunciations defined for the transeme-based labellings, it became possible to attempt a full word recognition based on these units. This was essentially a conventional recognition, with the transeme classification net as the probability estimator, feeding a decoder searching over the word models just defined. This system was able to perform recognition over the NUMBERS test set; however, its performance was considerably worse than the original phoneme-based model, with approximately twice the errors. These results are summarized in the table below.