1999may21 Dan Ellis firstname.lastname@example.org
Hynek Hermansky recently gave a talk at our lab in which he described work by Sangita Sharma and himself on looking across long time windows in a single frequency channel to find evidence for context-independent phones. Their first experiment was to form average trajectories for the energy contour in particular subbands over frames labelled with a particular phone class. These showed interesting difference-of-Gaussians shapes, and extended over several hundred milliseconds, much further than you might expect (given that a phone lasts about 100ms or less).
I was interested in how these averages varied across different frequency bands (Hynek showed just a few slices), and also how the variance of the measurements varied with time. So I made an equivalent calculation, of means and variances of features over a large time-frequency window. I used our 70 hour training subset of the Broadcast News data, and our latest forced-alignment labels. The base features, in this case, were standard Bark-scaled spectral coefficients over a 32ms window (from feacalc -ras no -plp no -dom log), with per-utterance normalization so that each channel was zero mean, unit variance within each utterance. Hynek's calculations had averaged every frame carrying a particular label (meaning that a single segment would be averaged at several offsets) but for the images below I used each segment just once, aligned to the center of a stretch of identically-labelled frames.
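The calculation described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used: the function names, the (frames x channels) array layout, the choice to round the center of an even-length segment down, and the way windows are truncated at utterance edges are all assumptions.

```python
import numpy as np

def normalize_utterance(feats):
    """Scale each channel to zero mean, unit variance within one utterance.
    feats has shape (n_frames, n_channels)."""
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    std[std == 0] = 1.0  # guard against constant channels
    return (feats - mean) / std

def segment_centers(labels):
    """Yield (label, center_frame) for each run of identically-labelled frames."""
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            yield labels[start], (start + i - 1) // 2  # assumed: round down
            start = i

def context_stats(utterances, phone, half_width=25):
    """Per-cell mean and SD over a +/- half_width frame window, using each
    segment of `phone` once, aligned at its center frame.
    utterances is an iterable of (feats, labels) pairs."""
    width = 2 * half_width + 1
    count = total = total_sq = None
    for feats, labels in utterances:
        feats = normalize_utterance(np.asarray(feats, dtype=float))
        n_frames, n_chan = feats.shape
        if count is None:
            count = np.zeros((width, 1))
            total = np.zeros((width, n_chan))
            total_sq = np.zeros((width, n_chan))
        for label, c in segment_centers(labels):
            if label != phone:
                continue
            # Assumed edge handling: keep only the frames that exist.
            lo, hi = max(0, c - half_width), min(n_frames, c + half_width + 1)
            rows = slice(lo - c + half_width, hi - c + half_width)
            window = feats[lo:hi]
            count[rows] += 1
            total[rows] += window
            total_sq[rows] += window ** 2
    if count is None:
        raise ValueError("no utterances supplied")
    denom = np.maximum(count, 1)
    mean = total / denom
    var = total_sq / denom - mean ** 2
    return mean, np.sqrt(np.maximum(var, 0.0))
```

Averaging every labelled frame instead, as in Hynek's version, would amount to looping over all frames carrying the label rather than over one center per segment.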
I plotted the individual means and standard deviations across all 19 frequency channels, for a window of +/- 25 frames of 16ms each. The images below show mean (lower half) and standard deviation (upper half) using a common pseudocolor scale, from approx -0.5 to 1.6. Note that the standard deviations are always about 1.0, and hence orangey, whereas the means are often around zero, and hence dark-bluey. Lighter blue tones indicate more negative means; red and yellow indicate progressively more positive values.
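As a rough sketch of how such a combined image could be assembled from per-phone mean and SD arrays: the function names and the (frames x channels) layout are assumptions, and treating 16ms as the step between successive frames is also an assumption. The time axis here is expressed as an offset from the segment center, not in the absolute seconds used on the figures.

```python
import numpy as np

FRAME_STEP_S = 0.016  # assumed: 16 ms between successive frames

def stack_for_display(mean, sd, vmin=-0.5, vmax=1.6):
    """Build one image with mean in the first n_channels rows and SD in the
    rows after them, clipped to a common color scale.
    mean and sd both have shape (n_frames, n_channels)."""
    image = np.vstack([mean.T, sd.T])  # shape (2 * n_channels, n_frames)
    return np.clip(image, vmin, vmax)

def time_axis(half_width=25):
    """Frame offsets from the segment center, in seconds."""
    return np.arange(-half_width, half_width + 1) * FRAME_STEP_S
```

With matplotlib, something like `imshow(image, origin="lower", aspect="auto", vmin=-0.5, vmax=1.6)` would put the mean rows at the bottom and the SD rows at the top, matching the layout described above.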
Looking at them this way, the images 'make sense' on the whole. Look for instance at "eh". Considering the mean spectral feature level (lower half), we see two concentrations of energy around t=0.5 sec (i.e. the middle of the window) and at frequencies of about 7 and 12 Bark, as expected for the formants of a classic "eh". The width of the central bumps is about 100ms (from 0.45 to 0.55s) which is approximately the expected duration of such a phone, blurred of course by the fact that the average is calculated over every frame in the phone, not just the 'middle'. More interestingly, on either side of the central ridge we see slight depressions centered, for the 9 Bark channel, at about 0.38s and 0.67s. These appear to be saying that the phones about 100ms before and up to 200ms after an "eh" tend to be lower-than-average in energy. This really just follows from the fact that vowels have the most energy, but the phones adjacent to vowels are most likely not vowels, and thus will have less energy. But it's interesting and possibly unanticipated to see these factors of statistical co-occurrence showing up in these displays.
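The co-occurrence argument can be made concrete with a toy calculation. Every number below is hypothetical, chosen only so that the overall mean comes out at zero (as the per-utterance normalization would give); none of them is measured from the Broadcast News data.

```python
# All values are hypothetical, for illustration only.
CLASS_MEAN_ENERGY = {"vowel": 1.0, "consonant": -0.5}  # normalized units
P_VOWEL_OVERALL = 1.0 / 3.0       # assumed fraction of segments that are vowels
P_VOWEL_NEXT_TO_VOWEL = 0.1       # assumed: vowels rarely neighbor vowels

def expected_energy(p_vowel):
    """Expected normalized energy when a segment is a vowel with probability p_vowel."""
    return (p_vowel * CLASS_MEAN_ENERGY["vowel"]
            + (1.0 - p_vowel) * CLASS_MEAN_ENERGY["consonant"])

overall = expected_energy(P_VOWEL_OVERALL)             # average over all segments
beside_vowel = expected_energy(P_VOWEL_NEXT_TO_VOWEL)  # average next to a vowel
```

Under these made-up figures the overall average is zero while the average next to a vowel is negative, which is the kind of depression that flanks the central ridge in the "eh" image.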
Looking at the top half, the standard deviations associated with each point, we see a roughly-constant SD of around 1.0 (after all, this is what the utterance normalization set it to), but much lower in the very center of the figure, where the spectrum always carried the "eh" label. There the SD is low for both the maxima and the minima of the spectrum in the lower half. Away from the center, we see maxima of variation in the top few channels out 200ms or more on each side, perhaps reflecting the bimodal distribution of these channels, which are dominated by sibilants.
Looking across other phones we see a range of behaviors. The vowel-like sounds have a fairly common structure reflecting the known formant peaks. The diphthongs ("ay", "ow", "oy") and glides ("w", "y") actually involve formant transitions but the averaged spectra don't really show this. Sibilants and nasals show energy bumps and holes much as we'd expect, as do the stops.
The behavior of the SDs across all the phones is quite interesting. Roughly speaking, the darker areas (indicating lower SD) mark the time-frequency regions with consistent behavior across all labeled frames. We see a wide variation, from stops and short vowels, with very limited central confident regions, to much more complex and wider patterns for things like "hv", "em" and "dx".
Hynek's point, when showing data like these, was that the central structured region for any given phone class was wider than we might have guessed: out to +/- 200ms, even though we would expect a phone to extend +/- 50ms at most. However, considering how this analysis will reveal the skewed statistics associated with phones adjacent to a specific phone, this actually makes a lot of sense. It does mean that we should be very suspicious of what we are learning when we train context-independent phone classifiers for datasets with very limited phone-transition pairs, such as numbers.