Date:    Fri, 25 Oct 1996 19:04:15 PDT
X-Phys-Location: 540 Greenwich St #2, San Francisco CA 94133 USA
X-Phone: +1 (415) 956-0215
From:    "Dan Ellis" <>
Subject: STP bestiary

Dear STPers - [apologies for the overlap with Steve's note - comms lag]

As promised, here are my notes on our discussion yesterday evening of the 
strange phenomena observed in transcribing the Switchboard speech.  I'm not 
exactly sure what categories of anomaly we should be paying most attention 
to, but it seems valuable and imperative to document our experience while 
it is still relatively fresh.

If you have any further ideas, or differences with how I have expressed 
the points below, I'd be very glad to get email suggestions which I will 
incorporate.  Please take the time to send me any thoughts or reactions.  

Some other random thoughts:
- One idea might be to try and gather a few actual examples (i.e. call 
  numbers and times) for each instance.  
- I liked the idea that came up last night of labeling by features rather 
  than by phonetic labels - or using the labels as short-hands for an 
  unrestricted space of feature combinations.  This would be a way to 
  canonicalize some of the weirder diacritic modifications.  But I guess 
  that's irrelevant to the subject of this message.

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The Strange Universe of Switchboard Transcription

Dan Ellis <>, comprising the ideas of
Candace Cardinal, Rachel Coulston, Charles Gotcher, Steve Greenberg, 
Joy Hollenback, John Ohala, Colleen Ritchie, Gail Solomon

version of 1996oct25

In the course of a project to build a large database of detailed 
hand-transcription for utterances in the 'SWITCHBOARD' database of 
informal telephone speech, we found numerous phenomena that were difficult 
or impossible to fit into our conventional phonetic perspective.  This 
document attempts to describe and organize these exceptions.

The transcription task was a conventional phonetic labeling: the speech 
signal was divided into segments, and each segment was given a label 
chosen from a custom set intermediate between a 'complete' phonetic 
catalog and the limited range typically used in automatic speech 
recognition.  However, the requirement that a single instant be chosen as 
the boundary between adjacent phonemes was problematic.  For instance:
  * Some stops get smoothed out into approximants, e.g. 'soft' /g/s, marked 
    principally by an amplitude dip (like the Spanish "agua"?)
  * Vowels adjacent to /r/ would take on 'r-coloration' with no clear 
    portion that could be called /r/ alone as distinct from the vowel
  * Vowels following /y/ could look, arguably, like a continuation of the 
    /y/ through their entire duration, making them hard to identify.
  * Post-vocalic laterals may be difficult to locate - a 'sound change'?
    ("Bill" becomes more like "Biw")
  * Gemination (?) between nasals and voiced stops may eliminate the 
    actual stop, although the stop-phoneme is marked by a burst 
    (continuous voicing through "and a")
  * Glides were finessed in the transcription system by providing for 
    'transition' regions as separate segments between two vowels.  
    However, this isn't a well-defined or satisfactory solution.
    (classic transition example from STP handout?)
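
To make the boundary requirement concrete, here is a minimal Python sketch 
of a segment list, assuming each segment is stored as a start time, an end 
time and a label.  The Segment class, the boundary_problems helper, the 
'trans' label name and all of the times are made up for illustration; they 
are not the actual STP file format or label set.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds (illustrative values only)
    end: float
    label: str    # hypothetical label names, not the real STP set


def boundary_problems(segments, tol=1e-6):
    """Return each index i where segment i does not end exactly
    where segment i+1 begins (a gap or an overlap)."""
    return [
        i
        for i, (a, b) in enumerate(zip(segments, segments[1:]))
        if abs(a.end - b.start) > tol
    ]


# Two vowels with a 'transition' region stored as its own segment.
segments = [
    Segment(0.00, 0.08, "iy"),
    Segment(0.08, 0.11, "trans"),
    Segment(0.11, 0.20, "aa"),
]
print(boundary_problems(segments))
```

In this layout every pair of neighbours has to share one instant, so a 
gradual change such as r-coloration or a glide can only be expressed by 
picking an arbitrary cut point or by adding a separate transition segment.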

In certain situations, it seemed as though a particular perception (such as 
the presence of a particular word in a phrase) was carried not by the 
'traditional' representation (a particular acoustic pattern), but by some 
alternative realization (a short, silent gap or extension).  Often this 
suggested an illusion: the sound was perceived even though the expected 
acoustic cues were absent.  That points to the agency of top-down 
expectation, which the speaker may nonetheless have deliberately enlisted.
  * An unstressed article before nouns seemed to have no evidence other 
    than the timing of the words around it - a slight elongation compared 
    to the pronunciation expected if the phrase did not contain an article.
  * Taps/flaps and other minimal syllabic separators (between two vowels?) 
    might be marked by a slight dip in amplitude, visible mainly in the 
    waveform, rather than any spectral change.
  * Affricates following nasals ("and the") were barely there:  you could 
    hear them, but it was faith alone that permitted you to see frication 
    energy in the vowel on the spectrogram.

The label set we used, even in conjunction with our diacritics, provided a 
fairly coarse coverage of speech sounds.  Many examples could not be 
neatly fit into the defined set - common cases included:
  * Weird syllabics - syllabic /z/ etc.
  * Fricated release of stops (an example for this?)
  * There seemed to be some kind of continuum between tap-like stops,
    glottal stops and creaky voice.  A phoneme that might otherwise be a 
    glottal stop was realized as a brief period of creakiness in a vowel - 
    but should this be marked as /q/ or with the _cr diacritic? (Is this 
    called preglottalization?)
  * /ng/ was rarely seen in 'canonical' form, seemed very variable in its 
  * Stops occurred in voiced and voiceless forms, despite their classic 
    description.  Is a /b_vl/ different from a /p/?
  * /t/ /y/ and /t/ /r/ end up sounding more like /ch/.  Is this a sound 
    change (meaning we should just label it /ch/) or is it something 
  * The actual spectrum (and sound) of /s/ and /sh/ varied considerably by 
    context;  is it 'right' to mark these as the same phoneme?
  * Unstressed (reduced) vowels are hard to classify.  It seems as though 
    they are drawn from a smaller set (?)
  * Reductions can occur for segments other than vowels, too (examples?)

Some of the problems encountered did not reflect a more serious conceptual 
mismatch with the notion of transcription, and could have been avoided by 
simple modifications to the transcription conventions used.
  * Stops were supposed to be broken into closure and release segments, 
    e.g. /pcl/ /p/.  But sometimes a clear closure would have no obvious 
    mate.  We accommodated this mid-project by allowing the _cl diacritic.
  * Although p/b/t/d/k/g had stop/release markers, affricates (/ch/ etc) 
    didn't, which seemed inconsistent.
  * In general, we would have liked to have more rigorous definitions of 
    every label we used;  transcribers were left with more 'poetic license' 
    than was comfortable.  Something like the fabled OGI handbook would be 
    a useful thing to produce for future transcription projects.
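
The closure/release convention is simple enough to check mechanically.  
The following is a rough Python sketch, assuming closure labels are formed 
as stop + "cl" for all six stops (only /pcl/ /p/ is given above; the other 
five are assumed by analogy) and that a release must immediately follow its 
closure.  The function name and the example label sequence are 
hypothetical, and the _cl diacritic is not modelled.

```python
STOPS = ("p", "b", "t", "d", "k", "g")
# Assumed naming: closure label = stop + "cl", e.g. "pcl" -> "p".
CLOSURES = {stop + "cl": stop for stop in STOPS}


def unmatched_closures(labels):
    """Return the indices of closure labels that are not immediately
    followed by their matching release label."""
    unmatched = []
    for i, label in enumerate(labels):
        release = CLOSURES.get(label)
        if release is None:
            continue  # not a closure label
        following = labels[i + 1] if i + 1 < len(labels) else None
        if following != release:
            unmatched.append(i)
    return unmatched


# Made-up label sequence: "dcl" has its release, "tcl" does not.
example = ["ae", "n", "dcl", "d", "ax", "tcl", "s"]
print(unmatched_closures(example))
```

A listing like this would show how often a clear closure had no release, 
which is the kind of count that could justify changing a convention 
before, rather than during, a project.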

 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -