Switchboard Transcription System

Switchboard is a corpus of several hundred informal speech dialogs recorded over the telephone (Godfrey et al., 1992). The corpus is extensively used for development and testing of speech recognition algorithms, and is considered to be fairly representative of spontaneous discourse. In contrast to carefully enunciated, read speech (such as that found in the Wall Street Journal and TIMIT corpora), the speech contained in Switchboard tends to deviate significantly from the "canonical" speech patterns that linguists and speech scientists have long associated with spoken language. For this reason the corpus offers especially interesting challenges for phonetic transcription. Because the speech recognition community has long used a phonetic symbol set based on, or closely related to, the original transcription system developed for TIMIT, the present project utilizes a similar system. However, neither this symbol set nor any other current transcription system truly characterizes the range of acoustic variation found in spontaneous speech. It should be borne in mind that the phonetic transcription represents only a crude approximation to the actual sounds. Diacritics are used to modify the canonical symbols in order to provide a more accurate representation of the acoustic properties of the speech. Even these, however, rarely capture the full complexity and nuance of the phonetic patterns. Caveat emptor!

This manual describes the procedures used to phonetically transcribe the Switchboard Corpus, and is intended primarily as a reference for members of the Switchboard Transcription Project (STP), charged with this task. However, because many in the speech recognition community have requested documentation on the transcription system, additional background information is provided.

During 1996 STP used a hybrid symbol set, composed of phonetic symbols derived from the TIMIT corpus (ArpaBet), along with diacritical elements to denote deviation from the "canonical" pattern. Each phonetic segment was labeled using this symbol set (included in this document as Appendix A) and its temporal onset and offset specified. The transcriptions were derived from phone alignments provided for each file by Bill Byrne of the Center for Language and Speech Processing at Johns Hopkins University. Transcribers were asked to "correct" both the phone labels and phone alignments provided by JHU.

This transcription procedure was found to be rather time-consuming (taking on average nearly 400 times real time to complete). For this reason, STP is now transcribing on the syllabic, rather than the phone, level. What this means is that transcribers are no longer asked to correct alignments for each of the phone elements in a file. Rather, the initial phone alignments are post-processed using a program written by Dan Ellis of ICSI that suppresses the alignments of phones interior to each syllable, and then groups the phone labels from the original alignments into syllabic units (based on Bill Fisher's tsyl2 program, which is, in turn, based on rules from Dan Kahn's 1976 thesis on the syllabification of spoken English). Transcribers are thus responsible only for ensuring correct alignments for syllable units and for specifying the phonetic composition of each of these units. This modification of the transcription procedure appears to reduce the transcription time by a significant amount, enough, we hope, to produce a body of transcription data sufficient in size and scope to adequately train speech recognition systems for this summer's speech recognition workshop at Johns Hopkins. It is our hope that alignments for the interior phones of the syllables will be automatically derivable from knowledge of the syllabic boundaries and phonetic constituents. This additional information, while separate from STP, may be provided by ICSI, or some other institution involved with the JHU summer workshop.
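The grouping step described above can be illustrated with a small sketch. This is not the actual Ellis/Fisher tool (real syllabification follows Kahn's 1976 rules and is far more elaborate); here syllable membership is simply given as a list of phone counts, and the function name `group_into_syllables` is hypothetical:

```python
# Hypothetical sketch of the syllable-level post-processing: given
# phone-level alignments, suppress interior phone boundaries and emit
# one labeled unit per syllable. Syllable membership is supplied
# explicitly rather than derived by rule.

def group_into_syllables(phone_alignments, syllable_sizes):
    """phone_alignments: list of (label, start, end) tuples, in order.
    syllable_sizes: number of phones in each successive syllable."""
    syllables = []
    i = 0
    for size in syllable_sizes:
        group = phone_alignments[i:i + size]
        labels = " ".join(label for label, _, _ in group)
        start = group[0][1]   # onset of the syllable's first phone
        end = group[-1][2]    # offset of the syllable's last phone
        syllables.append((labels, start, end))
        i += size
    return syllables

# "warmth" ~ [w ao r m p_epi th], one syllable
phones = [("w", 0.00, 0.05), ("ao", 0.05, 0.20), ("r", 0.20, 0.28),
          ("m", 0.28, 0.35), ("p_epi", 0.35, 0.38), ("th", 0.38, 0.45)]
print(group_into_syllables(phones, [6]))
# [('w ao r m p_epi th', 0.0, 0.45)]
```

The syllable unit thus keeps only its outer boundaries and the full string of phone labels, which is exactly the information transcribers are asked to verify.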

Each phone conversation is divided into a series of files containing a relatively continuous stream of speech spoken by a single individual. The file length is determined largely by the duration of a conversational "turn." Speakers tend to take turns holding center stage. Occasionally, the other speaker will interrupt the foreground speaker, thereby "grabbing" his or her turn. Each conversational turn lasts between 0.5 and 25 seconds; the majority of files last about 3 to 5 seconds. In 1996 STP filtered out files shorter than 1 second, since these generally consisted of such interjections as "Yeah" and "Uh huh," which were not sufficiently rich and varied in their linguistic associations to merit the extraordinary amount of time required to transcribe them. This year we are asking that a very limited subset of these very short files be included for transcription.

Transcription is effected with the assistance of Entropic's Xwaves program, which presents four representations of the speech signal -

(1)     the pressure waveform
(2)     a wideband, color spectrographic display 
(3)     the preliminary phonetic transcription, derived from the automatic 
        phone alignments
(4)     the word-level transcription provided by a team of court reporters at 
        an earlier stage of Switchboard development

The transcribers are asked to use all of this information, in combination with audio playback of the speech signal, to derive an accurate transcription. In our experience no single source of information is sufficient to provide an accurate transcription. In listening to the speech, we recommend that both the individual segment and a broader context (at the syllabic, word and phrasal levels) be used to derive a phonetic identity, in tandem with the waveform and spectrographic information. Transcribers are specifically instructed not to rely on "higher-level" information alone in labeling the phonetic constituents.

There is a general tradeoff between "speed" and "accuracy." The transcribers have been asked to transcribe in a manner which optimizes both speed and accuracy, but not to be overly concerned with the fine details of the phonetic identity of individual elements. In instances of uncertainty, elements are designated with a question mark [?] (see below).

The previous year's transcription lumped all instances of filled pauses, silence and non-speech sounds into a single "waste basket" category [h#]. This classification has been modified to include a finer-grained analysis of these extra-linguistic intervals (Section X).

Another departure from last year's transcription concerns the application of diacritics. In the past, transcribers were encouraged to use diacritics to modify the standard phonetic symbol for added precision. Because the use of such diacritics requires additional time, and because many of the diacritical symbols were not used with a high degree of consistency, their utilization has been changed. Transcribers are now asked to mark with diacritics only when there is a significant deviation from the normal pattern and otherwise to refrain from using them. This policy was also motivated by the inconsistent use of such diacritics by various recognition groups.

A note on the use of diacritics: diacritics are linked to the primary segment label with an underscore (_) symbol. They are used to denote significant deviation from the typical pattern. For instance, certain segments tend to be devoiced, either in whole or, more typically, during the latter portion of the sound. A diacritic is used when the phonetic property is a significant departure from canonical, and where it applies to at least half of the segment duration (or, in instances where it applies to less than half, where the duration is appreciable, as would be the case for a stressed or emphasized syllable).

A list of the linguistic diacritics is as follows:

       _ap     approximant articulation, as might occur with 
                certain stops
       _co     trace of phonetic segment in syllable coda
       _epi    epenthetic stop (as in [w ao r m p_epi th])
       _fr     frication of a usually non-fricated segment
       _n      nasalization of a usually non-nasalized segment
       _on     trace of phonetic segment in syllable onset
       _vd     voicing, either partial or complete, of a normally 
                voiceless segment
       _vl     devoicing, either partial or complete, of a normally 
                voiced segment (e.g., 'well!' [w eh_vl l_vl])
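Because diacritics attach to the base symbol with underscores, a label can be decomposed mechanically. The following is an illustrative sketch, not part of STP tooling; the function name `parse_label` and the validation step are our own additions:

```python
# Split a transcription label into its base phone and any diacritics,
# which are attached with underscores, e.g. "eh_vl" -> ("eh", ["vl"]).
# The diacritic inventory below is taken from the lists in this manual.

DIACRITICS = {"ap", "co", "epi", "fr", "n", "on", "vd", "vl",
              "cr", "?", "!", "#"}

def parse_label(label):
    parts = label.split("_")
    base, rest = parts[0], parts[1:]
    unknown = [d for d in rest if d not in DIACRITICS]
    if unknown:
        raise ValueError("unrecognized diacritic(s): %s" % unknown)
    return base, rest

print(parse_label("eh_vl"))   # ('eh', ['vl'])
print(parse_label("p_epi"))   # ('p', ['epi'])
print(parse_label("t"))       # ('t', [])
```

A label with no underscore is simply a bare phone with an empty diacritic list.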

I.  Vowels (17)

      iy      'beat'
      ih      'bit'
      ey      'bait'
      eh      'bet'
      ae      'bat'
      aa      'bot' (as in robot)
      ux      high, front, rounded allophone of /uw/, as in 'suit'
      ix      high, central vowel (unstressed), as in 'roses'
      ax      mid, central vowel (unstressed), as in 'the'
      ah      mid, central vowel (stressed) as in 'butt'
      uw      'boot'
      uh      'book'
      ao      'bought'
      ay      'bite'
      oy      'boy'
      aw      'bough'
      ow      'boat'

The temporally "reduced" vowels, [ix, ax, ux, uh] are often hard to label with any degree of precision and confidence. The transcribers have been asked not to spend an inordinate amount of time determining the identity of these segments. See Section X on "filled pauses" for discussion of vocalic segments in extra-linguistic contexts.

II. Liquids (2)

        l       'led'
        r       'red'

III. Glides (4)

        y       'yet'
        w       'wet'
        hw      'what' 
        lg      lateral glide, phonetic vowel, allophone of /l/.

IV. Syllabic resonants (5)

        er      'bird'
        el      syllabic allophone of /l/, as in 'bottle'
        em      syllabic allophone of /m/, as in 'yes 'em' 
                 ('yes ma'am')
        en      syllabic allophone of /n/, as in 'button'
        eng     syllabic allophone of /ng/, as in 'Washington'

V. Stops (7)

        p       'pop'
        b       'bob'
        t       'tot'
        d       'dad'
        k       'kick'
        g       'gag'
        q       glottal stop - allophone of /t/, as in 'Atlanta'
                 where the first /t/ can be realized as [q].
                Also may occur between words in continuous speech,
                especially at vowel-vowel boundaries, and at the beginning
                of vowel-initial utterances.

Do not mark the stop closures explicitly, as was done in the 1996 project.

VI.  Nasals (3)

        m       'mom'   (nasal stop)
        n       'non'   (nasal stop)
        ng      'sing'  (nasal stop - only occurs in syllable-
                         final position in English)

VII. Affricates (2)

        ch      'church'
        jh      'judge'

VIII. Fricatives (9)

        f       'fief'
        v       'verve'
        th      'thief'
        dh      'they'
        s       'sis'
        z       'zoo'
        sh      'shoe'
        zh      'measure'
        hh      'hay'

IX.  Flaps and trills (2)

         dx    alveolar flap (allophone of [d] or [t])
         nx    nasal flap (allophone of [n])

X. Filled Pauses (3)
A filled pause refers to a speech sound that does not denote a specific word, 
but is used as a time marker, as in "uh" or "um."

         pv     filled pause, vocalic-like segment. Typically ranges
                among [ax], [ah], [ix], [ih], but is rather indistinct.
                The lexical item is typically coded as "uh"

         pn     occasionally the filled pause is an elongated syllabic nasal,
                as in "hmmmmm," similar to a syllabic nasal [em]. The same 
                symbol will be used regardless of whether the segment sounds 
                like [m], [n] or [ng]

         pv pn  filled pause vocalic segment followed by a nasal as in "um"

XI. Non-speech (2)
Non-speech refers to acoustic elements that do not possess any clear linguistic relevance. Such instances would include a door slam, the phone dropping on the floor, clearing of the throat, and laughter. These non-speech elements should not be phonetically transcribed, but rather marked with the h# symbol. The word-level transcription should specify the nature of the intrusive sound with precision.

          h#     indicates a non-speech sound other than silence
          sil    silence within an utterance that does not
                 correspond to the closure for a stop or affricate
                 or to some form of non-speech.

XII. Other symbols and diacritics (5)
        _cr     creaky voice. Mark at segment level only if phonemically
                contrastive. Otherwise indicate at comment-line level
         ?      unknown speech sound
        _?      uncertain about identity of phonetic segment
        _!      unusual speech pattern - deviates significantly from normal
                  e.g., weird stress, pronunciation, etc.
        _#      truncated segment (as when it has been prematurely 
                 cut off by the computer segmenter)
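The symbol lists in Sections I-XII together form a closed inventory, so a transcription can be checked mechanically for stray labels. The sketch below is our own illustration (the function name `is_valid_base` is hypothetical); it collects the base symbols from this manual and tests membership:

```python
# Base-symbol inventory assembled from Sections I-XI of this manual.
VOWELS = {"iy", "ih", "ey", "eh", "ae", "aa", "ux", "ix", "ax", "ah",
          "uw", "uh", "ao", "ay", "oy", "aw", "ow"}
CONSONANTS = {"l", "r",                              # liquids
              "y", "w", "hw", "lg",                  # glides
              "er", "el", "em", "en", "eng",         # syllabic resonants
              "p", "b", "t", "d", "k", "g", "q",     # stops
              "m", "n", "ng",                        # nasals
              "ch", "jh",                            # affricates
              "f", "v", "th", "dh", "s", "z", "sh", "zh", "hh",  # fricatives
              "dx", "nx"}                            # flaps
OTHER = {"pv", "pn", "h#", "sil", "?"}               # pauses, non-speech, unknown
INVENTORY = VOWELS | CONSONANTS | OTHER

def is_valid_base(symbol):
    """True if symbol (without diacritics) is in the STP inventory."""
    return symbol in INVENTORY

print(is_valid_base("ix"))   # True
print(is_valid_base("qq"))   # False
```

A fuller checker would first strip any underscore-linked diacritics before testing the base symbol.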

Godfrey, J. J., Holliman, E. C. and McDaniel, J. (1992) SWITCHBOARD: Telephone 
speech corpus for research and development, IEEE ICASSP 1: 517-520.
Revision, February 19, 1997
