Switchboard Transcription System
INTRODUCTION
Switchboard is a corpus of several hundred informal speech dialogs recorded
over the telephone (Godfrey et al., 1992). The corpus is extensively used for
development and testing of speech recognition algorithms, and is considered to
be fairly representative of spontaneous discourse. In contrast to carefully
enunciated, read speech (such as found in the Wall Street Journal and TIMIT
corpora), the speech contained in Switchboard tends to deviate significantly
from the "canonical" speech patterns that linguists and speech scientists have
long associated with spoken language. For this reason the corpus offers
especially interesting challenges for phonetic transcription. Because the
speech recognition community has long used a phonetic symbol set based on, or
closely related to, the original transcription system developed for TIMIT, the
present project utilizes a similar system. However, neither this symbol set
nor any other current transcription system truly characterizes the range of
acoustic variation found in spontaneous speech. It should be borne in mind that
the phonetic transcription represents only a crude approximation to the actual
sounds. Diacritics are used to modify the canonical symbols in order to
provide a more accurate representation of the acoustic properties of the
speech. Even these, however, rarely capture the full complexity and nuance of
the phonetic patterns. Caveat Emptor!
This manual describes the procedures used to phonetically transcribe the
Switchboard Corpus, and is intended primarily as a reference for members of
the Switchboard Transcription Project (STP), charged with this task. However,
because many in the speech recognition community have requested documentation
on the transcription system, additional background information is provided.
BACKGROUND
During 1996 STP used a hybrid symbol set, composed of phonetic symbols derived
from the TIMIT corpus (ArpaBet), along with diacritical elements to denote
deviation from the "canonical" pattern. Each phonetic segment was labeled
using this symbol set (included in this document as Appendix A) and its
temporal onset and offset specified. The transcriptions were derived from
phone alignments provided for each file by Bill Byrne from the Center for
Language and Speech Processing of Johns Hopkins University. Transcribers were
asked to "correct" both the phone labels and phone alignments provided by JHU.
This transcription procedure was found to be rather time-consuming (taking on
average nearly 400 times real time to complete). For this reason, STP is now
transcribing at the syllabic, rather than the phone, level. What this means is that
transcribers are no longer asked to correct alignments for each of the phone
elements in a file. Rather, the initial phone alignments are post-processed
using a program written by Dan Ellis of ICSI that suppresses the alignments of
phones interior to each syllable, and then groups the phone labels from the
original alignments into syllabic units (based on Bill Fisher's tsyl2 program,
which is, in turn, based on rules from Dan Kahn's 1976 thesis on
syllabification of spoken English). Transcribers are thus responsible only for
ensuring correct alignments for syllable units and for specifying the
phonetic composition of each of these units. This modification of the
transcription procedure appears to reduce the transcription time by a
significant amount, enough, we hope, to produce a body of transcription data
sufficient in size and scope to adequately train speech recognition systems
for this summer's speech recognition workshop at Johns Hopkins. It is our hope
that alignments for the interior phones of the syllables will be automatically
derivable from knowledge of the syllabic boundaries and phonetic constituents.
This additional information, while separate from STP, may be provided by ICSI,
or some other institution involved with the JHU summer workshop.
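The post-processing step described above can be pictured with a minimal
Python sketch. This is not Dan Ellis's program or tsyl2: it assumes that the
syllable membership of each phone has already been decided, and the names
Phone, Syllable and group_into_syllables, as well as the example timings, are
hypothetical. It only shows how interior phone boundaries are dropped while
the phone labels and the syllable onset and offset are kept.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass
class Phone:
    label: str    # phone symbol, e.g. "ao"
    start: float  # onset in seconds
    end: float    # offset in seconds
    syl: int      # index of the containing syllable (assumed to be given)


@dataclass
class Syllable:
    labels: Tuple[str, ...]  # phonetic composition of the syllable
    start: float             # onset of the first phone
    end: float               # offset of the last phone


def group_into_syllables(phones: List[Phone]) -> List[Syllable]:
    """Collapse consecutive phones sharing a syllable index into one unit.

    Interior phone boundaries are discarded; only the outer boundaries
    and the ordered phone labels survive.
    """
    syllables = []
    for _, group in groupby(phones, key=lambda p: p.syl):
        members = list(group)
        syllables.append(
            Syllable(
                labels=tuple(p.label for p in members),
                start=members[0].start,
                end=members[-1].end,
            )
        )
    return syllables


# Hypothetical alignment for a two-syllable word.
phones = [
    Phone("w", 0.00, 0.06, 0),
    Phone("ao", 0.06, 0.15, 0),
    Phone("r", 0.15, 0.21, 0),
    Phone("m", 0.21, 0.27, 1),
    Phone("er", 0.27, 0.40, 1),
]
syllables = group_into_syllables(phones)
```

In this sketch the transcriber would then check only the two syllable
boundaries and the label sequence inside each unit, rather than five
separate phone alignments.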
GENERAL TRANSCRIPTION STRATEGY
Each phone conversation is divided into a series of files containing a
relatively continuous stream of speech spoken by a single individual. The file
length is determined largely by the duration of a conversational "turn."
Speakers tend to take turns holding center stage. Occasionally, the other
speaker will interrupt the foreground speaker, thereby "grabbing" his or her
turn. Each conversational turn lasts between 0.5 and 25 seconds.
The majority of files last about 3 to 5 seconds. In 1996 STP filtered out
files shorter than 1 second since these generally consisted of such
interjections as "Yeah" and "Uh huh," which were not sufficiently rich and
varied in their linguistic associations to merit the extraordinary amount of
time required to transcribe them. This year we are asking that a very limited
subset of these very short files be included for transcription.
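As an illustration of the 1996 duration filter only, the following Python
sketch keeps files that are at least 1 second long. The function name
select_files and the file names and durations are hypothetical; the sketch
does not model how this year's limited subset of very short files is chosen.

```python
MIN_DURATION_S = 1.0  # 1996 threshold stated in the text


def select_files(durations, min_duration_s=MIN_DURATION_S):
    """Return the names of files at least min_duration_s long, in input order."""
    return [name for name, dur in durations.items() if dur >= min_duration_s]


# Hypothetical file names with durations in seconds.
durations = {"turn_a": 0.6, "turn_b": 3.2, "turn_c": 1.0, "turn_d": 24.5}
selected = select_files(durations)
```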
Transcription is effected with the assistance of Entropic's Xwaves program,
which presents four representations of the speech signal:
(1) the pressure waveform
(2) a wideband, color spectrographic display
(3) the preliminary phonetic transcription, derived from the automatic
alignments
(4) the word-level transcription provided by a team of court reporters at
an earlier stage of Switchboard development
The transcribers are asked to use all of this information, in combination with
the audio playback of the speech signal, to derive an accurate transcription. In
our experience no single source of information is sufficient to provide an
accurate transcription. In listening to the speech, we recommend that both the
individual segment and a broader context (at the syllabic, word and phrasal
levels) be used to derive a phonetic identity, in tandem with the waveform and
spectrographic information. Transcribers are specifically instructed not to
rely solely on "higher-level" information in labeling the phonetic constituents.
There is a general tradeoff between "speed" and "accuracy." The transcribers
have been asked to transcribe in a manner which optimizes both speed and
accuracy, but not to be overly concerned with the fine details of the phonetic
identity of individual elements. In instances of uncertainty, elements are
designated with a question mark [?] (see below).
The previous year's transcription lumped all instances of filled pauses,
silence and non-speech sounds into a single "waste basket" category [h#]. This
classification has been modified to include a finer-grained analysis of these
extra-linguistic intervals (Section X).
Another departure from last year's transcription concerns the application of
diacritics. In the past, transcribers were encouraged to use diacritics to
modify the standard phonetic symbol for added precision. Because the use of
such diacritics requires additional time, and because many of the diacritical
symbols were not used with a high degree of consistency, their utilization has
been changed. Transcribers are now asked to mark with diacritics only when
there is a significant deviation from the normal pattern and otherwise to
refrain from using them. This policy was also motivated by the inconsistent
use of such diacritics by various recognition groups.
TRANSCRIPTION SYMBOLS
Note on the use of diacritics. Diacritics are linked to the primary segment
label with an underscore (_) symbol. They are used to denote significant
deviation from the typical pattern. For instance, certain segments tend to be
devoiced, either in whole, or more typically, during the latter portion of the
sound. A diacritic is used when the phonetic property is a significant
departure from canonical, and where it applies to at least half of the segment
duration (or, in instances where it applies to less than half, where the
duration is appreciable, as would be the case for a stressed or emphasized
syllable).
A list of the linguistic diacritics is as follows:
_ap approximant articulation, as might occur with
certain stops
_co trace of phonetic segment in syllable coda
_epi epenthetic stop (as in [w ao r m p_epi th])
_fr frication of a usually non-fricated segment
_n nasalization of a usually non-nasalized segment
_on trace of phonetic segment in syllable onset
_vd voicing, either partial or complete, of a normally
voiceless segment
_vl devoicing, either partial or complete, of a normally
voiced segment (e.g., 'well!' [w eh_vl l_vl])
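The underscore convention and the "at least half of the segment" rule can be
sketched in Python as follows. This is an illustrative sketch, not a tool
used by STP: the function names are hypothetical, the table covers only the
linguistic diacritics listed above, and whether a shorter deviation is
"appreciable" is left to the transcriber and passed in as a flag.

```python
LINGUISTIC_DIACRITICS = {
    "ap": "approximant articulation",
    "co": "trace of phonetic segment in syllable coda",
    "epi": "epenthetic stop",
    "fr": "frication of a usually non-fricated segment",
    "n": "nasalization of a usually non-nasalized segment",
    "on": "trace of phonetic segment in syllable onset",
    "vd": "voicing of a normally voiceless segment",
    "vl": "devoicing of a normally voiced segment",
}


def split_label(label):
    """Split a label such as 'eh_vl' into ('eh', ['vl'])."""
    base, *marks = label.split("_")
    return base, marks


def parse_transcription(text):
    """Parse a space-separated transcription into (base, diacritics) pairs."""
    return [split_label(token) for token in text.split()]


def describe(label):
    """Look up the meaning of each diacritic attached to a label."""
    _, marks = split_label(label)
    return [LINGUISTIC_DIACRITICS.get(m, "not in this list") for m in marks]


def warrants_diacritic(deviant_s, segment_s, appreciable=False):
    """Apply the duration rule: mark when the deviation covers at least half
    of the segment, or when a shorter deviation is judged appreciable."""
    return deviant_s >= 0.5 * segment_s or appreciable


parsed = parse_transcription("w eh_vl l_vl")
```

Here parsed pairs each base symbol with its diacritics, so the devoiced
[eh_vl] and [l_vl] of 'well!' each carry the single mark "vl".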
I. Vowels (17)
iy 'beat'
ih 'bit'
ey 'bait'
eh 'bet'
ae 'bat'
aa 'bot' (as in robot)
ux high, front, rounded allophone of /uw/ as in 'suit'
ix high, central vowel (unstressed), as in 'roses'
ax mid, central vowel (unstressed), as in 'the'
ah mid, central vowel (stressed) as in 'butt'
uw 'boot'
uh 'book'
ao 'bought'
ay 'bite'
oy 'boy'
aw 'bough'
ow 'boat'
The temporally "reduced" vowels [ix, ax, ux, uh] are often hard to label with
any degree of precision and confidence. The transcribers have been asked not
to spend an inordinate amount of time determining the identity of these
segments. See Section X on "filled pauses" for discussion of vocalic segments
in extra-linguistic contexts.
II. Liquids (2)
l 'led'
r 'red'
III. Glides (3)
y 'yet'
w 'wet'
hw 'what'
lg lateral glide, phonetic vowel, allophone of /l/.
IV. Syllabic resonants (6)
er 'bird'
el syllabic allophone of /l/, as in 'bottle'
em syllabic allophone of /m/, as in 'yes 'em'
('yes ma'am')
en syllabic allophone of /n/, as in 'button'
eng syllabic allophone of /ng/, as in 'Washington'
(uncommon)
V. Stops (7)
p 'pop'
b 'bob'
t 'tot'
d 'dad'
k 'kick'
g 'gag'
q glottal stop - allophone of /t/, as in 'Atlanta'
where the first /t/ can be realized as [q].
Also may occur between words in continuous speech,
especially at vowel-vowel boundaries, and at the beginning
of vowel-initial utterances.
Do not mark the stop closures explicitly, as was done in the 1996 project.
VI. Nasals (3)
m 'mom' (nasal stop)
n 'non' (nasal stop)
ng 'sing' (nasal stop - only occurs in syllable-
final position in English)
VII. Affricates (2)
ch 'church'
jh 'judge'
VIII. Fricatives (9)
f 'fief'
v 'verve'
th 'thief'
dh 'they'
s 'sis'
z 'zoo'
sh 'shoe'
zh 'measure'
hh 'hay'
IX. Flaps and trills (3)
dx alveolar flap (allophone of [d] or [t])
nx nasal flap (allophone of [n])
X. Filled Pauses (3)
A filled pause refers to a speech sound that does not denote a specific word,
but is used as a time marker, as in "uh" or "um."
pv filled pause vocalic-like segment. Typically will range
among [ax], [ah], [ix], [ih] but rather indistinct.
Lexical item typically coded as "uh"
pn occasionally the filled pause is an elongated syllabic nasal
as in "hmmmmm," similar to a syllabic nasal [em]. The [m] will
be used regardless of whether the segment sounds like [m], [n]
or [ng]
pv pn filled pause vocalic segment followed by a nasal as in "um"
XI. Non-speech (2)
Non-speech refers to acoustic elements that do not possess any clear
linguistic relevance. Such instances would include a door slam, the phone
dropping on the floor, clearing of the throat and laughter. These non-speech
elements should not be phonetically transcribed, but rather marked with the h#
symbol. The word-level transcription should specify the nature of this
intrusive sound with precision.
h# indicates a non-speech sound other than silence
sil silence within an utterance that does not
correspond to the closure for a stop or affricate
or to some form of non-speech.
XII. Other symbols and diacritics (5)
_cr creaky voice. Mark at segment level only if phonemically
contrastive. Otherwise indicate at comment-line level
? unknown speech sound
_? uncertain about identity of phonetic segment
_! unusual speech pattern - deviates significantly from normal
e.g., weird stress, pronunciation, etc.
_# truncated segment (as when it has been prematurely
cut by the computer segmenter)
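The symbol inventory in Sections I through XII can be collected into a
simple consistency check. The Python sketch below is an illustration under
stated assumptions, not part of the STP tool chain: the function name
invalid_labels is hypothetical, and the sets contain only the symbols
actually listed in this document. The label "tcl" in the usage example is a
TIMIT stop-closure symbol, included to show how an explicitly marked closure
would be flagged.

```python
BASE_SYMBOLS = set(
    # I. Vowels
    "iy ih ey eh ae aa ux ix ax ah uw uh ao ay oy aw ow "
    # II. Liquids, III. Glides
    "l r y w hw lg "
    # IV. Syllabic resonants
    "er el em en eng "
    # V. Stops, VI. Nasals, VII. Affricates
    "p b t d k g q m n ng ch jh "
    # VIII. Fricatives, IX. Flaps
    "f v th dh s z sh zh hh dx nx "
    # X. Filled pauses, XI. Non-speech, XII. Unknown speech sound
    "pv pn h# sil ?".split()
)

# Linguistic diacritics plus those of Section XII, without the underscore.
DIACRITICS = {"ap", "co", "epi", "fr", "n", "on", "vd", "vl",
              "cr", "?", "!", "#"}


def invalid_labels(transcription):
    """Return the labels whose base symbol or diacritics are not listed."""
    bad = []
    for label in transcription.split():
        base, *marks = label.split("_")
        if base not in BASE_SYMBOLS or any(m not in DIACRITICS for m in marks):
            bad.append(label)
    return bad


ok = invalid_labels("w ao r m p_epi th")
flagged = invalid_labels("b ah tcl t")
```

A check of this kind catches only symbols outside the inventory; it says
nothing about whether a listed symbol was the right choice for the sound.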
_______________________________
Bibliography
Godfrey, J. J., Holliman, E. C. and McDaniel, J. (1992) SWITCHBOARD: Telephone
speech corpus for research and development, IEEE ICASSP 1: 517-520.
Revision, February 19, 1997