The Meeting Recorder Project: Transcription Methods

- Jane Edwards, ICSI

Overview

The Meeting Recorder Project transcripts are word-level transcripts with speaker identifiers and some additional information: overlaps, interrupted words, restarts, vocalized pauses, backchannels, contextual comments, and nonverbal events (which are further subdivided into vocal types such as cough and laugh, and nonvocal types such as door slams and clicks). Each event is tied to the time line through our modified version of the "Transcriber" interface.
This interface provides an editing space at the top of the screen (for adding utterances, etc.), and the wave form at the bottom, with mechanisms for flexibly navigating through the audio recording, and listening and re-listening to chunks of virtually any size the user wishes. Our modifications involve enabling the user to switch the playback quickly between a number of audio files and using multiple display bands, one for each speaker's channel, rather than only one display band. For detailed information and screen shots of the modified interface, see here.
In the interests of maximal speed, accuracy and consistency, the transcription conventions were chosen so as to be:
1) quick to type,
2) related to standard literary conventions where possible (e.g., - for interrupted word or thought, .. for pause, using standard orthography rather than IPA), and
3) minimalist (requiring no more decisions by transcribers than absolutely necessary).
In addition to words, speaker IDs, and minimal added detail (described in greater detail below), these transcripts are time-synchronized to the digitized audio recording. The interface enables time bins to be encoded by hand easily enough, but this would be too time-consuming for so many hours of meetings.
We sped up the marking of time bins by providing the transcriber with an automatically presegmented version (described elsewhere on this site) of each meeting, containing the segmenter's best guesses of regions containing speech and nonspeech, complete with time-synchronized markings. The accuracy of the segmenter is excellent, which means that transcribers can navigate past the sometimes huge sections of a meeting in which a particular speaker was entirely silent. It also saves time in that the transcribers can merely adjust existing time boundaries rather than having to enter them all by hand.
After the transcribers have finished transcribing, their work is edited for consistency and completeness by a senior researcher. Editing involves checking exhaustive listings of forms in the data, spell checking, and the use of scripts to identify and automatically encode certain distinctions (e.g., the distinction between vocalized nonverbal events, such as cough, and nonvocalized nonverbal events, like door slams).

Transcription Conventions

The transcription conventions described above are being incorporated into the data in three stages. These are described in greater detail below.

Stage 1 Conventions
Designed to be robust against human error (due to simplicity and consistency) and as quick for data entry as possible (due to requiring as few decisions as possible).

Word-level

Orthographic (within limits of pronunciation modelling)
Small set of "spoken forms" (e.g., cuz, gonna, mm-hmm, etc.)

Speaker ID - for example, A: - one or more capital letters followed by a colon. The colon is used only to mark speaker IDs, line-initially or within curly-bracketed comments.
Time bins, roughly, at clean breaks between words or utterances.
Overlaps - are encoded indirectly, in a manner similar to musical score notation. Overlapping events are visible by looking vertically across the green bands in the channeltrans window, and are detectable also by means of overlapping synchronization times.
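Detecting overlaps from synchronization times can be sketched roughly as follows. This is only an illustration: the (speaker, start, end) tuple layout and the function name find_overlaps are assumptions made for the example, not the project's actual data format or tooling.

```python
from itertools import combinations

def find_overlaps(bins):
    """Return pairs of time bins from different speakers whose intervals overlap.

    Each bin is a hypothetical (speaker, start, end) tuple, times in seconds.
    """
    overlaps = []
    for a, b in combinations(bins, 2):
        if a[0] == b[0]:
            continue  # same speaker: not an overlap between speakers
        # Two intervals overlap when each one starts before the other ends.
        if a[1] < b[2] and b[1] < a[2]:
            overlaps.append((a, b))
    return overlaps

bins = [("A", 0.0, 2.5), ("B", 2.0, 3.0), ("A", 3.0, 4.0)]
print(find_overlaps(bins))
```

Bins that merely touch (one ends exactly where the next begins) are not counted as overlapping in this sketch.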
Fragments (restarts, interruptions, etc.)

Words - hyphen is attached to the end of the fragment - e.g., "th-"
Larger structures - hyphen is preceded and followed by a space - e.g., "The only - I mean, the first"
Where they co-occur (i.e., disruption of larger structure which begins with a word fragment), use the word fragment convention.
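The two hyphen conventions above can be told apart mechanically. The following is a minimal sketch, assuming whitespace-separated tokens; the function name classify_hyphens and the label strings are hypothetical and chosen only for this example.

```python
def classify_hyphens(line):
    """Label fragment markers in a Stage 1 text line.

    Returns (token, kind) pairs, where kind is "word_fragment" for a token
    ending in a hyphen (e.g. "th-") and "structure_break" for a free-standing
    hyphen. Word-internal hyphens (e.g. "mm-hmm") are not fragment markers.
    """
    found = []
    for token in line.split():
        if token == "-":
            found.append((token, "structure_break"))
        elif len(token) > 1 and token.endswith("-"):
            found.append((token, "word_fragment"))
    return found

print(classify_hyphens("The only - I mean, the first"))
print(classify_hyphens("I th- think mm-hmm"))
```

Because a disrupted larger structure that begins with a word fragment is written with the word fragment convention, such cases are reported as "word_fragment" here.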
Nonverbal events and contextual comments - use curly brackets for all of them - e.g., {laugh}, {door slam}, {microphone noise}, {referring to speaker A}
Pause - .. - may occur alone in a time bin, or on a text line separated from neighboring words or comments by one space on either side of the dots: I see what you .. mean.
Noncanonical pronunciation - marked by prepending an apostrophe to a word. For example, 'microphone {PRN} means that the speaker pronounced the word in a way that seems unlikely to be recognized by the recognizer. This is used for speech errors, not for a non-native speaker's consistent versions of a word (since these could in principle be recognized by a suitably trained recognizer, whereas true speech errors could not).
Uncertainty - () - If a string is totally indecipherable, use (??). If the transcriber thinks the string might be "looks like velcro" but isn't sure, the transcriber should type these words, but enclose them in parentheses: (looks like velcro). If the transcriber is unsure of the words, but thinks it was n syllables long, use (nx), e.g., (1x) for one syllable, (3x) for three syllables, and so on.
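The three uncertainty forms can be distinguished with a short pattern match. This is a sketch under stated assumptions: the function name parse_uncertainty and the category labels are hypothetical, and parentheses are assumed not to nest.

```python
import re

# Matches one parenthesized span with no nested parentheses.
PAREN = re.compile(r"\(([^()]*)\)")

def parse_uncertainty(line):
    """Return a list of (kind, value) pairs for each parenthesized span."""
    results = []
    for match in PAREN.finditer(line):
        content = match.group(1).strip()
        if content == "??":
            results.append(("indecipherable", None))
        elif re.fullmatch(r"\d+x", content):
            # (nx): unsure of the words, but n syllables long
            results.append(("syllables", int(content[:-1])))
        else:
            # the transcriber's best guess at the words
            results.append(("best_guess", content))
    return results

print(parse_uncertainty("it (looks like velcro) or (??) maybe (3x)"))
```

A real script would also need to decide how to treat parentheses that happen to occur inside curly-bracketed comments; this sketch does not handle that case.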
Contrastive stress or Emphatic stress: *
This is *Adam, talking on mike *one, channel *zero.
or
I *do think so.

Stage 2 Conventions
Stage 2 uses exhaustive listings, spell checking, sed, etc. to massage the output of Stage 1.

Comments:

PRN (noncanonical, with respect to orthographic expectation): 'them {PRN "em"}
NVC (produced by nonvocal means): {NVC door slam}
VOC (produced by the vocal tract): {VOC laugh}
QUAL (comments on situation or speech): {QUAL the last two words were spoken while laughing}, {QUAL while whispering}, {QUAL referring to earlier meeting}, {QUAL End of meeting}
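As a rough illustration of how a script might add the VOC and NVC tags to Stage 1 comments, consider the sketch below. It is written in Python rather than sed, and it is not the project's actual script: the lookup tables contain only events named in this document, and the names VOCAL, NONVOCAL, and tag_events are hypothetical.

```python
import re

# Illustrative lookup tables; only events named in this document are listed.
VOCAL = {"cough", "laugh"}
NONVOCAL = {"door slam", "click"}

# Matches one curly-bracketed comment with no nested brackets.
COMMENT = re.compile(r"\{([^{}]*)\}")

def tag_comment(match):
    """Rewrite a single {comment} with a VOC or NVC tag when it is known."""
    text = match.group(1).strip()
    if text in VOCAL:
        return "{VOC " + text + "}"
    if text in NONVOCAL:
        return "{NVC " + text + "}"
    return match.group(0)  # leave anything else unchanged for manual review

def tag_events(line):
    """Apply tag_comment to every curly-bracketed comment in a line."""
    return COMMENT.sub(tag_comment, line)

print(tag_events("yeah {laugh} okay {door slam} {referring to speaker A}"))
```

Comments that are already tagged, or that are not in either table, pass through untouched, so the function can be run repeatedly without double-tagging.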

Acronyms or "techie" terms:

spelled: I_C_S_I, F_T_P, H_drive
spoken as words (not used consistently, yet): _ICSI
not yet checked: ICSI (For acronyms which can be either spoken or spelled, no underscore means "not yet checked.")
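The underscore conventions lend themselves to a simple token check. The sketch below is an assumption-laden illustration: the function name acronym_status and its return labels are hypothetical, and treating any all-capital token of two or more letters as an unchecked acronym is a simplification.

```python
def acronym_status(token):
    """Classify a token according to the underscore conventions above."""
    if token.startswith("_") and len(token) > 1:
        return "spoken_as_word"      # e.g. _ICSI
    if "_" in token:
        return "spelled"             # e.g. I_C_S_I, H_drive
    if len(token) > 1 and token.isalpha() and token.isupper():
        return "not_yet_checked"     # e.g. ICSI
    return None                      # ordinary word

for t in ["I_C_S_I", "H_drive", "_ICSI", "ICSI", "meeting"]:
    print(t, acronym_status(t))
```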

Small set (approx. 20) of spoken forms (cuz, etc., listed above), plus the following: ah, eh, ehm, hmm, huh, mm-hmm, nn-hnn, mmm, nnn, nope, nuh-uh, oh, ooo (rhymes with "cool"), oops, oy, ugh, uh, uh-huh, uh-uh (meaning no), um, whoa!, yeah, yep.
Numbers - all spelled out - five, twenty-nine
Digits - To enable speech from the digits task to be separated from spontaneous speech, the tag {DGTS} is added to the end of words produced during the digits task. Lines which contain only pauses are not tagged in this way, but can be detected as pauses occurring between {DGTS} lines.
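Separating digits-task lines from spontaneous speech might look like the following sketch. It assumes that each transcript line is a plain string, that a digits-task line contains {DGTS} somewhere, and that a pause-only line is exactly "..". The function name split_digits_task is hypothetical.

```python
def split_digits_task(lines):
    """Partition lines into (digits_task, spontaneous) lists.

    Pause-only lines are assigned to the digits task only when the nearest
    non-pause lines on both sides carry the {DGTS} tag.
    """
    def is_digits(line):
        return "{DGTS}" in line

    def is_pause_only(line):
        return line.strip() == ".."

    digits, spontaneous = [], []
    for i, line in enumerate(lines):
        if is_digits(line):
            digits.append(line)
        elif is_pause_only(line):
            # Find the nearest non-pause neighbour on each side, if any.
            before = next((l for l in reversed(lines[:i]) if not is_pause_only(l)), None)
            after = next((l for l in lines[i + 1:] if not is_pause_only(l)), None)
            if before is not None and after is not None and is_digits(before) and is_digits(after):
                digits.append(line)
            else:
                spontaneous.append(line)
        else:
            spontaneous.append(line)
    return digits, spontaneous

lines = ["so let's start", "five nine {DGTS}", "..", "two seven {DGTS}", "..", "okay thanks"]
digits, spontaneous = split_digits_task(lines)
print(digits)
print(spontaneous)
```

In this example the first pause falls between two {DGTS} lines and is grouped with the digits task, while the second pause borders spontaneous speech and is not.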

Stage 3 Conventions

A third stage of conventions involves extending XML encoding to capture events spanning multiple channels (e.g., door slams), or subsets of channels (independent simultaneous conversations) or larger stretches of times (e.g., topics). At present these are handled either in {QUAL} comments by individual speakers, or on the "default" channel of the multi-channel representation. The "default" channel is time synchronized with the sound wave but is not identified with a particular speaker. XML tags will enable these types of information to have an independent status from specific channels and time bins.

REFERENCE
Edwards, Jane A. (in press) "The Transcription of Discourse." In D. Tannen, D. Schiffrin, and H. Hamilton (eds). The Handbook of Discourse Analysis. NY: Blackwell.
