How to verify STP datasets

Dan Ellis <dpwe@icsi.berkeley.edu> 1996aug07

Introduction

In the Switchboard Transcription Project (STP), trained transcribers listen to snippets of real telephone conversations and use ESPS's xwaves/xlabel software to annotate them with phoneme labels (from a 60-symbol set) plus a few diacritics (for nasalized, creaky, etc. variants). The goal is to produce large volumes of high-quality data, so an important but delicate stage of the operation is to take the files after the transcribers have finished with them and use automatic means to detect and, as far as possible, correct any obvious or trivial errors before passing the data on to other consumers. This page documents the sequence and variety of verification measures that we have developed over the course of this project.

Ten Easy Steps to Transcription Verification

Here is the situation: You have just decided to close the books on a particular transcription task. Your transcribers have been working on it for weeks, using stp-label-waves etc. The relevant lib/stp-status file details their progress.

Make sure they finish up and check back in their assignments. Keep badgering them until there are no "in-process" labels in the stp-status file.
Check for RCS-checked-out or RCS-missed files in the phn, wrd and com directories. To get a list of all the RCS checked-out files in a directory, do:

> rlog -L -R RCS/*

Files will have remained checked out because stp-label-waves crashed before checking them back in (hopefully rare). To check them in, the easiest thing is to break the lock, then check in the partially-edited files yourself - they are likely to be what you want. Thus:

> alias rercs "mv \!* /usr/tmp; rcs -l -M \!*; mv /usr/tmp/\!* .; ci -u -mFixBrokenRcsLock \!*"
> rercs 2780-A-0003.phn
RCS file: RCS/2780-A-0003.phn,v
Revision 1.3 is already locked by colleen.
1.3 unlocked
1.3 locked
done
RCS/2780-A-0003.phn,v  <--  2780-A-0003.phn
new revision: 1.4; previous revision: 1.3
done

To find files that are not in their correct RCS unlocked state (normally because xlabel has written to them even though they were not checked out), look for writable files:

> find . -perm -200 -print

Generally, you want to keep these un-checked-out but modified files - they have been edited with xlabel when they weren't checked out, which at least leaves the trace of a writable, unlocked file. I go through them like this:

 > rcsdiff 3994-B-0021.phn

.. just to get a look at the differences relative to the checked-in version, then when I have confirmed that this one should be recorded in the RCS log,

 > fixx 3994-B-0021.phn

.. where fixx is defined by:

 > alias fixx 'mv \!* /usr/tmp; co -l \!*; mv /usr/tmp/\!* .; ci -u \!*'

Now you can start looking for typos in the datasets. The Tcl routine MakeAllPhnHists in the file stp-readphn.tcl builds frequency-of-occurrence tables for the base phonemes, the diacritics, the full phonemes and the base-phoneme pairs. Phonemes or diacritics that occur only once or twice in the corpus are likely typos, so they should be easy to detect. Run it as follows:

> cd ~dpwe/projects/stp/scripts
> setenv STP_DATADIR /u/stp/data/sri-align-checked      # or whatever
> tclsh
% source stp-setup.tcl
% startup
% dohists

This will produce an output text file, phns.txt, containing all the frequency counts. This is formatted to be printed with genscript via:

> genscript -1R -fCourier7 phns.txt
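The idea behind the frequency tables can be sketched in a few lines of Python. This is only an illustration, not the code in stp-readphn.tcl: the function name rare_labels, the max_count parameter and the toy data are all assumptions made for the example.

```python
from collections import Counter


def rare_labels(labels_by_file, max_count=2):
    """Return {label: count} for labels seen at most max_count times overall.

    labels_by_file maps a file ID to the list of phoneme labels in that file.
    (Hypothetical helper, not part of the STP scripts.)
    """
    counts = Counter()
    for labels in labels_by_file.values():
        counts.update(labels)
    return {lab: n for lab, n in counts.items() if n <= max_count}


# Toy data (made up): the "_vls" diacritic appears only once, so it stands out.
toy = {
    "2780-A-0003": ["h#", "dh", "ax", "k", "ae", "t", "h#"],
    "3994-B-0021": ["h#", "k", "ae", "t_vls", "t", "ax", "dh", "h#"],
}
suspects = rare_labels(toy, max_count=1)
print(suspects)
```

On a real corpus the threshold would need tuning, since genuinely rare phonemes will also show up in the list.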

In the process of parsing all the *.phn files to produce this report, the Tcl routines will also check for non-ASCII characters in the labels. This is a recurrent problem because of xlabel's less-than-perfect treatment of the delete key - sometimes an invisible delete (^?) is inserted into labels. If any are reported, go back and edit them out with a text editor like emacs, or even with xlabel, where they can be seen as an extra space in the underline of the label and deleted with the comment editor window (control-click on the label). Then rerun dohists on the delete-free dataset.
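A standalone scan for stray delete characters might look like the sketch below. It is an assumption about what such a check could do, not a description of the Tcl routines: the name find_nonprinting is hypothetical, and the rule used here (flag anything outside printable ASCII, except tab) is one plausible choice.

```python
def find_nonprinting(text):
    """Return (line_number, column, code_point) for each suspicious character.

    Flags control characters, DEL (0x7F, shown as ^?) and anything above
    ASCII. Tabs are allowed. Line and column numbers start at 1.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                continue
            code = ord(ch)
            if code < 0x20 or code >= 0x7F:
                hits.append((lineno, col, code))
    return hits


# Made-up sample: the second label contains an invisible DEL between d and h.
sample = "h#\nd\x7fh\n"
hits = find_nonprinting(sample)
print(hits)
```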

You can do a similar thing for the word transcriptions by running MakeAllWrdHists:

% MakeAllWrdHists $sfs

($sfs is set up by the routine 'startup'). This produces a frequency table of words and wordpairs, printed in a similar manner. Watch out for blank words here - very often these are unlabelled "h#"s.

If the word or phoneme frequency reports show something anomalous that you wish to investigate, you can quickly find which files contain a given label with findphn or findwrd. For instance:

% findphn $sfs "*_vls*"

will list the file IDs (from the $sfs list) which contain phonemes matching the glob-style pattern specifying the (erroneous) "vls" diacritic. It will also display a summary of the context where it occurs.

You might want to search for files containing a very small number of phonemes - a sign of a worthless file or an incorrect transcription. You can do this by running your own little Tcl loop:

% foreach f $sfs {set ps [ReadPhns $f]; if {[llength $ps]<4} {puts $f}}

Next, you can apply some higher-level rules to the patterns in the phoneme files by running the syllabifier over the entire dataset. This reports on a variety of suspect constructs, such as repeated phones and extrasyllabic stops (stops which could not be attached to a syllable). Run this by:

% source syllify.tcl
% MakeSylLabs $sfs "" 0

Running it without the last '0' will actually write the syllable label files to syl/; running it with the last argument as -1 will not write the files, but will report the syllables to the display. In either case, as it runs, all the anomalies are printed to the screen. I have been copying these and saving them as lib/syllify.log for each dataset. Some will trigger immediate investigation.

If the syllabifier encounters a phoneme it doesn't recognize, it will report an error then hang. You'll have to interrupt the process and start again (after fixing the phoneme!).
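As an illustration of one of the constructs mentioned above, repeated phones, here is a minimal Python sketch. It is not the rule used in syllify.tcl, whose details are not given here; in particular, comparing full labels including diacritics is an assumption, and repeated_phones is a hypothetical name.

```python
def repeated_phones(phones):
    """Return (index, phone) wherever a phone equals the one just before it.

    The index is the position of the second member of the pair, counting
    from 0. Labels are compared as whole strings, diacritics included.
    """
    return [
        (i, cur)
        for i, (prev, cur) in enumerate(zip(phones, phones[1:]), start=1)
        if prev == cur
    ]


# Made-up example: "ae" has been entered twice in a row.
repeats = repeated_phones(["h#", "k", "ae", "ae", "t", "h#"])
print(repeats)
```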

That's it for verification. Remove any spurious files (i.e. files that occur in the phn, wrd, com directories that aren't named in the stp-status file). You can find these with something like:

% lwithoutl $fs $sfs

since the startup procedure sets $sfs to a list of all the ids in the stp-status file, and $fs to all the ids represented in the phn/ directory. Often these are empty (93 or 135 byte) label files created when stp-label-waves is invoked for an ID that doesn't exist (a behavior not shared by stp-label-waves.tcl).
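The comparison itself is just a list difference. The sketch below assumes that lwithoutl returns the elements of its first list that are absent from its second, which is inferred from the usage above rather than from its source; ids_without is a hypothetical Python stand-in.

```python
def ids_without(fs, sfs):
    """Return the IDs in fs that do not appear in sfs, keeping the order of fs."""
    known = set(sfs)
    return [f for f in fs if f not in known]


# Made-up IDs: one file in phn/ is not named in the status file.
in_phn_dir = ["2780-A-0003", "2780-A-0004", "3994-B-0021"]
in_status = ["2780-A-0003", "3994-B-0021"]
spurious = ids_without(in_phn_dir, in_status)
print(spurious)
```

Swapping the arguments gives the opposite check: IDs named in the status file that have no label file.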

Package up the label files ready for sending. I usually do this without the bulky *.wav files and without the RCS histories, e.g.

> cd /u/stp/data/
> tar cf - dev-test/{phn/*.phn,wrd/*.wrd,com/*.com} | gzip -c >! dev-test.tgz

You're set!

It's that simple.

1996aug21

Running through the sri-align-checked data, which has a number of poorly-segmented files that were rejected by the transcribers, I thought of another check - do a grep for 0000 (four zeros) in the label files, since the 'starting times' derived from the forced alignment have 10ms resolution (but are printed to 6 decimal places in the label files). A file in which more than a few of the times end in four zeros looks suspiciously as if the labels have not been adjusted. Unfortunately, there are lots of cases of this...

1996aug22

Check out /u/dpwe/projects/stp/scripts/stp-verify.tcl for two new and useful functions: CheckLabelCount goes through a list of file IDs reporting those that have fewer than a certain number of labels in the (phn) file; FindUnmovedLabels checks a list of IDs for files that have more than (50%) of their time labels ending in 0000 (unmoved forced-alignment times).
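The unmoved-label test can be sketched as follows. This is not the code in stp-verify.tcl: the function names and the way times are passed in (as the strings printed in the label file, to 6 decimal places) are assumptions for the example; only the 0000 suffix test and the 50% figure come from the notes above.

```python
def unmoved_fraction(times):
    """Fraction of time strings that end in 0000.

    times holds the times as printed in the label file, e.g. "0.120000".
    Returns 0.0 for an empty list.
    """
    if not times:
        return 0.0
    unmoved = sum(1 for t in times if t.endswith("0000"))
    return unmoved / len(times)


def looks_unmoved(times, threshold=0.5):
    """True if more than threshold of the times still sit on the 10ms grid."""
    return unmoved_fraction(times) > threshold


# Made-up times: three of the four still end in four zeros.
toy_times = ["0.120000", "0.250000", "0.313742", "0.480000"]
print(unmoved_fraction(toy_times), looks_unmoved(toy_times))
```

A hand-placed boundary can land on the 10ms grid by chance, which is why the test uses a fraction of the file rather than a single hit.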

1996oct31

I went through all the files again prior to releasing the 'final' versions of the 1995 STP files (dev-test, trn-chunk1.[12], sri-align.1). I became rather stricter with the phonemes: only the ICSI-56 plus "?" and "h#", so there should only be 58 (57 if no /eng/) - I changed a few "!"s to "?"s. For diacritics, just 15 are allowed (14 if no _cl), including both _! and _?. I also fixed any remaining /dx/s without a _d or _t appendage, after getting most of these patched from JHU. For the words, I made sure all truncations were indicated by complete words with _# suffixed, with comments to indicate aborted words. I tried to remove all parenthesized words - if the word was ambiguous, I replaced it with "?" and added a comment. I also fixed all the spurious quotes around words containing apostrophes, using stp-readphn.tcl:FixDoubleQuotes (see also FixHeaderSlashes). Verification of words included using emacs's spell-buffer on the word-frequencies report, but it seems the spellings had already been corrected. All SIL's in phn and wrd files were changed to H#. I also tried to standardize on "'CAUSE" not "CAUSE", "CUZ" or "COS", but I'm afraid it may have been more often written "BECAUSE".
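The word-level substitutions described in this entry amount to a small lookup table. The sketch below is only an illustration of that idea, not how the changes were actually made; the names NORMALIZE and normalize_word are hypothetical, and the table contains only the replacements named above.

```python
# Replacements taken from the notes above; anything else passes through.
NORMALIZE = {
    "SIL": "H#",
    "CAUSE": "'CAUSE",
    "CUZ": "'CAUSE",
    "COS": "'CAUSE",
}


def normalize_word(word):
    """Return the standardized form of a word label, or the word unchanged."""
    return NORMALIZE.get(word, word)


# Made-up word sequence.
words = ["SIL", "CUZ", "I", "BECAUSE", "H#"]
normalized = [normalize_word(w) for w in words]
print(normalized)
```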

1996nov01

Things to tell future transcribers:

Be sure to mark truncated (either end) or aborted words as "WORD_#" (not forgetting the underscore). If the word cannot be determined, label it "?".
Have consistent conventions for transcribed filled pauses. JHU apparently prefers "UHHUH", "UMHUM" and "UHUH" (meaning "no").
I had started marking words that were evidently present in the sentence, but marked by pauses or other indistinct acoustics, as the word in parentheses. Don't do this; if the word belongs in the transcription, put it in, and attempt to allocate some time to it (it's marked by a pause, right?). You can add a comment that it exists only as an elongation. I didn't even mark them as "_!" (unusual pronunciation), although I suppose that would be reasonable.
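Some of these conventions can be checked mechanically. The following Python sketch covers three of them (blank labels, parenthesized words, and a truncation "#" with no underscore); the function name and the exact rules are assumptions made for illustration, and "H#" is exempted because it is the silence label.

```python
def word_label_problems(word):
    """Return a list of convention problems found in one word label."""
    problems = []
    if word.strip() == "":
        problems.append("blank label")
    if "(" in word or ")" in word:
        problems.append("parenthesized word")
    # Truncated or aborted words should end in "_#"; "H#" is the silence label.
    if word.endswith("#") and not word.endswith("_#") and word.upper() != "H#":
        problems.append("truncation marker without underscore")
    return problems


# Made-up labels, one per convention.
for w in ["WORK_#", "WORK#", "(THE)", "H#", "?", ""]:
    print(repr(w), word_label_problems(w))
```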

Last modified:
$Header: /n/crab/da/dpwe/public_html/stp/RCS/stp-verif.html,v 1.1 1997/03/04 00:00:13 dpwe Exp $

dpwe@icsi.berkeley.edu 1996aug12