Dan Ellis <dpwe@icsi.berkeley.edu> 1996aug07
In the Switchboard Transcription Project (STP), trained transcribers listen to snippets of real telephone conversations and use ESPS's xwaves/xlabel software to annotate them with phoneme labels (from a 60-symbol set) plus a few diacritics (for nasalized, creaky, etc. variants). The exercise is about producing large volumes of high-quality data, so an important but delicate stage of the operation is to take the files after the transcribers have finished with them and use automatic means to detect and, as far as possible, correct any obvious or trivial errors before passing the data on to other consumers. This page documents the sequence and variety of verification measures that we have developed over the course of this project.
Here is the situation: You have just decided to close the books on a particular transcription task. Your transcribers have been working on it for weeks, using stp-label-waves etc. The relevant lib/stp-status file details their progress.
To list the RCS files that still have locks set (i.e. are still checked out):
> rlog -L -R RCS/*
Files will have remained checked out because stp-label-waves crashed before checking them back in (hopefully rare). To check them in, the easiest thing is to break the lock, then check in the partially-edited files yourself - they are likely to be what you want. Thus:
> alias rercs "mv \!* /usr/tmp; rcs -l -M \!*; mv /usr/tmp/\!* .; ci -u -mFixBrokenRcsLock \!*"
> rercs 2780-A-0003.phn
RCS file: RCS/2780-A-0003.phn,v
Revision 1.3 is already locked by colleen.
1.3 unlocked
1.3 locked
done
RCS/2780-A-0003.phn,v  <--  2780-A-0003.phn
new revision: 1.4; previous revision: 1.3
done
To find the files that are not in their correct RCS unlocked state, which normally indicates that xlabel has written to them even though they were not checked out, use:
> find . -perm -200 -print
Generally, you want to keep these un-checked-out but modified files: they have been edited with xlabel when they weren't checked out, which at least leaves the trace of a writable, unlocked file. I go through them like this:
> rcsdiff 3994-B-0021.phn
.. just to get a look at the differences relative to the checked-in version. Then, once I have confirmed that this version should be recorded in the RCS log,
> fixx 3994-B-0021.phn
.. where fixx is defined by:
> alias fixx 'mv \!* /usr/tmp; co -l \!*; mv /usr/tmp/\!* .; ci -u \!*'
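When many files turn up, it can help to restrict the search to label files before reviewing them one by one. The following is a minimal sketch in POSIX sh (not the csh used for the aliases above); the function name list_writable_phn is made up for illustration and is not part of the STP scripts.

```shell
# list_writable_phn DIR: print regular *.phn files under DIR whose
# owner-write bit is set (the same -perm -200 test used above).
# Hypothetical helper, not part of the STP scripts.
list_writable_phn() {
    find "$1" -type f -name '*.phn' -perm -200 -print
}
```

Calling `list_writable_phn .` in the data directory gives the candidates to inspect with rcsdiff and, if appropriate, check in with fixx.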
> cd ~dpwe/projects/stp/scripts
> setenv STP_DATADIR /u/stp/data/sri-align-checked   # or whatever
> tclsh
% source stp-setup.tcl
% startup
% dohists
This will produce an output text file, phns.txt, containing all the frequency counts. This is formatted to be printed with genscript via:
> genscript -1R -fCourier7 phns.txt
In the process of parsing all the *.phn files to produce this report, the Tcl routines will also check for non-ASCII characters in the labels. This is a recurrent problem because of xlabel's less-than-perfect treatment of the delete key: sometimes an invisible delete (^?) is inserted into labels. If any are reported, go back and edit them out with a text editor like emacs, or even with xlabel, where they can be seen as an extra space in the underline of the label and deleted with the comment editor window (control-click on the label). Then rerun dohists on the delete-free dataset.
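The stray delete characters can also be located from the shell. The sketch below, in POSIX sh, covers only the DEL byte (octal 177), not the wider non-ASCII check that the Tcl routines perform; files_with_del and strip_del are hypothetical names introduced here, not part of the STP scripts.

```shell
# DEL (octal 177) is the invisible character that the delete key can leave
# behind in a label.
DEL=$(printf '\177')

# files_with_del FILE...: print the names of files containing a DEL byte.
files_with_del() {
    grep -l -- "$DEL" "$@"
}

# strip_del FILE: write a copy of FILE to stdout with every DEL byte removed.
# The original file is left untouched.
strip_del() {
    tr -d '\177' < "$1"
}
```

Because strip_del only writes to standard output, the cleaned text would still have to be put back in place of the label file and checked in by hand.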
% MakeAllWrdHists $sfs
($sfs is set up by the routine 'startup'). This produces a frequency table of words and wordpairs, printed in a similar manner. Watch out for blank words here - very often these are unlabelled "h#"s.
% findphn $sfs "*_vls*"
will list the file IDs (from the $sfs list) which contain phonemes matching the glob-style pattern specifying the (erroneous) "vls" diacritic. It will also display a summary of the context where it occurs.
% foreach f $sfs {set ps [ReadPhns $f]; if {[llength $ps]<4} {puts $f}}
% source syllify.tcl
% MakeSylLabs $sfs "" 0
Running it without the last '0' will actually write the syllable label files to syl/; running it with the last argument as -1 will not write the files, but will report the syllables to the display. Otherwise, as it is running, all the anomalies are printed to the screen. I have been copying these and saving them as lib/syllify.log for each dataset. Some will trigger immediate investigation.
If the syllabifier encounters a phoneme it doesn't recognize, it will report an error then hang. You'll have to interrupt the process and start again (after fixing the phoneme!).
% lwithoutl $fs $sfs
since the startup procedure sets $sfs to a list of all the ids in the stp-status file, and $fs to all the ids represented in the phn/ directory. Often these are empty (93 or 135 byte) label files created when stp-label-waves is invoked for an ID that doesn't exist (a behavior not shared by stp-label-waves.tcl).
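The empty label files mentioned above can also be spotted by size alone. This is a minimal sketch in POSIX sh, assuming the stubs are exactly 93 or 135 bytes long as quoted; list_stub_labels is a hypothetical helper, not part of the STP scripts.

```shell
# list_stub_labels DIR: print *.phn files under DIR that are exactly
# 93 or 135 bytes long (the sizes quoted for empty label files).
# Hypothetical helper; the two sizes are taken from the text, not measured.
list_stub_labels() {
    find "$1" -type f -name '*.phn' \( -size 93c -o -size 135c \) -print
}
```

A size match is only a hint, so each file it reports is worth a quick look before it is removed.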
> cd /u/stp/data/
> tar cf - dev-test/{phn/*.phn,wrd/*.wrd,com/*.com} | gzip -c >! dev-test.tgz
It's that simple.
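Before handing the archive on, a quick count of its contents can confirm that nothing was dropped. The following is a minimal sketch in POSIX sh; count_archived is a hypothetical helper, not part of the STP scripts.

```shell
# count_archived TGZ: count the .phn, .wrd and .com entries in a
# gzip-compressed tar archive. Hypothetical helper.
count_archived() {
    gzip -dc "$1" | tar tf - | grep -c -E '\.(phn|wrd|com)$'
}
```

The number printed by `count_archived dev-test.tgz` can be compared against the number of label files in the dataset directories.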
Last modified:
$Header: /n/crab/da/dpwe/public_html/stp/RCS/stp-verif.html,v 1.1 1997/03/04 00:00:13 dpwe Exp $