Current speech recognition systems work by training statistical pattern recognizers on large numbers of examples. To build a recognizer we need a 'training set' for this purpose; the larger it is, and the more closely it resembles the data we actually want to recognize (i.e. the test data), the better the system will work.
As a result, we have quite a number of speech corpora (databases of speech waveform files along with, at a minimum, reference word transcripts for each) here at ICSI. They cover different kinds of tasks (limited vs. large vocabulary), styles of speaking (read vs. spontaneous), and acoustic conditions (studio vs. telephone, etc.).
Here's a brief overview of the most important/active corpora currently available at ICSI:
| Name | Size | Description |
|---|---|---|
| Switchboard (`/u/drspeech/data/swbd`) | 100+ hours | Current DARPA/NIST transcription task. Recordings are telephone conversations between two strangers talking on an assigned topic. The speech is highly informal and the task is very difficult; the best systems get WERs of around 25%. |
| Broadcast News (`/u/drspeech/data/bn`) | 142 hours | DARPA/NIST task; recordings are actual U.S. television and radio news broadcasts, ranging from professionally read speech to on-the-spot interviews against noisy backgrounds to commercials. The wide range of conditions results in systems that work across an unusually wide range of test conditions. The best systems get around 15% WER; we get around 25%. |
| Aurora (noisy TIDIGITS) (`/u/drspeech/data/aurora`) | 4 hours | ETSI task defined for evaluating noise-robust cell-phone features. Takes the original TIDIGITS (continuous digit strings like "one eight five eight") and adds different kinds of noise at SNRs from clean down to 0 dB to form a multicondition training set. WERs range from under 2% in clean to around 50% at 0 dB SNR. |
| NUMBERS95 (`/u/drspeech/data/NUMBERS95`) | 2 hours | Continuous number strings (e.g. "one hundred and eighty five") extracted from more general speech, collected over the phone by OGI. For a long time this was our standard small-vocabulary task, so we have many different results on it; the best performance has been pushed below 5% WER. |
| BBC News (`/u/drspeech/data/bbc`) | 48 hours | U.K. English broadcast news, collected from BBC domestic radio and TV as part of the European THISL project. |
| TIMIT (`/u/drspeech/data/TIMIT`) | 2.5 hours | One of the original common databases: a high-quality, phonetically balanced, multi-speaker, multi-accent corpus. Because the vocabulary is so large, word recognition results are not generally reported; instead, since the whole set is hand-marked at the phone level, the data is often used for phone classification results. Variants include NTIMIT (Network TIMIT, filtered as if collected over the telephone) and HTIMIT (Handset TIMIT, filtered with many different telephone handset characteristics). |
| BeRP (`/u/drspeech/data/berp`) | About 5 hours | The Berkeley Restaurant Project: a corpus, collected at ICSI, of people interacting with a 'wizard-of-oz' restaurant information kiosk to find out about local restaurants. Formed the basis of our BeRP speech recognition demo. |
| Meeting Recorder (`/u/drspeech/data/mtgrcdr`) | 7 hours and counting | New database of real meetings being collected at ICSI (with likely collaborators including U. Washington, SRI, IBM, etc.). The key feature is simultaneous multichannel recording, so the same speech can be compared between a headset mic and a tabletop mic. The data is currently being transcribed; no recognition results yet. |
| Phonebook (`/u/drspeech/data/phonebook`) | ? | Large-vocabulary isolated-word task, widely used. |
| Verbmobil (`/u/drspeech/data/vm`) | ? | Verbmobil was the German national project on automatic translation of face-to-face speech; the task was making meeting arrangements. We are involved in SmartKom, the follow-on project, working on some kind of English subset (??). |
| OGI Stories (`/u/drspeech/data/stories`) | 3 hours | Another database collected and transcribed by OGI: spontaneous monologues collected over the phone, where the caller is instructed simply to keep talking for 60 seconds. Full phonetic coverage and natural speech; often used as a 'general speech' pool. (This is the English portion; many other languages have been collected.) |
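The WER (word error rate) figures quoted above are the standard metric for these tasks: the hypothesis transcript is aligned to the reference transcript by minimum edit distance, and the number of word substitutions, deletions, and insertions is divided by the number of reference words. A minimal sketch of that computation (the `wer` helper is illustrative, not part of any ICSI tool):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    computed via Levenshtein edit distance over whitespace-split words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub,            # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(r)][len(h)] / len(r)

# One substitution ("five" -> "nine") plus one deletion ("eight"),
# over 4 reference words: 2 / 4 = 0.5
print(wer("one eight five eight", "one eight nine"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why very hard tasks sometimes report surprisingly large numbers.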
For more details on what you might find in each of the /u/drspeech/data directories, see the information on the ideal drspeech directory.
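The Aurora multicondition set described above is built by adding noise to clean speech at prescribed SNRs. As an illustration of what "adding noise at a given SNR" means (this is a hedged sketch of the general idea, not the ETSI tooling): the noise is scaled so that the ratio of speech power to scaled-noise power, in dB, hits the target, then the two signals are summed.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_scaled_noise) == snr_db,
    then add it to `speech` sample-by-sample (both are lists of floats)."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Gain g satisfying p_speech / (g**2 * p_noise) == 10**(snr_db / 10)
    g = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + g * n for s, n in zip(speech, noise)]

# At 0 dB SNR the scaled noise has the same average power as the speech;
# at 20 dB it has 1/100th the power.
clean = [1.0, -1.0] * 8
hum = [0.5, 0.5] * 8
noisy = mix_at_snr(clean, hum, 0.0)
```

Real corpus construction also involves filtering and level normalization, but the power-matching step is the core of the "clean to 0 dB" conditions quoted for Aurora.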
Previous: 2.3 Why do we use connectionist rather than GMM? - Next: 2.5 Tell me more about the ICSI/Speech file system resources.
Back to ICSI Speech FAQ index
Generated by build-faq-index on Tue Mar 24 16:18:13 PDT 2009