ICSI Speech FAQ:
2.4 What are the different speech corpora at ICSI or elsewhere?

Answer by: dpwe - 2000-07-24


Current speech recognition works by training statistical pattern recognizers on large numbers of examples, so building a recognizer requires a 'training set'. The larger that set is, and the more closely it resembles the data we actually want to recognize (i.e. the test data), the better things will work.

As a result, we have quite a number of speech corpora (databases of speech waveform files along with, at a minimum, reference word transcripts for each) here at ICSI. They cover different kinds of tasks (limited vs. large vocabulary), styles of speaking (read vs. spontaneous) and acoustic conditions (studio vs. telephone etc.).

Here's a brief overview of the most important/active corpora currently available at ICSI:

Name (path) | Size | Description
Switchboard (/u/drspeech/data/swbd) | 100 hours+ | Current DARPA/NIST transcription task. Recordings are telephone conversations between two strangers talking on an assigned topic. Speech is highly informal and the task is very difficult. Best systems get word error rates (WERs) of around 25%.
Broadcast News (/u/drspeech/data/bn) | 142 hours | DARPA/NIST task; recordings are actual U.S. television and radio news broadcasts, ranging from professionally read speech to on-the-spot interviews against a noisy background to commercials. This wide range of conditions results in systems that work for an unusually wide range of test conditions. Best systems get around 15% WER; we get around 25%.
Aurora (noisy TIDIGITS) (/u/drspeech/data/aurora) | 4 hours | ETSI task defined for noise-robust cell-phone feature evaluation. Takes the original TIDIGITS (continuous digit strings like "one eight five eight") and adds different kinds of noise at different SNRs, from clean to 0 dB, for a multicondition training set. WERs range from under 2% in clean conditions to around 50% at 0 dB SNR.
NUMBERS95 (/u/drspeech/data/NUMBERS95) | 2 hours | Continuous number strings (e.g. "one hundred and eighty five") extracted from more general speech, collected over the phone by OGI. For a long time, this was our standard small-vocabulary task, so we have a lot of different results for it. The best error rate has been pushed below 5%.
BBC news (/u/drspeech/data/bbc) | 48 hours | U.K. English Broadcast News, collected from BBC domestic radio and TV as part of the European THISL project.
TIMIT (/u/drspeech/data/TIMIT) | 2.5 hours | One of the original common databases, a high-quality, phonetically-balanced, multi-speaker, multi-accent corpus. Because the vocabulary is so large, word recognition is not generally reported. The whole set is hand-marked at the phone level, so the data is often used for phone classification results. Other variants include NTIMIT (Network TIMIT, filtered as if collected over the telephone) and HTIMIT (Handset TIMIT, filtered with lots of different telephone handset characteristics).
BeRP (/u/drspeech/data/berp) | About 5 hours | The Berkeley Restaurant Project, a corpus collected at ICSI of people interacting with a 'Wizard-of-Oz' restaurant information kiosk, trying to find out about local restaurants. Formed the basis of our BeRP speech recognition demo.
Meeting Recorder (/u/drspeech/data/mtgrcdr) | 7 hours and counting | New database of real meetings being collected at ICSI (with likely collaborations including U. Washington, SRI, IBM etc.). The key feature is multichannel simultaneous recording, so the same speech can be compared on a headset mic recording and on a tabletop mic recording. Data is currently being transcribed; no recognition results yet.
Phonebook (/u/drspeech/data/phonebook) | ? | Large-vocabulary isolated word task, widely used.
Verbmobil (/u/drspeech/data/vm) | ? | Verbmobil was the German national project on automatic translation of face-to-face speech; the task was making meeting arrangements. We are involved in SmartKom, the follow-on project, working on some kind of English subset (??).
OGI Stories (/u/drspeech/data/stories) | 3 hours | Another database collected and transcribed by OGI. Spontaneous monologues collected over the phone: the caller is instructed just to keep talking for 60 seconds. Full phonetic coverage, natural speech, often used as a 'general speech' pool. (This is the English portion, but many other languages have been collected.)
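
Several of the entries above quote word error rates (WERs). As a rough illustration of what that number means, here is a minimal Python sketch that scores one hypothesis against one reference transcript using the usual definition: substitutions, deletions and insertions from a word-level edit-distance alignment, divided by the number of reference words. The function name word_error_rate is made up for this example, and the sketch assumes plain whitespace tokenization; it is not the scoring procedure used for any of the results quoted above, and real evaluations normally apply text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenizes on whitespace and does no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("one eight five eight", "one eight nine eight"))
```

Because insertions count as errors, the WER computed this way can exceed 100% when the hypothesis is much longer than the reference.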

For more details on what you might find in each of the /u/drspeech/data directories, see the information on the ideal drspeech directory.
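
To check which of the corpora listed above are actually present on a given machine, something like the following Python sketch may help. The directory names are taken from the paths in the overview; the names CORPUS_DIRS, corpus_path and available_corpora are hypothetical helpers invented for this example, not part of any ICSI tool, and the sketch assumes each corpus sits directly under /u/drspeech/data.

```python
from pathlib import Path

# Root directory as given in the overview above.
DATA_ROOT = Path("/u/drspeech/data")

# Corpus name -> subdirectory, copied from the paths in the overview.
CORPUS_DIRS = {
    "Switchboard": "swbd",
    "Broadcast News": "bn",
    "Aurora": "aurora",
    "NUMBERS95": "NUMBERS95",
    "BBC news": "bbc",
    "TIMIT": "TIMIT",
    "BeRP": "berp",
    "Meeting Recorder": "mtgrcdr",
    "Phonebook": "phonebook",
    "Verbmobil": "vm",
    "OGI Stories": "stories",
}


def corpus_path(name, root=DATA_ROOT):
    """Return the expected directory for a corpus name (hypothetical helper)."""
    return Path(root) / CORPUS_DIRS[name]


def available_corpora(root=DATA_ROOT):
    """Return the sorted names of corpora whose directory exists under root."""
    return sorted(
        name for name in CORPUS_DIRS if corpus_path(name, root).is_dir()
    )


if __name__ == "__main__":
    for name in available_corpora():
        print(f"{name}: {corpus_path(name)}")
```

Passing a different root to available_corpora makes it easy to try the sketch on a machine that does not mount /u/drspeech.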



Generated by build-faq-index on Tue Mar 24 16:18:13 PDT 2009