Meeting Recorder: Portable Speech Recognition

Requirements

For Meeting Recorder to be useful, it must meet a number of challenging requirements. It must:

Record meetings in natural settings
Support multiple speakers
Allow for easy correction
Provide annotation capabilities
Support indexing and searching
Work stand-alone
Provide collaboration tools

Record Meetings in Natural Settings

One of the primary design goals of Meeting Recorder is to record real meetings in real settings. We would also like to include the ability to record impromptu meetings. The speech recognition must therefore work in uncontrolled acoustic environments including background noise (e.g. fans, music) and reverberation. The vocabulary must be large enough to cover the domain of the meeting. It must work with spontaneous speech. It must be robust to foreign accents.

Support Multiple Speakers

Meetings have multiple participants. The speech recognition system should not only transcribe the speech of different people, who may be talking at once, but should also note when the speaker changes. Ideally it would record not only the speaker change but also the identity of the speaker, which would allow searching by speaker as well as by the contents of the record.

Speech recognition systems can improve their accuracy by adapting to a particular user. Allowing the system to store and exchange user profiles among meeting participants could dramatically improve the quality of the transcript.

Allow for Easy Correction of the Transcript

The difficulty of the speech recognition task ensures that the transcript will contain many errors. Although perfect speech recognition is not required for Meeting Recorder to be useful (see below), the ability to inform the recognizer that it has made an error is still very desirable.

If the system has a good idea of the possible alternatives for a given error, it can present the list to the user. It may be much faster to select from among the alternatives to correct the transcript, rather than to delete the incorrect word and insert the correct one.
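The selection-based correction described above can be sketched as follows. This is a minimal illustration, not the Meeting Recorder design: the `Word` structure, the `correct` function, and the sample alternatives are all hypothetical, and we assume the recognizer supplies a short list of runner-up guesses for each word.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """One recognized word plus the recognizer's runner-up guesses (hypothetical)."""
    text: str
    alternatives: list[str] = field(default_factory=list)


def correct(transcript: list[Word], index: int, choice: int) -> str:
    """Replace the word at `index` with its `choice`-th alternative.

    The rejected word takes the alternative's slot, so the correction can be
    undone by selecting it again. Returns the word that was replaced.
    """
    word = transcript[index]
    old = word.text
    word.text = word.alternatives[choice]
    word.alternatives[choice] = old
    return old


transcript = [Word("wreck", ["recognize", "wreck a"]), Word("speech", ["beach"])]
correct(transcript, 0, 0)  # the user picks the first alternative for word 0
print(" ".join(w.text for w in transcript))  # recognize speech
```

A single selection replaces a delete-and-retype cycle, and the list of rejected words could in principle be fed back to the recognizer for adaptation.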

Also, the recognizer can adapt to the user more easily if the transcript contains few errors. Adaptation will improve the quality of the transcript. Finally, out-of-vocabulary words will appear in the transcript as errors. Correcting these errors provides the ability to add the out-of-vocabulary words to the system.

Provide Annotation Capabilities

We distinguish between correction and annotation. Correction is the process of informing the system that it has made an error. Annotation allows the user to modify, edit, and supplement the output.

Users may want to change the text of the transcript from what was really said to what was meant. They might also want to add textual notes at a particular place in the transcript. Doodling, diagramming, underlining, and circling are all very common activities in pen-and-paper note-taking, and should also be supported.

Synchronizing annotations with the spoken record provides additional context for both the annotations and the content, and allows indexing and searching on the annotations.

Searching the Record

Perhaps the most useful aspect of having a transcript of a meeting is the ability to search the record. We plan to provide for searching by both speaker and text content. Since the actual audio will be stored, the user can play back the portion of the meeting that matches their criteria. We will support both textual and spoken queries.
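As a rough sketch of searching by speaker and text, assume the transcript is stored as time-stamped segments, each labeled with a speaker. The `Segment` structure and `search` function below are illustrative names, not part of any existing system. A query returns time spans, which is what the audio playback would need.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A stretch of speech by one speaker (hypothetical storage format)."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float
    text: str


def search(segments, speaker=None, phrase=None):
    """Return (start, end) spans whose speaker and text match the query.

    Either criterion may be omitted; the phrase match is case-insensitive.
    """
    hits = []
    for seg in segments:
        if speaker is not None and seg.speaker != speaker:
            continue
        if phrase is not None and phrase.lower() not in seg.text.lower():
            continue
        hits.append((seg.start, seg.end))
    return hits


segments = [
    Segment("Ann", 0.0, 4.2, "Let's review the budget"),
    Segment("Bob", 4.2, 9.0, "The budget is due Friday"),
    Segment("Ann", 9.0, 12.5, "Then we ship on Monday"),
]
print(search(segments, speaker="Ann", phrase="budget"))  # [(0.0, 4.2)]
```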

Note that, since the audio record is stored, the transcript need not be perfect. It need only be good enough that queries match where users expect them to match. To refer to the actual content, the user can play back the audio. In addition, the speech recognizer can output not just the most likely transcript, but also a list of the top few most likely hypotheses (called N-best lists in speech recognition). Queries can be made against these N-best lists, rather than just the best hypothesis. Finally, it turns out that the recognizer does much worse with so-called function words (such as "the", "a", "of", "an") than with content words, and text retrieval systems usually ignore function words anyway. For all of these reasons, it is perfectly acceptable for the recognizer to be imperfect. In fact, we expect word-error rates of up to 40% to be acceptable.
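To make the argument concrete, here is a small sketch of a query run against an N-best list with function words ignored, together with the conventional word-error-rate calculation (substitutions, deletions, and insertions divided by the number of reference words). The function names, the tiny function-word list, and the sample hypotheses are assumptions made for illustration only.

```python
FUNCTION_WORDS = {"the", "a", "of", "an"}  # examples from the text; a real list would be longer


def content_words(text):
    """Lower-case the text and drop function words."""
    return [w for w in text.lower().split() if w not in FUNCTION_WORDS]


def matches(query, nbest):
    """True if some single hypothesis contains every content word of the query."""
    wanted = set(content_words(query))
    return any(wanted <= set(content_words(hyp)) for hyp in nbest)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


reference = "move the budget review to friday"
nbest = [
    "move a budget review to fried day",   # best hypothesis
    "move the budget review two friday",
    "moved the budget review to friday",
]
print(matches("the budget friday", nbest[:1]))  # False: best hypothesis alone misses "friday"
print(matches("the budget friday", nbest))      # True: the second hypothesis matches
print(word_error_rate(reference, nbest[0]))     # 0.5
```

In this example the best hypothesis has a word-error rate of 50%, yet the query still finds the segment because a lower-ranked hypothesis contains the content words.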

Self-Contained

If we want to support impromptu meetings in uninstrumented environments, the Meeting Recorder must be portable. Although it is certainly possible to use a hand-held computer as a terminal over a wireless network, we feel that a self-contained solution is better in the long run. The terminal/mainframe model has more components that can fail: the terminal, the network, the wireless link, the mainframe, and the infrastructure. The PC revolution has also shown us the utility of personal computing power.

Nevertheless, we plan to implement the first generation of Meeting Recorder with a wireless network to a workstation running speech recognition. In a later phase of the project, we will incorporate Vector IRAM, a high performance, low power vector processor being developed at UC Berkeley. IRAM in a PDA will allow us to run the speech recognition locally. See the IRAM pages for more information on IRAM.

Collaboration Tools

If multiple participants in a meeting have a Meeting Recorder, various types of collaboration are possible. Offline, after a meeting, transcripts and annotations could be shared among participants. Not only does this promote collaborative note taking (a la NotePals), but it could also improve the quality of the transcript. Since the Meeting Recorder in front of me will probably do a better job transcribing my voice, and the one in front of you will do a better job on yours, combining the two could improve the overall quality of the entire transcript.
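One simple way to combine transcripts along these lines is to prefer, for each utterance, the version produced by the speaker's own device. The sketch below assumes a great deal: that every device labels speakers identically, and that the devices agree on segment start times. Neither is established by the text, and `merge_transcripts` is a hypothetical name.

```python
def merge_transcripts(by_device):
    """Merge per-device transcripts, preferring each speaker's own device.

    `by_device` maps a device owner's name to that device's list of
    (start_seconds, speaker, text) segments. Segments from different devices
    are treated as the same utterance when start time and speaker agree
    (an assumption of this sketch).
    """
    best = {}
    for owner, segments in by_device.items():
        for start, speaker, text in segments:
            key = (start, speaker)
            # Keep the first version seen unless the speaker's own device has one.
            if key not in best or owner == speaker:
                best[key] = text
    return [(start, speaker, text) for (start, speaker), text in sorted(best.items())]


by_device = {
    "ann": [(0.0, "ann", "let's start"), (2.0, "bob", "sounds could")],
    "bob": [(0.0, "ann", "lets tart"), (2.0, "bob", "sounds good")],
}
for start, speaker, text in merge_transcripts(by_device):
    print(f"{start:4.1f}  {speaker}: {text}")
```

A fuller treatment would have to align segments whose boundaries differ between devices and might weigh recognizer confidence rather than device ownership alone.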

Online, one can imagine several collaboration support tools. First, simple "chat" programs could be used, in which one user's annotations are sent to another user. Second, users' vocabularies and speech profiles could be exchanged in order to improve recognition quality. Finally, if all the Meeting Recorders in a room are connected via a wireless network, one could form a microphone array out of all of them. Microphone arrays have proven very effective in improving robustness to noise and reverberation. A microphone array can also be used to detect where in the room a speaker is, which would aid speaker identification.
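A basic ingredient of locating a speaker with a microphone array is estimating how much later a sound arrives at one microphone than at another. The sketch below does this by brute-force cross-correlation over a small range of lags. It is a toy illustration under strong assumptions: the two devices sample in lockstep at a shared 16 kHz rate, and the signal is clean. The function name and the sample values are made up for the example.

```python
SPEED_OF_SOUND = 343.0  # metres per second in air at room temperature (approximate)


def estimate_delay(a, b, max_lag):
    """Return the lag in samples at which signal b best lines up with signal a.

    A positive result means the sound reached microphone b later than
    microphone a. Every lag from -max_lag to +max_lag is tried.
    """
    def score(lag):
        # Correlation of a with b shifted by `lag`, over the overlapping samples.
        return sum(a[i] * b[i + lag] for i in range(len(a)) if 0 <= i + lag < len(b))

    return max(range(-max_lag, max_lag + 1), key=score)


sample_rate = 16000  # samples per second (assumed)
mic_a = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]  # same pulse, three samples later

lag = estimate_delay(mic_a, mic_b, max_lag=5)
extra_path = lag / sample_rate * SPEED_OF_SOUND
print(lag)                   # 3
print(round(extra_path, 4))  # 0.0643 (metres farther from mic_b than from mic_a)
```

Delays between several pairs of microphones, together with known device positions, constrain where the speaker can be. Keeping the clocks of separate hand-held devices synchronized well enough for this is a problem the sketch ignores.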