Some resources for noise-robust and channel-robust speech processing

This page is a collection of links to software and data resources for research on automatic speech recognition (ASR) that is robust to background noise and to convolutional distortions such as reverberation. Some of the linked resources are also relevant to research on enhancing speech for human listening.

This page has now been replaced by the Resources listings at www.isca-students.org. However, this page should stay up, because its URL has been referenced in at least one published paper. (That paper, in IEEE Trans. Speech and Audio Processing in 2006, referenced it as the place to download the Qualcomm-ICSI-OGI tools.) As of February 2007, every working link on this page is also included in the ISCA listings (a few dead links here might start working again), and the ISCA listings contain many links that are not on this page.

Successful approaches to robust ASR may combine more than one robustness technique. Because much signal processing code has a simple data flow, different tools can often be used together simply by running them in sequence, using pipes or intermediate files. Two convenient intermediate file formats are HTK feature files and waveforms.

Many of the tools listed here operate on HTK feature files or can output them. The HTK format is a useful intermediate format for features because it is simple to read, write, and convert to other formats, and because HTK is popular.

Some algorithms can also be used with other tools, without any modification to those tools, by running the algorithms speech-enhancement-style: they output processed waveforms, which the other tools treat as they would any other audio input file. Using processed waveforms as an intermediate format also allows listening, waveform plotting, and spectrogram plotting, which may lead to useful insights.

When using processed waveforms as an intermediate format, it is often safest to store them in floating point rather than in the usual 16-bit integer format, to reduce roundoff error and eliminate the risk of overflow (numbers too large to represent in the 16-bit format) or underflow (numbers too small to represent in the 16-bit format). Because processing algorithms may increase the loudness of some waveforms or introduce quiet details such as noise floors, overflow or underflow can occur with a 16-bit integer format even if the original waveforms were well scaled for that format.
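The overflow and underflow risk described above can be illustrated with a small Python sketch. It is not taken from any of the tools listed on this page. The function names, the full-scale convention of +/-1.0, and the sample values are assumptions chosen for illustration, and only the standard library is used.

```python
import struct

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16(samples):
    """Quantize float samples (full scale = +/-1.0) to 16-bit integers.

    Values are rounded to the nearest integer step and clipped to the
    int16 range, as a 16-bit file writer has to do.
    """
    out = []
    for x in samples:
        v = int(round(x * 32768.0))
        out.append(max(INT16_MIN, min(INT16_MAX, v)))
    return out


def from_int16(values):
    """Convert 16-bit integer samples back to floats."""
    return [v / 32768.0 for v in values]


def pack_float32(samples):
    """Serialize samples as raw little-endian 32-bit floats."""
    return struct.pack("<%df" % len(samples), *samples)


def unpack_float32(data):
    """Read back raw little-endian 32-bit floats."""
    return list(struct.unpack("<%df" % (len(data) // 4), data))


# A hypothetical processed waveform: one sample boosted past full scale
# (overflow risk) and one quiet detail far smaller than one 16-bit step
# of 1/32768 (underflow risk).
processed = [0.5, 1.5, -0.25, 1e-6]

via_int16 = from_int16(to_int16(processed))
via_float = unpack_float32(pack_float32(processed))

print("16-bit round trip:", via_int16)
print("float round trip: ", via_float)
```

In the 16-bit round trip the boosted sample is clipped just below full scale and the quiet sample becomes exactly zero, while the floating-point round trip keeps both (up to 32-bit float precision).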

Enhancement/compensation software for ASR and human listening:

Software for ASR:

Software for signal quality measurement:

Software and data for reproducing or simulating acoustic conditions:

Other:

VOICEBOX

The VOICEBOX Matlab toolbox for audio processing includes a noise reduction routine (specsubm), routines to read and write audio files from Matlab, and many other things.

Beamforming Toolkit

The Karlsruhe beamforming toolkit: "btk is a toolkit that provides a basis for the implementation of powerful beamforming algorithms. btk uses Python as a scripting language for ease of control and modification. The capacity to efficiently perform advanced numerical computations is provided by Numeric Python (NumPy), the GNU Scientific Library (GSL), as well as a few extra algorithms we've implemented ourselves."

Qualcomm-ICSI-OGI front end, speech detection, and noise reduction

Click here for an archive containing source code and documentation for the Qualcomm-ICSI-OGI noise-robust front end described in the ICSLP 2002 paper by Adami et al. The archive also contains tools for using the speech detection, Wiener filter noise reduction, or nonspeech frame dropping features of the front end independently of other features. The noise reduction can be used independently of other components to produce noise-reduced waveforms.

Matlab noise reduction tools by Patrick Wolfe

Matlab source code for various noise reduction algorithms is available here.

UCL Enhance

Software and literature references for this speech enhancement tool are available here.

CtuCopy

CtuCopy is an open source tool for speech enhancement and ASR feature extraction. "CtuCopy acts as a filter with speech waveform file(s) at the input and either a speech waveform file(s) or feature file(s) at the output." As of version 3.0.7, it can be used for several different noise reduction techniques in the spectral subtraction family, and several ASR feature extraction algorithms. It was written by Petr Fousek of the Czech Technical University in Prague's Speech Processing and Signal Analysis Group. As of this writing, CtuCopy version 3.0.7 is available at http://www.idiap.ch/~fousek/ctucopy/ and in the future it should be available in the download section at http://noel.feld.cvut.cz/speechlab (there is currently an older version of CtuCopy at the second link).

Trausti Kristjansson

Trausti Kristjansson created this page (while at the University of Toronto) which provides Matlab source code for (1) spectral subtraction noise removal, (2) the Algonquin variational inference algorithm for removing noise and channel effects, and (3) the Recognition Analyzer diagnostic tool which displays features, HTK log likelihoods, and HTK state sequences and can create resynthesized audio from MFCC features.

Marc Ferras' code for multi-microphone speech enhancement

This page provides source code for several blind multi-microphone speech enhancement techniques. These were implemented by Marc Ferras while pursuing his master's thesis on multi-microphone signal processing for automatic speech recognition in meeting rooms.

The RESPITE CASA Toolkit

The RESPITE CASA Toolkit is a toolkit for Computational Auditory Scene Analysis (CASA). This includes a tutorial on using the toolkit for missing data speech recognition.

Seneff auditory model

This page has source code for an implementation of Stephanie Seneff's auditory model front end for ASR.

RASTA and MSG

C/C++ implementations of the RASTA and MSG (modulation-filtered spectrogram) algorithms for robust feature extraction are available as part of this ICSI speech software package. There is also this older page for RASTA at ICSI. There is a Matlab implementation of RASTA at Dan Ellis' Matlab page.

MVA (Mean, Variance, ARMA)

This page provides source code for this technique proposed by Chia-Ping Chen and Jeff Bilmes which post-processes noisy cepstra by doing mean and variance normalization (M for mean, V for variance) and bandpass modulation filtering (A for ARMA).
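The three MVA steps can be sketched in a few lines of Python. This is not the code distributed on the linked page. The function names, the filter order, and the exact form of the ARMA filter (each output frame averages the previous outputs with the current and following inputs) are assumptions based on my reading of the published description, so check the authors' code and papers before relying on any detail.

```python
import math


def mean_variance_normalize(frames):
    """Normalize each feature dimension to zero mean and unit variance
    over the utterance. `frames` is a list of equal-length vectors."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        # Leave a constant dimension unscaled to avoid dividing by zero.
        stds.append(math.sqrt(var) if var > 0 else 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


def arma_filter(frames, order=2):
    """Smooth each dimension over time with an ARMA filter (assumed form).

    Output frame t is the average of the previous `order` outputs and the
    current and next `order` inputs. Frames within `order` of either end
    of the utterance are copied unchanged.
    """
    n, dim = len(frames), len(frames[0])
    out = [list(f) for f in frames]
    for t in range(order, n - order):
        for d in range(dim):
            past = sum(out[t - k][d] for k in range(1, order + 1))
            ahead = sum(frames[t + k][d] for k in range(0, order + 1))
            out[t][d] = (past + ahead) / (2 * order + 1)
    return out


def mva(frames, order=2):
    """Mean and variance normalization followed by ARMA filtering."""
    return arma_filter(mean_variance_normalize(frames), order)


# Hypothetical two-dimensional "cepstra" for a nine-frame utterance.
cepstra = [[float(t), 10.0 + 2.0 * (t % 3)] for t in range(9)]
for frame in mva(cepstra):
    print(["%.3f" % v for v in frame])
```

Plain lists keep the sketch free of dependencies; with real feature files an array library would be the usual choice.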

Gabor filter analysis for speech recognition

This page provides articles, filter definitions, software tools, and discussion related to automatic speech recognition (ASR) with Gabor filters. A Matlab package for feature selection using the Feature Finding Neural Networks (FFNN) approach proposed by Tino Gramß (Gramss) is available as well. (This FFNN package was used to select Gabor filters for ASR.)

NIST Speech Quality Assurance Package (SPQA)

The SPQA package includes SNR measurement tools which do not require a clean audio reference.

Objective Speech Quality Assessment

The CSLU Robust Speech Processing Laboratory software repository page hosts the Objective Speech Quality Assessment package (developed by Bryan Pellom, and analyzed in an ICSLP 98 paper by Hansen and Pellom) which calculates various metrics of speech quality based on comparing clean audio with noisy or noise-reduced audio.

CTU snr tool

This open source tool can be used both to measure the SNR of signals and to mix noise into signals at a specified SNR. It is available from the Czech Technical University in Prague's Speech Processing and Signal Analysis Group in the download section at http://noel.feld.cvut.cz/speechlab.

FaNT tool for adding noise or telephone characteristics to speech

The FaNT (Filtering and Noise-adding Tool) tool can be used to add noise to speech recordings at a desired SNR (signal-to-noise ratio). The SNR can be calculated using frequency weighting (G.712 or A-weighting), and the speech energy is calculated following ITU recommendation P.56. The tool can also be used to filter speech with certain frequency characteristics defined by the ITU for telephone equipment. This tool was used to create the noisy data for the popular AURORA 2 speech recognition corpus.
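The basic idea of adding noise at a desired SNR can be sketched as follows. This is not FaNT's code and it is much simpler than FaNT: it uses plain mean power over the whole signal, with no frequency weighting and no P.56 speech level measurement. The function names and the synthetic test signals are assumptions for illustration.

```python
import math
import random


def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the given SNR.

    The SNR here is 10*log10(speech power / scaled noise power), with
    both powers averaged over the whole signal.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    ps, pn = power(speech), power(noise)
    if ps == 0.0 or pn == 0.0:
        raise ValueError("speech and noise must both have nonzero power")
    gain = math.sqrt(ps / (pn * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measure_snr_db(speech, noisy):
    """SNR in dB of `noisy`, given the clean reference `speech`."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10.0 * math.log10(power(speech) / power(residual))


# Synthetic stand-ins for real recordings: a 440 Hz tone sampled at
# 8 kHz as "speech" and Gaussian white noise as "noise".
rng = random.Random(0)
speech = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

noisy = mix_at_snr(speech, noise, 10.0)
print("measured SNR (dB): %.3f" % measure_snr_db(speech, noisy))
```

Because real speech contains pauses, a whole-signal power average understates the level of the active speech, which is one reason a tool such as FaNT measures speech energy following P.56 instead.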

Acoustic impulse responses

This page, created by James Hopgood, is a collection of acoustic impulse responses for simulating convolutional distortion. The focus is on hands-free / far-field acoustic conditions.

Some past speech recognition work (by Shire, Kingsbury, Avendano, Palomaki, Morgan, Chen, Gelbart, possibly others) has been done using a set of impulse responses collected using four mics and various degrees of reverberation in the varechoic chamber at Bell Labs. They can be downloaded here.

Another set of impulse responses from the Bell Labs varechoic chamber, using 31 speaker positions and a linear 22-element microphone array, has been made available by Aki Harma here.

More acoustic impulse responses from various rooms, including microphone array situations, are available as part of the Sound Scene Database in Real Acoustical Environments from the Real World Computing Partnership, here.

The site noisevault.com has acoustic impulse responses as well as links to software and documents regarding impulse response measurement and acoustic simulation; it seems aimed at audio engineers and audio engineering hobbyists.

Three impulse responses measured in two meeting rooms are available as part of ITU-T G.191 Annex A, also known as the ITU-T Software Tools Library (STL). These impulse responses are in the 2005 STL release (STL2005) but not in earlier releases. The acoustics of the meeting rooms are described in the STL user's manual.
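Simulating convolutional distortion with a measured impulse response amounts to convolving the clean signal with that response. The Python sketch below shows the operation with direct time-domain convolution. The function name and the toy impulse response (a direct path followed by two decaying echoes) are hypothetical and are not taken from any of the collections listed here.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of a signal with an impulse response.

    The output has len(signal) + len(impulse_response) - 1 samples, so
    the reverberant tail after the end of the signal is preserved.
    """
    if not signal or not impulse_response:
        raise ValueError("both inputs must be non-empty")
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Hypothetical impulse response: direct path, then echoes arriving
# 3 and 6 samples later at half and quarter amplitude.
toy_response = [1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.25]

# A short "clean" signal: a single click followed by silence.
clean = [1.0, 0.0, 0.0, 0.0]

print(convolve(clean, toy_response))
```

Direct convolution costs on the order of len(signal) times len(impulse_response) operations. For real recordings and room responses that are thousands of samples long, FFT-based convolution from a numerical library is the usual choice.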

University of Kentucky Microphone Array Processing Toolbox

This Matlab toolbox allows simulation of different room geometries, microphone locations, and speaker locations. It also includes routines for microphone array sound processing, microphone placement calculation, measuring RT60, and measuring sound velocity.

Room acoustics simulator

The AudioGroup at the University of Patras have placed public domain room acoustics simulators online here.

Additive Noise Sources

The CSLU Robust Speech Processing Laboratory software repository page hosts a package named Additive Noise Sources which contains noise files for use in simulating additive noise.

NOISEX noises

This page at Rice has a set of downloadable noises. I think these are from the NOISEX-92 collection, but I don't know if this is the complete collection. I am not trying to give a comprehensive list of corpora on this page, but this page in the comp.speech FAQ has some links.

ShATR multiple simultaneous speaker corpus

Here. "ShATR is a corpus of overlapped speech collected by the University of Sheffield Speech and Hearing Research Group in collaboration with ATR in order to support research into computational auditory scene analysis. The task involved four participants working in pairs to solve two crosswords. A fifth participant acted as a hint-giver. Eight channels of audio data were recorded from the following sensors: one close microphone per speaker, one omnidirectional microphone, and the two channels of a binaurally-wired mannekin. Around 41% of the corpus contains overlapped speakers. In addition, a variety of other audio data was collected from each participant. The entire corpus, which has a duration of around 37 minutes, has been segmented and transcribed at 5 levels, from subtasks down to phones. In addition, all nonspeech sounds have been marked."

Meeting room digits recordings

Here is a set of connected digits (like TIDIGITS) recordings made with table-top microphones in a conference room at the International Computer Science Institute. (This audio data is also available from the LDC as part of the ICSI Meeting Corpus.)

Pitch and Voicing Estimates for Aurora 2

The Pitch and Voicing Estimates for Aurora 2 archive from Microsoft Research "consists of a set of pitch period and voicing estimates for utterances found in the Aurora 2 corpus". The algorithm used was described in J. Droppo and A. Acero, "Maximum a Posteriori Pitch Tracking" in ICSLP 1998.

A brief list of resources that are not specific to noise and channel robustness

ISCA resources page, WaveSurfer speech visualization tool (view waveforms, spectrograms, formant tracks, pitch tracks) and other KTH-hosted software, HTK recognizer, SPHINX recognizer, JULIUS recognizer, Edinburgh Speech Tools Library, ISIP recognizer and ISIP Foundation Classes for speech processing, CSLR SONIC recognizer, CMU-Cambridge Statistical Language Modeling toolkit, SRILM - The SRI Language Modeling Toolkit, ICSI speech software package (link above under "RASTA and MSG"), COLEA Matlab Tool for Speech Analysis, some more links to tools here.

For collections of speech recordings: Linguistic Data Consortium, European Language Resources Association (despite the name they have non-European languages too), Corpora Group at CSLU, VoxForge (recordings available at no cost), LibriVox (recordings available at no cost), VoxForge's list of corpora, and also see the Corpora listings on the ISCA Students web site.

A list of phonetics tutorials and speech processing tutorials and software