Post-processing of dereverberation/denoising algorithms to reduce artifact noise, using a time-frequency mask (David Gelbart, 2004-2005)

I experimented with this approach in 2004-2005, but I never found the time to measure automatic speech recognition (ASR) accuracy results for it. And since I left the speech research field years ago, I will not be doing so in the future. So I have put this description on the web, in case anyone is interested. If you try it out, please let me know your results so I can mention them here. You can contact me at david.gelbart@gmail.com.

Various speech processing researchers have found time-frequency masks useful for removing unwanted energy. One example is Shi and Aarabi's ICASSP 2003 paper (available online here; see here for code and more ASR results), in which a mask of weights was applied to a time-frequency grid (one weight for each cell in the grid) by multiplying the values in the cells by the weights. (By time-frequency grid I mean a spectrogram-type representation.) The weights were set closer to 0 the more unwanted energy was judged to be in the cell.

Considering Shi and Aarabi's work, and reflecting on an observation that Michael Kleinschmidt made to me about how artifact noise in the output of a dereverberation algorithm might be speech energy smeared out of place in the spectrogram, I decided to try the use of masking described on this web page: a particular way to use a weight mask to remove artifact noise from the output of dereverberation/denoising algorithms, for speech enhancement and speech recognition. I came up with this because I was concerned about artifacts in the output of the "long-term log spectral subtraction" (LTLSS) algorithm created by Avendano et al. A picture showing artifacts can be found in the Avendano et al. EUROSPEECH 1997 paper (the PDF here is better quality than the one on the ISCA site).

By artifact noise, I mean noise that is present in the output of the algorithm but not in the original signal. Artifact noise might be introduced by

To calculate a weight mask to remove artifact noise, I propose applying a rule of thumb based on the observation that dereverberation/denoising algorithms are meant to remove energy from places in the time-frequency grid. (Reverberation spreads energy to new, later places in the t-f grid. Additive noise adds energy.) The rule of thumb is that there should not be energy in a t-f cell of the output that was not in that cell of the original (input) signal. If we see more energy in a cell in the output of the algorithm than was present in that cell in the input, this rule of thumb tells us to treat the excess energy as artifact noise.

This rule of thumb is not perfect, because

I have achieved interesting results (an audible reduction in artifact noise) using a mask inspired by this rule of thumb to post-process the output of the LTLSS algorithm (pre-reverberation audio, reverberated input to LTLSS, output of LTLSS, output of LTLSS after the mask post-processing). The LTLSS algorithm, which is intended for automatic speech recognition rather than human listening, is based on using longer window lengths than the usual 20-32 ms window lengths of ASR, and this artifact noise removal algorithm has a bias towards making the output more like what you would get using a 32 ms window. However, this bias is only a "half-wave" bias, in the sense of a half-wave rectifier (the mask can only remove energy, never add it), so I am hoping it could remove some of the bad qualities (artifacts) of the long LTLSS windows while keeping some of the good qualities. But I have not done ASR tests on this yet. I haven't even ruled out the possibility that this artifact noise removal might actually reduce ASR performance when applied to both training and test data with LTLSS -- since I have often been using very clean training data, and leaving the artifact noise in that data might better prepare the system for reverberant test data -- so I would like to run two experiments, one in which it is applied to both training and test data, and one in which it is applied only to the test data.

In a sense, this artifact noise removal algorithm is a way to address the trade-off between a 32 ms window and a longer window; another, conceptually quite different way to address that trade-off would be to choose a window length somewhere in between.

I also tried post-processing the audio examples from Nakatani and Miyoshi's ICASSP 2003 dereverberation paper; the results were mixed, with the post-processing seeming to improve quality in some cases, have a neutral effect in some cases, and worsen quality in some cases.

I have placed Matlab code to perform this post-processing here. (The code is archived with zip; it can be unpacked using unzip on a Unix or Linux system or WinZip, etc., on a Windows system.)

The file Go.m first loads two waveforms (the reverberated input to LTLSS and the output of LTLSS linked to above). Then it calculates magnitude spectrograms, magnitude spectrograms with spectral normalization, and phase spectrograms for each waveform. The spectral normalization subtracts off the mean (over frames) of the log magnitude in each frequency bin. This normalization removes spectral tilt, since after the normalization every frequency bin has the same mean log magnitude, zero (and correspondingly each bin has the same mean linear magnitude, one, using the geometric mean instead of the arithmetic mean).
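As a small sketch of that normalization step (my own illustration, not the Go.m code itself; mag stands for a magnitude spectrogram stored with one row per frequency bin and one column per frame):

logMag       = log(mag + eps);    % eps guards against log(0)
binMeans     = mean(logMag, 2);   % mean log magnitude of each frequency bin
logMagNormed = logMag - repmat(binMeans, 1, size(logMag, 2));
magNormed    = exp(logMagNormed); % back to the linear magnitude domain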

To find the spectrogram cells where there is more energy after the LTLSS than before the LTLSS, the code subtracts the spectrally normalized magnitude spectrogram of the input to the LTLSS from that of the LTLSS output, setting negative values to zero:
diff(t,f) = max(0, outMagNormed(t,f) - inMagNormed(t,f)).
With some denoising/dereverberation algorithms, it will be better not to use the spectral normalization:
diff(t,f) = max(0, outMag(t,f) - inMag(t,f)).
I have put more discussion of this here.

The code next defines a time-frequency mask:
mask(t,f) = 1 / (1 + gamma * diff(t,f)).

The mask is then applied to the magnitude spectrogram (the non-spectrally-normalized magnitude spectrogram in the following code fragment, but the spectrally-normalized magnitude spectrogram could also be used):
outMagMasked(t,f) = outMag(t,f) * mask(t,f).

Finally, outMagMasked is combined with the phase spectrogram of the LTLSS output to create a complex DFT spectrogram, from which a time-domain waveform is calculated.
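For readers who would like to see the whole chain in one place, here is a minimal end-to-end sketch in Matlab. It is my own illustration of the steps described above, not the Go.m code itself: the file names are placeholders, a mono recording is assumed, and audioread is the modern replacement for the file I/O functions of that era.

[x, fs] = audioread('algorithm_input.wav');   % input to the enhancement algorithm
y       = audioread('algorithm_output.wav');  % output of the enhancement algorithm

wlen  = round(0.032 * fs);     % 32 ms analysis window
hop   = round(0.008 * fs);     % 8 ms window advance
win   = hanning(wlen);
gamma = 100;                   % mask sharpness; see the discussion of gamma below

% Windowed DFT analysis of both waveforms.
nFrames = floor((min(length(x), length(y)) - wlen) / hop) + 1;
X = zeros(wlen, nFrames);
Y = zeros(wlen, nFrames);
for k = 1:nFrames
    idx = (k-1)*hop + (1:wlen)';
    X(:,k) = fft(win .* x(idx));
    Y(:,k) = fft(win .* y(idx));
end

% Spectrally normalized magnitudes: zero-mean log magnitude in each bin.
logX = log(abs(X) + eps);
logY = log(abs(Y) + eps);
inMagNormed  = exp(logX - repmat(mean(logX, 2), 1, nFrames));
outMagNormed = exp(logY - repmat(mean(logY, 2), 1, nFrames));

% Energy present in the output but not in the input is treated as artifact.
diffMag = max(0, outMagNormed - inMagNormed);
mask    = 1 ./ (1 + gamma * diffMag);

% Apply the mask to the output magnitude and keep the output phase.
Ymasked = (abs(Y) .* mask) .* exp(1i * angle(Y));

% Overlap-add resynthesis back to a time-domain waveform.
out = zeros((nFrames-1)*hop + wlen, 1);
for k = 1:nFrames
    idx = (k-1)*hop + (1:wlen)';
    out(idx) = out(idx) + real(ifft(Ymasked(:,k)));
end
out = out / 2;    % Hanning windows at 75% overlap sum to roughly 2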

Perhaps you have an algorithm with which you would like to try this approach yourself. Given input and output waveforms for some algorithm with an artifact noise problem, it should take only minutes to try the Matlab code for the first time. However, there are some parameters that may need adjustment.

First, the Matlab code performs time-frequency analysis using Hanning-windowed DFTs with a 32 ms window length and an 8 ms window advance. The 32 ms window length may not be optimal; I suspect a shorter window length will sometimes perform better. (32 ms corresponds to 256-point DFTs at an 8000 Hz sampling rate, or 512-point DFTs at 16000 Hz. I haven't tested window lengths such that the DFT length is not a power of 2, although I think those would work, just more slowly.)
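If a shorter window is worth trying, one way to keep power-of-2 DFT lengths is to zero-pad each windowed frame, which Matlab's fft does when given an explicit DFT length. A small sketch of mine (the 20 ms length and 8000 Hz rate are just example values):

fs    = 8000;                    % assumed sampling rate
wlen  = round(0.020 * fs);       % 20 ms window -> 160 samples, not a power of 2
nfft  = 2^nextpow2(wlen);        % 256-point DFT; the frame is zero-padded
win   = hanning(wlen);
frame = randn(wlen, 1);          % stand-in for one frame of speech samples
spec  = fft(win .* frame, nfft); % fft() pads with zeros up to nfft points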

Second, there are many possible variations on how the masking weights are calculated, such as

In my experiments with the LTLSS I started with the way of calculating the weights that is currently in the Matlab code (i.e., using 1/(1+gamma*diff(t,f)) as the weighting factor), and I did not need to make any adjustments except for increasing gamma from my starting value of gamma = 1. I found that gamma = 100 gave better performance, so 100 is now the value in the Matlab code. Perhaps a higher or lower value of gamma will be better for post-processing of other algorithms or other data.
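To illustrate what gamma does (the numbers here are a worked example of mine, not measurements): for a cell with diff(t,f) = 0.1, gamma = 1 gives mask(t,f) = 1/1.1, about 0.91, which is less than 1 dB of attenuation, while gamma = 100 gives mask(t,f) = 1/11, about 0.09, which is roughly 21 dB of attenuation. Larger gamma values therefore suppress suspected artifact cells much more aggressively.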

If you try this post-processing method out, please let me know whether or not it helps. A tool like KTH's WaveSurfer can be very useful for judging the effects; WaveSurfer displays spectrograms clearly. I hope that no one overestimates the computational cost of the post-processing based on the provided Matlab code. The Matlab code performs spectral analysis via DFT, then the masking, then resynthesis via inverse DFT. Many dereverberation/denoising algorithms already compute a magnitude spectrogram of the input and create a magnitude spectrogram of the output before going back to the time domain, so it may be possible to do the masking without introducing any additional DFTs and inverse DFTs. Furthermore, the Matlab code redundantly operates on both positive and negative frequencies, even though they have equal magnitude spectral values for real-valued signals.
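To illustrate that last point, here is a small sketch of mine (not part of the provided code) that masks only the non-negative-frequency bins of one frame and rebuilds the negative frequencies by conjugate symmetry; the DFT length N is assumed to be even:

N        = 512;                          % DFT length (assumed even)
frame    = randn(N, 1);                  % stand-in for one windowed frame
maskHalf = ones(N/2 + 1, 1);             % weights for bins 0 .. N/2 only
spec     = fft(frame);
half     = spec(1:N/2+1) .* maskHalf;    % mask one side of the spectrum
full     = [half; conj(half(N/2:-1:2))]; % mirror to restore conjugate symmetry
frameOut = real(ifft(full));             % imaginary part is only round-off error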

Be very careful when writing the output audio to files from Matlab, since the processing may change the scale of the samples, which could result in saturation (where a sample becomes too large to be represented as a 16-bit value in an audio file). Saturation can be very bad for listening quality and for automatic speech recognition performance. The easiest solution is to write the output as floating point instead of 16-bit samples.
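For example (a sketch of mine using the modern audiowrite function, which expects floating-point samples in [-1, 1]; the 0.99 headroom value is an arbitrary choice):

% Option 1: write floating-point samples so the scale need not change.
% audiowrite('masked_output.wav', out, fs, 'BitsPerSample', 32);
% Option 2: peak-normalize, then write ordinary 16-bit samples.
peak = max(abs(out));
out  = out * (0.99 / max(peak, 0.99));    % rescales only if peak > 0.99
audiowrite('masked_output.wav', out, fs); % 16-bit PCM by default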

Another, related approach to reducing artifact noise would be to use speech detection on the input and force silence in the output when there is no speech. A possible advantage of speech detection is that usable speech detection information may already be available in an application. (Any speech recognition application that does not use push-to-talk presumably includes some kind of speech detection. Telephony applications may include speech detection as part of a compression or echo cancellation algorithm.) A possible advantage of the mask post-processing over speech detection is that the masking is local in both time and frequency, while speech detection is only local in time. Another possible advantage is that the time resolution of the mask is higher than that of some speech detection algorithms. However, I don't know whether or not this increased locality and resolution would improve the quality perceived by human listeners.

The principle used in this post-processing method is related to the principle of Additional Signal Attenuation During Nonspeech Activity described in "Suppression of Acoustic Noise in Speech Using Spectral Subtraction" by Boll (IEEE Trans. on Acoustics, Speech and Signal Processing, April 1979). Boll, considering residual noise rather than artifact noise, makes interesting comments about the possible value of preserving some residual noise.

An aside: the musical noise problem

With noise reduction by Wiener filtering / spectral subtraction, people struggled in the past with "musical noise" artifacts, and developed some theory of the causes. Perhaps this example can provide inspiration to those dealing with artifacts in newer algorithms.

See for example:

Phil S. Whitehead, David V. Anderson, and Mark A. Clements, "Adaptive, acoustic noise suppression for speech enhancement", ICME 2003 (www.imtc.gatech.edu/projects/technology/media/icme2003.pdf)

If I recall correctly, in

L. Arslan, A. McCree and V. Viswanathan, "New Methods for Adaptive Noise Suppression", ICASSP 1995

the authors speculated that the musical noise was caused by temporal discontinuity in the noise suppression filter (in other words, variations in the noise suppression filter from frame to frame). (For example, I think a noise suppression method which subtracts a noise spectral magnitude estimate can be reformulated, with some algebra, as multiplication by a filter, allowing us to apply our understanding that a constant filter wouldn't cause those artifacts.)
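To spell that algebra out in my own notation, with inMag the noisy input magnitude and noiseEst the estimated noise magnitude:
outMag(t,f) = max(0, inMag(t,f) - noiseEst(f)) = H(t,f) * inMag(t,f), where
H(t,f) = max(0, 1 - noiseEst(f) / inMag(t,f)).
Since inMag(t,f) fluctuates from frame to frame, the effective filter H(t,f) fluctuates too, and that fluctuation is the suspected source of the musical noise.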

Such discontinuity can be addressed by smoothing (low-pass filtering) of the filter. In

A. Adami, L. Burget, S. Dupont, H. Garudadri, F. Grezl, H. Hermansky, P. Jain, S. Kajarekar, N. Morgan, and S. Sivadas, "Qualcomm-ICSI-OGI Features for ASR", ICSLP 2002

the Wiener filter used for noise reduction was smoothed across both time and frequency, and when I have listened to enhanced speech from that system I have never heard any musical noise.
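As a minimal sketch of that kind of smoothing (my own illustration; the kernel size is an arbitrary choice, not the one used in the Qualcomm-ICSI-OGI system), a normalized box filter can low-pass a matrix of suppression gains across both frequency (rows) and time (columns):

H       = rand(257, 200);            % stand-in gain matrix (bins x frames)
kernel  = ones(3, 5) / (3 * 5);      % 3 bins by 5 frames, sums to one
Hsmooth = conv2(H, kernel, 'same');  % 2-D moving average of the gains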

Here are other papers that were recommended to me on the topic of the musical noise problem:

Klaus Linhard and Heinz Klemm, "Noise reduction with spectral subtraction and median filtering for suppression of musical tones", Proc. ESCA-NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, pp. 159-162, Pont-à-Mousson, France, April 1997.

O. Cappé, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor", IEEE Trans. on Speech and Audio Processing, vol. 2, no. 2, pp. 345-349, April 1994.