ICSI Speech FAQ:
4.1 How is the SNR of a speech example defined?

Answer by: dpwe - 2000-01-03

SNR (signal-to-noise ratio) is a standard measure of the amount of background noise present in a speech (or other) signal. It is defined as the ratio of signal intensity to noise intensity, expressed in decibels:

    SNR_dB = 20.log10(S_rms / N_rms)

where S_rms is the root-mean-square level of the speech signal (without any noise present), i.e. sqrt((1/K)*sum(s[n]^2)) over its K samples, and N_rms is the root-mean-square level of the noise without speech. This is equal to:

    SNR_dB = 10.log10(S_e / N_e)

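where S_e is the total energy of the speech, i.e. sum(s[n]^2), and N_e is the total energy of the noise, both measured over the same number of samples so that the normalization by signal length cancels. To distinguish it from the refined measure described below, we will call this the global SNR.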
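As a minimal sketch of the two equivalent formulas above, the following computes the global SNR from both the RMS ratio and the energy ratio. The function names and the toy signals are illustrative assumptions, not part of any ICSI tool; the signals are simple alternating sequences rather than real speech.

```python
import math

def energy(x):
    """Total energy: sum of squared sample values."""
    return sum(v * v for v in x)

def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(energy(x) / len(x))

def global_snr_db(speech, noise):
    """Global SNR in dB for clean speech and noise of equal length."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    return 10.0 * math.log10(energy(speech) / energy(noise))

# Toy signals (not real speech): amplitude 1.0 against amplitude 0.1.
speech = [1.0, -1.0] * 4000
noise = [0.1, -0.1] * 4000

snr_from_energy = global_snr_db(speech, noise)
snr_from_rms = 20.0 * math.log10(rms(speech) / rms(noise))
print(snr_from_energy, snr_from_rms)
```

Both values agree because the two signals have the same number of samples, so the division by length inside the RMS cancels in the ratio.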

The difficulty with this measure comes from the highly nonuniform nature of speech. Consider an utterance of 1 second duration; it has a certain energy E. We can construct a noise-corrupted version at a given SNR by finding some noise sample (say white noise, or a recording of ambience in a moving car) of the same duration, scaling its level to obtain the desired SNR according to the above equations, and then adding the two together.

If we then consider a second version of the speech example with 1 second of silence (zero-valued samples) added to make its total duration 2 seconds, its total energy is unchanged. However, to make a 2 second sample of the noise that has the same total energy as the 1 second example, we would need to reduce its amplitude by about 30% (a factor of 1/sqrt(2)), so that the sum of the squared noise samples is the same when twice as many samples are involved. This is a real problem: the actual level of noise added to achieve a given global SNR depends strongly on the amount of padding added to (or, in general, silence present in) the speech example. Much confusion has resulted from SNR levels quoted in papers that have fallen foul of this ambiguity.
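The padding effect can be checked numerically. The sketch below is only an illustration under stated assumptions: an 8 kHz sample rate, a sine tone standing in for the 1 second utterance, Gaussian white noise, and a hypothetical helper noise_gain_for_global_snr. The 2 second noise is the 1 second noise repeated, so both have the same level before scaling.

```python
import math
import random

def energy(x):
    """Total energy: sum of squared sample values."""
    return sum(v * v for v in x)

def noise_gain_for_global_snr(speech, noise, snr_db):
    """Gain to apply to `noise` so that the global SNR equals snr_db."""
    target_noise_energy = energy(speech) / (10.0 ** (snr_db / 10.0))
    return math.sqrt(target_noise_energy / energy(noise))

def mix(speech, noise, gain):
    """Add scaled noise to speech, sample by sample."""
    return [s + gain * n for s, n in zip(speech, noise)]

SAMPLE_RATE = 8000  # assumed sample rate in Hz
rng = random.Random(0)

# Stand-in for a 1 second utterance (a 200 Hz tone, not real speech).
speech_1s = [math.sin(2.0 * math.pi * 200.0 * n / SAMPLE_RATE)
             for n in range(SAMPLE_RATE)]
noise_1s = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_RATE)]

# Same utterance padded with 1 second of silence; noise of the same
# level but twice the duration.
speech_2s = speech_1s + [0.0] * SAMPLE_RATE
noise_2s = noise_1s + noise_1s

gain_1s = noise_gain_for_global_snr(speech_1s, noise_1s, 10.0)
gain_2s = noise_gain_for_global_snr(speech_2s, noise_2s, 10.0)

# The padded case needs the noise turned down by a factor of 1/sqrt(2).
gain_ratio = gain_2s / gain_1s
noisy_2s = mix(speech_2s, noise_2s, gain_2s)
print(gain_ratio)
```

Both mixtures are nominally "10 dB global SNR", yet the noise under the speech itself is about 3 dB quieter in the padded version.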

The solution is to find a definition for speech SNR that does not vary when silence is added to the speech. One approach is to exclude silent or quiet portions of the speech from the calculation. We could call this local SNR, since it measures the SNR only over certain ranges of the signal. For instance, we could first calculate the energy profile of the speech over 25 ms frames. We then discard, as 'silence', any frames whose energy is smaller than some fixed fraction (say 25%) of the energy of the highest-energy frame. The energy of the speech signal is then defined as the energy in the non-discarded frames (which will be only slightly less than the total energy of the sample, since we have discarded only low-energy frames).

Then, crucially, the energy of the noise signal is calculated over the same frames, or at least for a segment of noise that has the same total duration as the speech signal after the silent frames have been discarded. In this way, padding a speech signal with silence will have no effect on the amplitude of the noise added to achieve a particular SNR. Thus, the local SNR value measures the signal-to-noise ratio specifically for the non-silent portions of the signal. It is likely to be significantly higher than the global SNR measured over the entire signal including silent regions.
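The local SNR procedure described above can be sketched as follows. This is one possible reading, not a reference implementation: it assumes non-overlapping frames, a frame length of 200 samples (25 ms at an assumed 8 kHz sample rate), and the 25% threshold from the text. The function names are hypothetical.

```python
import math

def frame_energies(x, frame_len):
    """Energy of consecutive non-overlapping frames.

    A trailing partial frame, if any, is dropped.
    """
    return [sum(v * v for v in x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, frame_len)]

def local_snr_db(speech, noise, frame_len, threshold=0.25):
    """SNR in dB over the frames where the clean speech is 'active'.

    A frame is active if its speech energy is at least `threshold`
    times the energy of the highest-energy speech frame. Speech and
    noise energies are both summed over those same frames.
    """
    s_e = frame_energies(speech, frame_len)
    n_e = frame_energies(noise, frame_len)
    if len(s_e) != len(n_e):
        raise ValueError("speech and noise must have the same length")
    floor = threshold * max(s_e)
    active = [i for i, e in enumerate(s_e) if e >= floor]
    speech_energy = sum(s_e[i] for i in active)
    noise_energy = sum(n_e[i] for i in active)
    return 10.0 * math.log10(speech_energy / noise_energy)

FRAME_LEN = 200  # 25 ms at an assumed 8 kHz sample rate

# Toy signals (not real speech): two frames of "speech" plus noise.
speech = [1.0, -1.0] * 200
noise = [0.1, -0.1] * 200
snr_unpadded = local_snr_db(speech, noise, FRAME_LEN)

# Pad the speech with two frames of silence; extend the noise to match.
speech_padded = speech + [0.0] * 400
noise_padded = noise + noise
snr_padded = local_snr_db(speech_padded, noise_padded, FRAME_LEN)
print(snr_unpadded, snr_padded)
```

The two values are the same, since the silent frames fall below the threshold and are excluded from both sums. The global SNR of the padded pair would be about 3 dB lower.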

A local SNR of 30dB is effectively a clean signal. Listeners will barely notice anything better than 20dB, and intelligibility is still pretty good at 0dB SNR (speech energy and noise energy the same). These numbers depend on the type of noise; competing speech or babble is the most disruptive for a given energy, since it matches the spectral distribution (and modulation dynamics) of the target speech. Conventional speech recognizers are much more sensitive than listeners, and typically show significantly increased word error rates at 20dB SNR.

If you want to estimate the SNR of a speech signal that already has noise added, you can exploit the property of speech containing many silent gaps to obtain 'glimpses' of the background noise alone. Assuming stationary noise, this then allows you to estimate the background noise level, and thus the total SNR. This is the basis of the SNR estimation procedure developed by Hans-Gunter Hirsch and described in ICSI Technical Report 93-012.
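To illustrate the general idea of using quiet gaps as glimpses of the noise, here is a deliberately crude sketch. It is not the procedure from the technical report cited above. It assumes stationary noise that is uncorrelated with the speech, and it simply treats the lowest-energy frames as noise-only; the function name and the quiet_fraction parameter are assumptions made for this example. On real signals, picking the lowest-energy frames tends to underestimate the noise level somewhat.

```python
import math

def frame_energies(x, frame_len):
    """Energy of consecutive non-overlapping frames (partial tail dropped)."""
    return [sum(v * v for v in x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, frame_len)]

def estimate_global_snr_db(noisy, frame_len, quiet_fraction=0.2):
    """Rough global SNR estimate from a noisy signal alone.

    The quietest `quiet_fraction` of frames are assumed to contain
    only noise. Their mean energy is taken as the per-frame noise
    energy, and the remainder of the total energy is attributed to
    speech.
    """
    energies = frame_energies(noisy, frame_len)
    n_quiet = max(1, int(len(energies) * quiet_fraction))
    quiet = sorted(energies)[:n_quiet]
    noise_total = (sum(quiet) / n_quiet) * len(energies)
    speech_total = sum(energies) - noise_total
    if speech_total <= 0.0 or noise_total <= 0.0:
        raise ValueError("cannot separate speech from noise in this signal")
    return 10.0 * math.log10(speech_total / noise_total)

FRAME_LEN = 200  # 25 ms at an assumed 8 kHz sample rate

# Toy signal: alternating frames of "speech" and silence, plus a
# steady low-level noise chosen to be orthogonal to the speech.
speech_frame = [1.0, 1.0, -1.0, -1.0] * 50
silence_frame = [0.0] * FRAME_LEN
speech = (speech_frame + silence_frame) * 5
noise = [0.1, -0.1] * 1000
noisy = [s + n for s, n in zip(speech, noise)]

estimated_snr = estimate_global_snr_db(noisy, FRAME_LEN)
print(estimated_snr)
```

In this toy case the estimate matches the true global SNR, because the noise energy is identical in every frame and half of the frames contain no speech at all.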

Note: I believe this is the convention used in all speech SNR calculations, but I have been unable to find a reference. If you know better, please let me know - dpwe@ee.columbia.edu .

2000-01-05: George Doddington <doddington@nist.gov> adds:
While the "duty cycle" of speech (i.e., how much silence is included) is a significant issue, an even more significant issue is the shape of the noise spectrum. This can vary wildly, with white noise being very bad and typical room noise being relatively benign. By this I mean that ASR degradation will be far greater for white noise than for low frequency noise at the same SNR value. This is true for human as well as ASR performance.

A common method for minimizing this variability is to weight the signal and noise energy as a function of frequency. There is a standard "A weighting" to achieve this, which can be approximated reasonably by measuring the energy in the first-order difference signal. Of course, without this frequency weighting, it becomes possible to show really good noise robustness numbers -- just use a noise signal with a lot of very low frequency energy! But for the scientifically inclined, frequency weighting is de rigueur.
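The first-order difference approximation mentioned above can be sketched like this. It is an illustration only, not a true A-weighting filter. It assumes an 8 kHz sample rate and uses pure tones as stand-ins for speech and noise; the function names are hypothetical.

```python
import math

def energy(x):
    """Total energy: sum of squared sample values."""
    return sum(v * v for v in x)

def first_difference(x):
    """First-order difference x[n] - x[n-1], which attenuates low frequencies."""
    return [x[n] - x[n - 1] for n in range(1, len(x))]

def snr_db(speech, noise):
    """Unweighted SNR in dB from the energy ratio."""
    return 10.0 * math.log10(energy(speech) / energy(noise))

def differenced_snr_db(speech, noise):
    """SNR in dB measured on the first-order difference of each signal."""
    return snr_db(first_difference(speech), first_difference(noise))

SAMPLE_RATE = 8000  # assumed sample rate in Hz

def tone(freq_hz, n_samples=SAMPLE_RATE):
    """One second (by default) of a unit-amplitude sine tone."""
    return [math.sin(2.0 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]

speech_like = tone(1000.0)   # stand-in for speech energy
low_noise = tone(50.0)       # very low frequency "noise"
high_noise = tone(3000.0)    # noise inside the speech band

# Unweighted, both noises give the same SNR (equal energies).
plain_low = snr_db(speech_like, low_noise)
plain_high = snr_db(speech_like, high_noise)

# After differencing, the low-frequency noise counts for far less.
weighted_low = differenced_snr_db(speech_like, low_noise)
weighted_high = differenced_snr_db(speech_like, high_noise)
print(plain_low, plain_high, weighted_low, weighted_high)
```

The unweighted figures cannot tell the two noises apart, while the differenced figures rate the low-frequency noise as much less harmful, which is the direction of the effect described above.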

2001-08-29: Dave Gelbart adds:
The NIST SPeech Quality Assurance (SPQA) Package includes programs for SNR estimation. It's available here.


Generated by build-faq-index on Tue Mar 24 16:18:15 PDT 2009