The meeting recorder data consists of up to 16 simultaneous recordings from headset and ambient microphones. The purpose of the headset microphones is to provide a 'clean' signal that can be used as a best case for speech recognition, and to supply unambiguous information about who is speaking when. However, both of these ends have been compromised by the problem of crosstalk - one person's microphone picking up the speech of other speakers, typically their neighbors. Although this crosstalk is usually at a much lower level, it can be sufficiently loud to confuse recognizers listening to that one channel. The Lavalier lapel mic is particularly poor in this regard, offering somewhat less than 10 dB of advantage to the voice of the wearer.
We do not, however, have to deal with a single channel in isolation. When processing the meeting recordings for turn identification or recognition, we can refer to the other, simultaneously recorded channels to improve performance. In particular, we would expect the crosstalk pickup of a neighbor's voice to be very closely related to the signal received by the close-talking mic worn by that neighbor.
One attractive avenue for ameliorating the negative effects of crosstalk would be to modify the 'clean' version of the neighbor's speech picked up at their local microphone, and use it to exactly cancel the crosstalk version being picked up by the microphone under consideration. If this coupling can be accurately estimated, it should be possible to recover something very close to 'pure' versions of each speaker's voice, with crosstalk essentially removed. These pure signals would make determining speaker activity trivial, and would be the most suitable basis for recognition of each speaker. The goal of crosstalk cancellation is to estimate the coupling and derive these signals.
One simple but powerful way to achieve crosstalk cancellation is the so-called 'Block Least Squares' technique described in (Woudenberg et al. 1999). Solving for the minimum-squared-error linear predictor of signal Y using signal X gives an expression for an FIR coupling filter that is the inverse of a Toeplitz autocorrelation matrix for X multiplied by the cross-correlation between X and Y. The number of lag values for which the auto- and cross-correlations are calculated determines the length of the estimated filter. Over the time window used in the calculation of the correlations (which can be very much larger than the number of lag values sampled), the resulting filter optimally reduces the energy of Y by subtracting the filtered version of X, just as we wish to do.
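As a rough illustration of the estimate just described, here is a minimal NumPy sketch. It is not the code used for the results on this page: the function names `block_least_squares_filter` and `cancel`, the use of plain correlation sums over the whole window, and the assumption that the autocorrelation matrix is invertible are all choices made for this example.

```python
import numpy as np


def block_least_squares_filter(x, y, n_taps):
    """Estimate an FIR filter h (length n_taps) so that conv(x, h) predicts y.

    x is the reference channel, y the target channel; both cover the same
    time window and have the same length (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Autocorrelation of x at lags 0..n_taps-1: r_xx[k] = sum_n x[n] x[n-k]
    r_xx = np.array([np.dot(x[k:], x[:n - k]) for k in range(n_taps)])
    # Cross-correlation between x and y: r_xy[k] = sum_n y[n] x[n-k]
    r_xy = np.array([np.dot(y[k:], x[:n - k]) for k in range(n_taps)])
    # Toeplitz autocorrelation matrix: R[i, j] = r_xx[|i - j|]
    lags = np.abs(np.arange(n_taps)[:, None] - np.arange(n_taps)[None, :])
    R = r_xx[lags]
    # h = R^{-1} r_xy, computed with a linear solve rather than an explicit inverse
    return np.linalg.solve(R, r_xy)


def cancel(x, y, h):
    """Subtract the filtered reference from the target, returning the residual."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return y - np.convolve(x, h)[:len(y)]
```

The number of taps sets the filter length, while the length of `x` and `y` sets the estimation window, so the two can be chosen independently as noted above.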
However, the filter so estimated will contain artefacts if there is energy present in the target mic that has a chance correlation with the reference channel. To some extent, this is made unlikely by estimating the cancellation over a long time frame; however it can still occur. In this case, the filter is able to 'reuse' some of the energy from X to cancel out energy in Y that is not in fact crosstalk, such as the local speaker's voice. This is of course undesirable. Typically, some kind of parameter 'freezing' is applied when significant energy from the local source is detected to avoid this effect.
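The text does not say how local-source energy is detected, so the following is only one hypothetical way such a 'freeze' decision might look. The function name `should_freeze`, the energy-ratio test, and the 6 dB default margin are assumptions for illustration, not part of the method described here.

```python
import numpy as np


def should_freeze(target_block, reference_block, h, margin_db=6.0):
    """Hypothetical freeze test for one block of samples.

    Returns True when the target channel carries markedly more energy than
    the current filter h predicts from the reference channel, which suggests
    the local speaker is active and the filter should not be updated.
    """
    target_block = np.asarray(target_block, dtype=float)
    reference_block = np.asarray(reference_block, dtype=float)
    # Crosstalk predicted by the current coupling filter
    predicted = np.convolve(reference_block, h)[:len(target_block)]
    eps = 1e-12  # avoids log of zero on silent blocks
    e_target = np.sum(target_block ** 2) + eps
    e_predicted = np.sum(predicted ** 2) + eps
    excess_db = 10.0 * np.log10(e_target / e_predicted)
    return bool(excess_db > margin_db)
```

A caller would simply skip the filter update for any block where this returns True and keep using the previous estimate.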
In general, basing the estimate of the coupling filter on a longer sample of the target and reference channels will result in a more accurate estimate - but only if the coupling is constant over that sample. If the coupling can change with time, the estimation time window needs to be a compromise between long duration (for an accurate estimate) and short duration (so that the filter is updated often enough to track the changing coupling, since a single filter estimated to account for time-varying coupling will not achieve high cancellation). Unfortunately, the meeting recorder scenario results in highly variable crosstalk coupling: whenever participants move their heads, both the pickup of their own voices by other microphones and, most dramatically, the pickup of the other speakers by their own head-mounted mic can change significantly. A brief consideration of behavior in meetings suggests that people very often turn to face someone just after that person begins to speak - i.e. just at the time we would want to be estimating the coupling of that voice to the now-moving microphone. Thus, we expect the cancellation problem to be challenging.
The examples below are drawn from the meeting of 2000-11-02-1440. From t=17 min in that meeting there is a period of high overlap that was selected by Thilo as a standard test set for overlap-handling techniques. First, we attempt to cancel one speaker from the lapel mic:
Audio: close (chan 3) - lapel (chan 0) - prediction 3»0 - residual (dynamic pred) - residual (static pred)
The top pane shows the clean 'reference' signal, from the head-mounted mic on the active speaker. The second pane is the original lapel mic signal, in which the crosstalk is clearly audible. The third pane is the result of block-adaptive echo cancellation, where the filter is re-estimated every 0.25 s (with a little temporal smoothing). The bottom pane shows the result of cancellation using a static filter, formed as the average filter over the 38 frames in the previous pane. Notice how the phrase starting just after t=2s bleeds through a little in the third pane for a few tens of ms until it is cancelled in the next block. It seems there was a change in the coupling at that time, with either source or listener moving. The average filter does slightly worse at cancelling overall (since it is a compromise), but manages to avoid this little blip. (Click on any pane to hear the signal. The lower three have been boosted by 20 dB to be more comparable to the reference. All signals have been pre-emphasized by [1 -0.95], so sound very 'trebly'.)
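The two schemes compared in these panes, a filter re-estimated every 0.25 s with some smoothing and a single static filter formed by averaging the per-block filters, can be sketched as follows. This is an illustrative reconstruction, not the code that produced the figures: the filter length, the one-pole smoothing and its coefficient, the small ridge term, and all function names are assumptions.

```python
import numpy as np


def estimate_filter(x, y, n_taps, ridge=1e-6):
    """Least-squares FIR estimate from one block (ridge term is an assumption)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    r_xx = np.array([np.dot(x[k:], x[:n - k]) for k in range(n_taps)])
    r_xy = np.array([np.dot(y[k:], x[:n - k]) for k in range(n_taps)])
    lags = np.abs(np.arange(n_taps)[:, None] - np.arange(n_taps)[None, :])
    # A small diagonal load keeps the solve stable on near-silent blocks
    R = r_xx[lags] + (ridge * r_xx[0] + 1e-12) * np.eye(n_taps)
    return np.linalg.solve(R, r_xy)


def block_adaptive_cancel(x, y, fs, n_taps=64, block_s=0.25, smooth=0.5):
    """Re-estimate the filter every block_s seconds and cancel block by block.

    Returns the residual and the (n_blocks, n_taps) array of smoothed filters.
    Trailing samples that do not fill a whole block are dropped.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    block = int(round(block_s * fs))
    n_blocks = len(y) // block
    filters = np.zeros((n_blocks, n_taps))
    residual = np.zeros(n_blocks * block)
    h = None
    for b in range(n_blocks):
        start, stop = b * block, (b + 1) * block
        h_new = estimate_filter(x[start:stop], y[start:stop], n_taps)
        # One-pole smoothing across blocks (hypothetical choice of smoother)
        h = h_new if h is None else smooth * h + (1.0 - smooth) * h_new
        filters[b] = h
        # Include earlier reference samples so the block start is filtered correctly
        lo = max(0, start - (n_taps - 1))
        pred = np.convolve(x[lo:stop], h)[start - lo:start - lo + block]
        residual[start:stop] = y[start:stop] - pred
    return residual, filters


def static_cancel(x, y, filters):
    """Cancel with the single filter obtained by averaging the per-block filters."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    h_avg = filters.mean(axis=0)
    return y - np.convolve(x, h_avg)[:len(y)], h_avg
```

With smoothing, an abrupt change in coupling is only tracked after a block or two, which is consistent with the brief bleed-through noted above. Which scheme leaves less residual overall depends on how much the coupling actually varies in the data.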
The change in optimal coupling is visible in the plot below:
Here, the top image shows the successive impulse responses drawn as 38 grayscale strips against a horizontal time axis. The second pane shows the average across all 38 rows, and the third pane shows its Fourier transform, i.e. the frequency-domain coupling of the active speaker (as recorded by his close-talk mic) to the lapel mic. If you look carefully at the top pane, you can see a change in the 'texture' just above t=2s. This corresponds to the poorly-cancelled blip in the previous spectrograms, and would appear to indicate the moment at which one of the participants moved, or at which the impulse response of the coupling otherwise changed slightly but abruptly.
[more examples to follow - what happens during real crosstalk?]
E. Woudenberg, F. Soong & B. Juang (1999). "A Block Least Squares approach to acoustic echo cancellation," Proc. ICASSP-99, Phoenix, vol. 2 pp. 869-872.