Efficient Pitch-based Estimation of VTLN Warp Factors

This web page accompanies the EUROSPEECH 2005 paper by Arlo Faria and David Gelbart.

Matlab source code

learnpdf.m: a script for learning a conditional probability distribution P(warp factor | pitch) from a set of observed mean pitches and likelihood estimates P(observations | warp factor).
mappredict.m: a script for MAP estimation of warp factors by combining pitch-based and ML estimates (as described in section 3.2 of the paper). A weight in the range [0, 1] may be specified: a weight of 0.5 gives equal weight to the likelihood and the prior (as in the paper), and a weight of 1.0 uses only the pitch-based prior.
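
As a rough illustration of the weighting described for mappredict.m, here is a minimal Python sketch (it is not the Matlab script). It assumes the weight interpolates log-linearly between the log-likelihoods and the log prior; the function name map_warp and its arguments are hypothetical, and section 3.2 of the paper defines the actual rule.

```python
def map_warp(warps, log_likelihoods, log_prior, weight=0.5):
    """Pick the warp factor with the best weighted log score.

    Assumed combination rule (an illustration, not taken from mappredict.m):
        score = (1 - weight) * log P(observations | warp)
                + weight * log P(warp | pitch)
    weight = 0.0 uses only the likelihood; weight = 1.0 uses only the prior.
    """
    if not (0.0 <= weight <= 1.0):
        raise ValueError("weight must lie in [0, 1]")
    if not (len(warps) == len(log_likelihoods) == len(log_prior)):
        raise ValueError("all inputs must have one entry per warp factor")
    scores = [(1.0 - weight) * ll + weight * lp
              for ll, lp in zip(log_likelihoods, log_prior)]
    # Index of the highest combined score (first one wins on ties).
    best = max(range(len(scores)), key=scores.__getitem__)
    return warps[best]
```

Under this assumed rule, a weight of 0.5 scales the sum of the two log terms by a constant, so it selects the same warp factor as an unweighted MAP estimate.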

Data

pitches-pu.list: a list of the mean pitches for the 6429 training set utterances. For the same list with the utterance filenames included, download pitches-pu-with-names.list.
pitches-ps.list: a list of the mean pitches, averaged per-speaker.
warppdfs-pu.list: a list of the acoustic likelihoods output by the reference GMM for each of the training utterances. The 6429 lines correspond to training set utterances in the same order as the pitch list above. The 16 log-scale likelihoods on each line correspond to that utterance warped at each warp factor from 0.70 to 1.30, at intervals of 0.04.
warppdfs-ps.list: a list of the acoustic likelihoods, using per-speaker normalization.
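
To make the layout of the likelihood files concrete, the following Python sketch rebuilds the 16-point warp factor grid and picks the maximum-likelihood warp factor from one line of a warppdfs list. It assumes the 16 values on a line are separated by whitespace or commas and are ordered from warp factor 0.70 upward; the function names warp_grid and ml_warp are hypothetical.

```python
def warp_grid(first=0.70, step=0.04, count=16):
    """Warp factors 0.70, 0.74, ..., 1.30 (rounded to avoid float drift)."""
    return [round(first + step * i, 2) for i in range(count)]


def ml_warp(line):
    """Return the warp factor with the highest log-likelihood on one line.

    Assumes one log-likelihood per grid point, in increasing warp order.
    """
    values = [float(tok) for tok in line.replace(",", " ").split()]
    grid = warp_grid()
    if len(values) != len(grid):
        raise ValueError("expected %d values, got %d" % (len(grid), len(values)))
    # Index of the largest log-likelihood (first one wins on ties).
    best = max(range(len(values)), key=values.__getitem__)
    return grid[best]
```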

Example Usage


>> pitches = load('pitches-ps.list');
>> likelihoods = load('warppdfs-ps.list');
>> conditional = learnpdf(pitches, likelihoods);
>> alphas = mappredict(pitches, likelihoods, conditional, 0.5)

Additional discussion

Here we provide some additional discussion that did not make it into the paper.

Gender-Dependent Acoustic Models

We have been asked how the proposed pitch-based estimation of VTLN warp factors would perform with gender-dependent acoustic models. We have not performed the experiments that would be necessary to answer this question, so we can only speculate. The range of common pitches in a single gender is much smaller than the range for both genders together, and so we expect a weaker correlation between pitch and VTL when considering one gender at a time. Our best guess is that pitch-based VTLN could still give accuracy gains over no VTLN in a gender-dependent system, but smaller ones than in a gender-independent system.

Variation of Pitch-Based VTLN Performance Between Speakers

We have shown good performance for pitch-based VTLN according to the metric of WER averaged over speakers. Horacio Franco asked us whether there might be particular speakers for whom pitch-based VTLN does not work well, since it exploits a relationship between human VTL and human pitch that is only a correlation rather than a deterministic rule. Whether pitch-based VTLN performance is more variable across speakers than ML VTLN performance is an interesting topic for future investigation. We considered using the standard deviation (across speakers) of per-speaker WER as a metric for this variability, but decided against it because in Numbers95 some speakers provide very little data (as little as a single digit), which makes per-speaker WER volatile. So, if this question is to be investigated by performance measurement, we think it should be done either with a different corpus or with a different metric (perhaps one which weights data points by the amount of speech they correspond to).
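
One way such a weighted metric could look is sketched below in Python: a standard deviation of per-speaker WER in which each speaker is weighted by their number of reference words. This is only an illustration of the idea, not a metric we have evaluated; the function name weighted_wer_spread and the choice of word counts as weights are assumptions.

```python
import math


def weighted_wer_spread(errors, words):
    """Word-weighted mean and standard deviation of per-speaker WER.

    errors[i] and words[i] are the error count and reference word count
    for speaker i. Speakers with little speech get proportionally little
    weight, so a single-digit speaker cannot dominate the spread.
    """
    total = sum(words)
    if total <= 0:
        raise ValueError("need at least one reference word")
    # The word-weighted mean of per-speaker WER equals the pooled WER.
    mean = sum(errors) / total
    variance = sum(n * (e / n - mean) ** 2
                   for e, n in zip(errors, words) if n > 0) / total
    return mean, math.sqrt(variance)
```

For example, a speaker with one word and one error has a per-speaker WER of 100%, but contributes only one word's worth of weight to the spread.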

Related work

Here we list some related work which we were not aware of when we submitted the paper.

Glavitsch 2003

Ulrike Glavitsch, "Speaker normalization with respect to F0: a perceptual approach", TIK Report Number 185, December, 2003.

Liu, Zheng and Wu 2006

Jian Liu, Thomas Fang Zheng and Wenhu Wu, "Pitch Mean Based Frequency Warping" in the book Chinese Spoken Language Processing, Springer-Verlag, Germany, 2006. (From the Proceedings of the 5th International Symposium on Chinese Spoken Language Processing (ISCSLP 2006), Singapore, December 13-16, 2006.)

Another Low-Computation VTLN Method

We have been told of a published method for low-computation VTLN in which spectral center-of-gravity measurements are used to calculate a VTLN warp factor. We do not have a citation for this; if we find one, we will add it here.