This web page accompanies the EUROSPEECH 2005 paper by Arlo Faria and David Gelbart.
>> pitches = load ('pitches-ps.list');
>> likelihoods = load ('warppdfs-ps.list');
>> conditional = learnpdf(pitches, likelihoods);
>> alphas = mappredict(pitches, likelihoods, conditional, 0.5)
Here we provide some additional discussion that did not make it into the paper.
We have been asked how the proposed pitch-based estimation of VTLN warp factors would perform with gender-dependent acoustic models. We have not performed the experiments that would be necessary to answer this question, so we can only speculate. The range of common pitches in a single gender is much smaller than the range for both genders together, and so we expect a weaker correlation between pitch and VTL when considering one gender at a time. Our best guess is that pitch-based VTLN could still give accuracy gains over no VTLN in a gender-dependent system, but smaller ones than in a gender-independent system.
We have shown good performance for pitch-based VTLN according to the metric of WER averaged over speakers. Horacio Franco asked us whether there might be particular speakers for whom pitch-based VTLN does not work well, because pitch-based VTLN attempts to exploit a relationship between human VTL and human pitch which is only a correlation rather than a deterministic rule. Indeed, the question of whether pitch-based VTLN performance is more variable across speakers than ML VTLN performance is an interesting topic for future investigation. We considered using the standard deviation (across speakers) of per-speaker WER as a metric for this variability, but decided against it because in Numbers95 some speakers provide very little data (as little as a single digit), which makes per-speaker WER volatile. So, if this question is to be investigated by performance measurement, we think it should be done either with a different corpus or with a different metric (perhaps one which weights each data point by the amount of speech it corresponds to).
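As a rough illustration of the weighted metric suggested above, the following MATLAB/Octave sketch compares the plain standard deviation of per-speaker WER with a version in which each speaker is weighted by the number of reference words he or she contributed. This is not something we have evaluated: the variable names and the per-speaker counts are hypothetical, chosen only to show how a speaker with a single word can dominate the unweighted figure, and word count is just one assumed proxy for amount of speech.

```matlab
% Hypothetical per-speaker counts (illustrative values only, not results
% from the paper). The first speaker contributes a single word.
errors = [1 12 30 4];      % word errors per speaker
words  = [1 150 400 80];   % reference words per speaker (assumed proxy
                           % for the amount of speech)

wer = errors ./ words;     % per-speaker WER
w   = words / sum(words);  % weights proportional to amount of speech

% Unweighted spread: every speaker counts equally, so the one-word
% speaker (WER = 1.0) has a large influence.
unweighted_std = std(wer);

% Weighted mean of per-speaker WER; with these weights it equals the
% pooled WER, sum(errors) / sum(words).
pooled_wer = sum(w .* wer);

% Weighted spread around the pooled WER (population-style, no
% small-sample correction).
weighted_std = sqrt(sum(w .* (wer - pooled_wer).^2));
```

With weights of this kind, a speaker who provides only one digit contributes almost nothing to the spread, which is the behavior the weighted metric is meant to have. Other weightings (for example, by duration of speech) would be equally reasonable.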
Here we list some related work which we were not aware of when we submitted the paper.
Ulrike Glavitsch, "Speaker normalization with respect to F0: a perceptual approach", TIK Report Number 185, December, 2003.
Jian Liu, Thomas Fang Zheng and Wenhu Wu, "Pitch Mean Based Frequency Warping" in the book Chinese Spoken Language Processing, Springer-Verlag, Germany, 2006. (From the Proceedings of the 5th International Symposium on Chinese Spoken Language Processing (ISCSLP 2006), Singapore, December 13-16, 2006.)
We have been told of a published method for low-computation VTLN in which spectral center-of-gravity measurements are used to calculate a VTLN warp factor. We do not have a citation for this, but if we find one we will add it here.