* Conversational Text-to-SQL: An Odyssey into State-of-the-Art and Challenges Ahead
SHK Parthasarathi, L Zeng, D Hakkani-Tur
To appear in Proceedings of IEEE ICASSP 2023.
Abstract
Conversational, multi-turn, text-to-SQL (CoSQL) tasks map natural language utterances in a dialogue to SQL queries. State-of-the-art (SOTA) systems use large, pre-trained and finetuned language models, such as the T5-family, in conjunction with constrained decoding. With multi-tasking (MT) over coherent tasks with discrete prompts during training, we improve over specialized text-to-SQL T5-family models. Based on Oracle analyses over n-best hypotheses, we apply a query plan model and a schema linking algorithm as rerankers...
[Read more]
* Fixed-point quantization aware training for on-device keyword-spotting
S Macha, O Oza, A Escott, F Caliva, R Armitano, SK Cheekatmalla, SHK Parthasarathi, Y Liu
To appear in Proceedings of IEEE ICASSP 2023.
Abstract
Fixed-point (FXP) inference has proven suitable for embedded devices with limited computational resources, yet model training is still performed in floating-point (FLP). FXP training has not been fully explored, and the non-trivial conversion from FLP to FXP introduces an unavoidable performance drop. We propose a novel method to train and obtain FXP convolutional keyword-spotting (KWS) models. We combine our methodology with two quantization-aware training (QAT) techniques - squashed weight distribution and absolute cosine regularization for model parameters - and propose techniques for extending QAT over transient variables, otherwise neglected by previous paradigms.
[Read more]
* N-Best Hypotheses Reranking for Text-To-SQL Systems
L Zeng, SHK Parthasarathi, D Hakkani-Tur
Proceedings of SLT 2022.
Abstract
On the well-established Spider dataset, we begin with Oracle studies: specifically, choosing an Oracle hypothesis from a SOTA model's 10-best list yields a 7.7% absolute improvement in both exact match (EM) and execution (EX) accuracy, showing significant potential improvements with reranking. Identifying coherence and correctness as reranking approaches, we design a model generating a query plan and propose a heuristic schema linking algorithm...
[Read more]
* Wakeword Detection under Distribution Shifts
SHK Parthasarathi, L Zeng, C Jose, J Wang
In Proceedings of TSD 2022.
Abstract
We propose a novel approach for semi-supervised learning (SSL) designed to overcome distribution shifts between training and real-world data arising in the keyword spotting (KWS) task. Shifts from training data distribution are a key challenge for real-world KWS tasks: when a new model is deployed on device, the gating of the accepted data undergoes a shift in distribution, making the problem of timely updates via subsequent deployments hard. Despite the shift, we assume that the marginal distributions on labels do not change.
[Read more]
* Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets
SHK Parthasarathi, L Zeng, Y Liu, A Escott, SK Cheekatmalla, N Strom, S Vitaladevuni
In Proceedings of TSD 2022.
Abstract
We propose a novel 2-stage sub 8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model. In the first stage, we adapt a recently proposed quantization technique using a non-linear transformation with tanh(.) on dense layer weights. In the second stage, we use linear quantization methods on the rest of the network, including other parameters (bias, gain, batchnorm), inputs, and activations.
[Read more]
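The first stage's tanh(.) squashing can be illustrated with a minimal sketch; the symmetric uniform quantizer and the 8-bit default below are illustrative assumptions, not the paper's exact configuration.

```python
import math

def squash_and_quantize(weights, bits=8):
    """Squash dense-layer weights into (-1, 1) with tanh, then quantize.

    A sketch of tanh-based weight squashing before fixed-point
    quantization; the symmetric uniform quantizer here is an
    illustrative assumption, not the paper's exact recipe.
    """
    levels = 2 ** (bits - 1) - 1        # e.g. 127 steps per side for 8 bits
    quantized = []
    for w in weights:
        s = math.tanh(w)                # bounded in (-1, 1), so no clipping needed
        quantized.append(round(s * levels) / levels)
    return quantized

wq = squash_and_quantize([-2.5, -0.3, 0.0, 0.4, 3.0])
```

Because tanh bounds every weight inside (-1, 1), the quantizer's range is fixed in advance, which is what makes the subsequent fixed-point conversion well behaved.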
* Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models
J Liu, RV Swaminathan, SHK Parthasarathi, C Lyu, A Mouchtaris, S Kunzmann
In Proceedings of TSD 2021.
Abstract
We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM) with experiments spanning over 3000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small footprint setting...
[Read more]
* Lessons from Building Acoustic Models with a Million Hours of Speech
SHK Parthasarathi, N Strom
In Proceedings of ICASSP 2019.
Abstract
This is a report of our lessons learned building acoustic models from 1 Million hours of unlabeled speech, while labeled speech is restricted to 7,000 hours. We employ student/teacher training on unlabeled data, helping scale out target generation in comparison to confidence model based methods, which require a decoder and a confidence model. To optimize storage and to parallelize target generation...
[Read more]
* Realizing Petabyte Scale Acoustic Modeling
SHK Parthasarathi, N Sivakrishnan, P Ladkat, N Strom
In IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019.
Abstract
Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic models (AM)...
[Read more]
* Two Tiered Distributed Training Algorithm for Acoustic Modeling
P Ladkat, O Rybakov, R Arava, SHK Parthasarathi, IF Chen, N Strom
To appear in Proceedings of Interspeech 2019.
Abstract
We present a hybrid approach for scaling distributed training of neural networks by combining the Gradient Threshold Compression (GTC) algorithm, a variant of stochastic gradient descent (SGD) that compresses gradients with thresholding and quantization techniques, with the Blockwise Model Update Filtering (BMUF) algorithm, a variant of model averaging (MA).
[Read more]
* Improving Noise Robustness of Automatic Speech Recognition via Parallel Data and Teacher-Student Learning
L Mosner, M Wu, A Raju, SHK Parthasarathi, K Kumatani, S Sundaram, R Maas, B Hoffmeister
In Proceedings of ICASSP 2019.
Abstract
For real-world speech recognition applications, noise robustness is still a challenge. In this work, we adopt the teacher-student (T/S) learning technique using a noisy corpus to improve automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method which only preserves the k highest values, to prevent wrong emphasis of knowledge from the teacher and to reduce the bandwidth needed for transferring data...
[Read more]
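The logits selection step can be sketched as follows; the mask value, class count, and k are illustrative assumptions.

```python
import math

def select_top_k_logits(teacher_logits, k=3, mask=-1e9):
    """Keep only the k largest teacher logits, masking the rest.

    A sketch of the logits-selection idea: preserving only the k
    highest values avoids wrong emphasis from low-scoring classes
    and reduces the data that must be transferred. The mask value
    is an illustrative assumption.
    """
    ranked = sorted(range(len(teacher_logits)),
                    key=lambda i: teacher_logits[i], reverse=True)
    keep = set(ranked[:k])
    return [x if i in keep else mask for i, x in enumerate(teacher_logits)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Soft targets now place essentially all probability mass on the k kept classes.
soft_targets = softmax(select_top_k_logits([2.0, -1.0, 0.5, 3.0, -4.0, 1.0]))
```

Only the k kept (index, logit) pairs need to be stored or transferred per frame, which is where the bandwidth saving comes from.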
* Robust Speech Recognition Via Anchor Word Representations
B King, I Chen, Y Vaizman, Y Liu, R Maas, SHK Parthasarathi, B Hoffmeister
In Proceedings of Interspeech 2017.
Abstract
A challenge for speech recognition for voice-controlled household devices, like the Amazon Echo or Google Home, is robustness against interfering background speech. Formulated as a far-field speech recognition problem, another person or media device in proximity can produce background speech that can interfere with the device-directed speech. We expand on our previous work on device-directed speech detection in the far-field speech setting and introduce two approaches for robust acoustic modeling. Both methods are based on the idea of using an anchor word taken from the device-directed speech...
[Read more - PDF]
* Anchored Speech Detection
R Maas, SHK Parthasarathi, B King, R Huang, B Hoffmeister
In Proceedings of Interspeech 2016.
Abstract
We propose two new methods of speech detection in the context of voice-controlled far-field appliances... In the first method, we estimate the mean of the anchor word segment and subtract it from the subsequent feature vectors. In the second, we use an encoder-decoder network with features that are normalized by applying conventional log amplitude causal mean subtraction...
[Read more - PDF]
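The first method (anchor-segment mean subtraction) admits a minimal sketch; frames here are plain per-dimension feature vectors, and the toy dimensions are illustrative.

```python
def anchor_mean_normalize(anchor_frames, later_frames):
    """Subtract the anchor word segment's feature mean from later frames.

    A sketch of the first method: the anchor word (e.g. the wake word)
    is assumed device-directed, so its per-dimension mean serves as a
    speaker/channel reference for normalizing subsequent feature vectors.
    """
    dims = len(anchor_frames[0])
    mean = [sum(f[d] for f in anchor_frames) / len(anchor_frames)
            for d in range(dims)]
    return [[f[d] - mean[d] for d in range(dims)] for f in later_frames]

anchor = [[1.0, 2.0], [3.0, 4.0]]   # toy 2-dim frames of the anchor word
normalized = anchor_mean_normalize(anchor, [[2.0, 3.0], [0.0, 0.0]])
```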
* fMLLR based feature-space speaker adaptation of DNN acoustic models
SHK Parthasarathi, B Hoffmeister, S Matsoukas, A Mandal, N Strom, S Garimella
Proceedings of Interspeech 2015.
Abstract
We investigate the problem of speaker adaptation of DNN acoustic models in two settings: the traditional unsupervised adaptation and a supervised adaptation (SuA) where a few minutes of transcribed speech is available. SuA presents additional difficulties when a test speaker's adaptation information does not match the registered speaker's information. Employing feature-space maximum likelihood linear regression (fMLLR) transformed features as side-information to the DNN...
[Read more - PDF]
* Robust i-vector based Adaptation of DNN Acoustic Model for Speech Recognition
S Garimella, A Mandal, N Strom, B Hoffmeister, S Matsoukas, SHK Parthasarathi
Proceedings of Interspeech 2015.
Abstract
In the past, conventional i-vectors based on a Universal Background Model (UBM) have been successfully used as input features to adapt a Deep Neural Network (DNN) Acoustic Model (AM) for Automatic Speech Recognition (ASR). In contrast, this paper introduces Hidden Markov Model (HMM) based i-vectors that use HMM state alignment information from an ASR system for estimating i-vectors. Further, we propose passing these HMM based i-vectors through an explicit non-linear hidden layer of a DNN before combining them with standard acoustic features...
[Read more - PDF]
* An Introduction to Computational Networks and the Computational Network Toolkit
A Agarwal, E Akchurin, C Basoglu, G Chen, S Cyphers, J Droppo, A Eversole, B Guenter, M Hillebrand, R Hoens, X Huang, Z Huang, V Ivanov, A Kamenev, P Kranen, O Kuchaiev, W Manousek, A May, B Mitra, O Nano, G Navarro, A Orlov, M Padmilac, SHK Parthasarathi, B Peng, A Reznichenko, F Seide, M Seltzer, M Slaney, A Stolcke, H Wang, Y Wang, K Yao, D Yu, Y Zhang, G Zweig (in alphabetical order).
Microsoft Research Technical Report MSR-TR-2014-112, 2014.
[Read more - PDF]
* The Blame Game in meeting room ASR: An analysis of feature versus model errors in noisy and mismatched conditions
SHK Parthasarathi, SY Chang, J Cohen, N Morgan, S Wegmann
Proceedings of ICASSP 2013. Vancouver, Canada. May 2013.
Abstract
Given a test waveform, state-of-the-art ASR systems extract a sequence of MFCC features and decode them with a set of trained HMMs. When this test data is clean, and it matches the condition used for training the models, then there are few errors. While it is known that ASR systems are brittle in noisy or mismatched conditions, there has been little work in quantitatively attributing the errors to features or to models. This paper attributes the sources of these errors in three conditions: (a) matched near-field, (b) matched far-field, and (c) a mismatched condition. We undertake a series of diagnostic analyses employing the bootstrap method to probe a meeting room ASR system...
[Read more - PDF]
* Exploiting Innocuous Activity for Correlating Users Across Sites
O Goga, H Lei, SHK Parthasarathi, G Friedland, R Sommer and R Teixeira
Proceedings of World Wide Web 2013. Rio de Janeiro, Brazil. May 2013.
Abstract
We study how potential attackers can identify accounts on different social network sites that all belong to the same user, exploiting only innocuous activity that inherently comes with posted content. We examine three specific features on Yelp, Flickr, and Twitter: the geo-location attached to a user's posts, the timestamp of posts, and the user's writing style as captured by language models...
[Read more - PDF]
* Wordless Sounds: Robust Speaker Diarization using Privacy-Preserving Audio Representations
SHK Parthasarathi, H Bourlard and D Gatica-Perez
In IEEE Transactions on Audio, Speech, and Language Processing, 21(1), 2013.
Abstract
This paper investigates robust privacy-sensitive audio features for speaker diarization in multiparty conversations, i.e., a set of audio features having low linguistic information, for speaker diarization in single and multiple distant microphone scenarios. We systematically investigate the Linear Prediction (LP) residual. (...). Next, we propose a supervised framework using a deep neural architecture for deriving privacy-sensitive audio features...
[Read more - PDF]
* LP Residual Features for Robust, Privacy-Sensitive Speaker Diarization
SHK Parthasarathi, H Bourlard and D Gatica-Perez
Proceedings of Interspeech 2011. Florence, Italy. Aug 2011.
Abstract
We present a comprehensive study of the linear prediction residual for speaker diarization in single and multiple distant microphone conditions in privacy-sensitive settings, a requirement for analyzing a wide range of spontaneous conversations. Two representations of the residual are compared, namely the real cepstrum and MFCC, with the latter performing better...
[Read more - PDF]
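The LP residual studied above can be sketched with a plain autocorrelation-method inverse filter; the model order, lack of windowing, and toy signal are illustrative simplifications, not the paper's exact front end.

```python
def lp_coefficients(signal, order):
    """Estimate LP coefficients (a[0] = 1) via autocorrelation and Levinson-Durbin."""
    n = len(signal)
    r = [sum(signal[i] * signal[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = sum(a[j] * r[m - j] for j in range(m))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, m):
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a

def lp_residual(signal, order=10):
    """Inverse-filter the signal with its own LP coefficients.

    The residual keeps excitation-source information while the smooth
    spectral envelope (and much of the linguistic content) is removed,
    which is the property the privacy-sensitive features exploit.
    """
    a = lp_coefficients(signal, order)
    return [sum(a[j] * signal[i - j] for j in range(order + 1))
            for i in range(order, len(signal))]
```

For a signal that is well modeled by a low-order all-pole filter, the residual is close to zero; for speech it retains mainly the glottal excitation structure.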
* Speaker Change Detection with Privacy-Preserving Audio Cues
SHK Parthasarathi, M M.-Doss, D Gatica-Perez and H Bourlard
Proceedings of ICMI-MLMI 2009. MIT Media Labs, Cambridge, US. Nov 2009.
Abstract
In this work we investigate a set of privacy-sensitive audio features for speaker change detection (SCD) in multiparty conversations. These features are based on three different principles: characterizing the excitation source information using linear prediction residual, characterizing subband spectral information shown to contain speaker information, and characterizing the general shape of the spectrum...
[Read more - PDF]
* Privacy-Sensitive Audio Features for Speech/Nonspeech Detection
SHK Parthasarathi, D Gatica-Perez, H Bourlard and M M.-Doss
In IEEE Transactions on Audio, Speech, and Language Processing, 19(8), 2011.
Abstract
The goal of this paper is to investigate features for speech/nonspeech detection (SND) having low linguistic information from the speech signal. Towards this, we present a comprehensive study of privacy-sensitive features for SND in multiparty conversations. Our study investigates three different approaches to privacy-sensitive features...
[Read more - PDF]
* Robustness of Group Delay Representations for Noisy Speech Signals
SHK Parthasarathi, P Rajan, and H A Murthy
In IJST (Springer), 14(4), 2011.
Abstract
This paper demonstrates the robustness of group delay based features to additive noise. First, we analytically show the robustness of group delay based representations. The analysis makes use of the fact that, for minimum-phase signals, the group delay function can be represented in terms of the cepstral coefficients of the log-magnitude spectrum. Such a representation results in the speech spectrum dominating over the noise spectrum, at both low and high SNRs...
[Read more - PDF]
* Evaluating the Robustness of Privacy-Sensitive Audio Features for Speech Detection in Personal Audio Log Scenarios
SHK Parthasarathi, M M.-Doss, H Bourlard and D Gatica-Perez
Proceedings of ICASSP 2010. Dallas, US. March 2010.
Abstract
Personal audio logs are often recorded in multiple environments. This poses challenges for robust front-end processing, including speech/nonspeech detection (SND). Motivated by this, we investigate the robustness of four different privacy-sensitive features for SND, namely energy, zero crossing rate, spectral flatness, and kurtosis...
[Read more - PDF]
* Investigating privacy-sensitive features for speech detection in multiparty conversations
SHK Parthasarathi, M M.-Doss, H Bourlard and D Gatica-Perez
Proceedings of Interspeech 2009. Brighton, UK. September 2009.
Abstract
We investigate four different privacy-sensitive features, namely energy, zero crossing rate, spectral flatness, and kurtosis, for speech detection in multiparty conversations. We liken this scenario to a meeting room and define our datasets and annotations accordingly. The temporal context of these features is modeled...
[Read more - PDF]
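The four features above are cheap frame-level statistics; a self-contained sketch (using a naive DFT for spectral flatness, an illustrative choice made for self-containment) looks like:

```python
import math

def frame_features(frame):
    """Compute energy, zero crossing rate, spectral flatness, and kurtosis.

    A sketch of the privacy-sensitive features: scalar statistics that
    convey speech/non-speech evidence while carrying little word-level
    (linguistic) content. The naive DFT is fine for short frames.
    """
    n = len(frame)
    energy = sum(x * x for x in frame) / n
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (n - 1)
    mean = sum(frame) / n
    var = sum((x - mean) ** 2 for x in frame) / n
    kurtosis = (sum((x - mean) ** 4 for x in frame) / n) / (var * var) if var > 0 else 0.0
    # magnitude spectrum via naive DFT over the positive-frequency bins
    mags = []
    for k in range(1, n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im) + 1e-12)
    geo_mean = math.exp(sum(math.log(m) for m in mags) / len(mags))
    flatness = geo_mean / (sum(mags) / len(mags))   # ~1 for noise, ~0 for tones
    return energy, zcr, flatness, kurtosis
```

A pure tone gives low flatness and kurtosis near 1.5, while broadband noise pushes flatness toward 1; these contrasts are what make the features usable for speech detection.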
* Robustness of Phase based Features for Speaker Recognition
P Rajan, SHK Parthasarathi, and H A Murthy
Proceedings of Interspeech 2009. Brighton, UK. September 2009.
Abstract
This work demonstrates the robustness of group-delay based features for speech processing. An analysis of group delay functions is presented which shows that these features retain formant structure even in noise. Furthermore, a speaker verification task performed on the NIST 2003 database shows lower error rates...
[Read more - PDF]
* Exploiting contextual information for speech/non-speech detection
SHK Parthasarathi, P Motlicek, and H Hermansky
Proceedings of TSD 2008, LNCS/LNAI series, Springer-Verlag. Brno, Czech Republic. September 2008.
Abstract
In this paper, we investigate the effect of temporal context for speech/non-speech detection (SND). It is shown that even a simple feature such as full-band energy, when employed with a large-enough context, shows promise for further investigation...
[Read more - PDF]
* A Pattern Recognition Approach to VAD using modified group delay
P Rajan, SHK Parthasarathi, and H A Murthy
Proceedings of NCC 2008. IIT Bombay, India. January 2008.
Abstract
This paper explores the use of phase-based features (in particular, group delay) for voice activity detection (VAD). We establish via theoretical analysis the robustness of the group delay function in noise. Based on this, we extract group delay based features and pose the VAD problem as a two-class classification task...
[Read more - PDF]
* Design and Development of a Text-To-Speech Synthesizer for Indian Languages
Y R Venugopalakrishna, SHK Parthasarathi, S Thomas, K Bommepally, K Jayanthi, H Raghavan, S Murarka, H A Murthy
Proceedings of NCC 2008. IIT Bombay, India. January 2008.
Abstract
This paper describes the design and implementation of a unit selection based text-to-speech synthesizer with syllables and polysyllables as units of concatenation. The choice of the syllable as a unit for Indian languages is appropriate as Indian languages are syllable-centered. Although syllable based synthesis does not require significant prosodic modification...
[Read more - PDF]
* Voice Activity Detection using Group Delay Processing on Buffered Short-term Energy
SHK Parthasarathi, P Rajan, and H A Murthy
Proceedings of NCC 2007. IIT Kanpur, India. January 2007.
Abstract
In this paper, we present an algorithm for Voice Activity Detection (VAD) in speech signals using the minimum phase group delay function. The proposed method considers a buffer consisting of contiguous frames of the given signal and computes the short-term energy (STE) for that buffer. By appending a surrogate signal to STE and viewing the resultant signal as a positive part of the magnitude spectrum of an arbitrary signal...
[Read more - PDF]
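The buffered-STE construction can be sketched as follows. The cepstral route to the minimum-phase group delay, the truncation order, the mirroring used as the surrogate step, and the frame length are illustrative assumptions rather than the paper's exact algorithm.

```python
import math

def min_phase_group_delay(magnitude, num_ceps=20):
    """Group delay of the minimum-phase signal having this magnitude spectrum.

    Uses the cepstral relation for minimum-phase signals: the group
    delay is a cosine series weighted by 2 * q * c[q], where c is the
    real cepstrum of the log-magnitude spectrum. Truncation order is
    an illustrative assumption.
    """
    n = len(magnitude)
    num_ceps = min(num_ceps, n // 2)
    logm = [math.log(m + 1e-12) for m in magnitude]
    # real cepstrum of the log magnitude (inverse DFT, real part)
    c = [sum(logm[k] * math.cos(2 * math.pi * q * k / n) for k in range(n)) / n
         for q in range(num_ceps)]
    return [sum(2 * q * c[q] * math.cos(2 * math.pi * q * k / n)
                for q in range(1, num_ceps))
            for k in range(n)]

def ste_group_delay_vad_scores(signal, frame_len=160):
    """Score a buffer of frames by group-delay processing of its STE sequence.

    A sketch of the buffered-STE idea: short-term energies are mirrored
    into a symmetric "magnitude spectrum" (the surrogate step) and the
    minimum-phase group delay is computed; speech regions appear as
    well-defined peaks in the result.
    """
    ste = [sum(x * x for x in signal[i:i + frame_len])
           for i in range(0, len(signal) - frame_len + 1, frame_len)]
    spectrum = ste + ste[::-1]
    return min_phase_group_delay(spectrum)
```

A flat STE buffer (no speech) yields an essentially zero group delay trace, while energy bursts produce localized peaks that can be thresholded for the speech/non-speech decision.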
* Robust Voice Activity Detection using Group Delay Functions
SHK Parthasarathi, P Rajan, and H A Murthy
Proceedings of IEEE ICIT 2006. IIT Bombay, India. December 2006.
Abstract
In this paper, we present an algorithm for Voice Activity Detection (VAD) in speech signals with very low SNR. In the proposed algorithm, the short-term energy of the speech signal is viewed as the positive frequency part of the magnitude spectrum of a minimum phase signal. The group delay of this signal is then computed. The speech regions of the signal are characterized by well-defined peaks in the group delay spectrum...
[Read more - PDF]