Comparing speaker-dependent and speaker-adaptive acoustic models for recognizing dysarthric speech

Frank Rudzicz · 2007 · Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '07) · doi:10.1145/1296843.1296899

Summary

This short ASSETS 2007 poster by Frank Rudzicz (University of Toronto) compares two strategies for building automatic speech recognition (ASR) acoustic models for people with dysarthria, a set of motor speech disorders that produces speech with high intra- and inter-speaker variability and that mainstream ASR systems trained on able-bodied speakers handle poorly. Speaker-dependent (SD) models are trained from scratch on a single dysarthric speaker; speaker-adaptive (SA) models start from a baseline trained on a large able-bodied corpus (here, the Wall Street Journal, WSJ) and are then adjusted with that speaker's data. Prior work by Raghavendra, Rosengren and Hunnicutt (2001), Noyes and Frankish (1992), and Sawhney and Wheeler (1999) had reported mixed conclusions, often based on only a handful of test subjects; the prevailing view was that SA suits mild and moderate dysarthria while SD suits severe dysarthria.

Rudzicz uses the Nemours database of 11 male dysarthric speakers (each producing 74 syntactically constrained nonsense sentences of the form "The N0 is V-ing the N1") plus one non-dysarthric control. Speakers are stratified as mild, moderate, or severe according to the recognition rate that the baseline WSJ model achieves on their speech. Both SD and SA models use left-to-right triphone hidden Markov models (HMMs) with Gaussian-mixture output densities, Viterbi decoding over a lexical tree augmented with a context-free grammar (CFG), and iterative Baum-Welch training; the number of Gaussians and the amount of training data are varied independently.
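To make the SD/SA distinction concrete, the sketch below contrasts the two ways of estimating a single Gaussian mean. The summary does not say which adaptation technique the paper used, so maximum a posteriori (MAP) mean adaptation appears here purely as a familiar stand-in. The function names, the prior weight `tau`, and all numbers are hypothetical.

```python
from statistics import fmean


def sd_mean(speaker_frames):
    """Speaker-dependent estimate: use only the target speaker's data."""
    return fmean(speaker_frames)


def map_adapted_mean(baseline_mean, speaker_frames, tau=10.0):
    """Speaker-adaptive estimate via MAP mean adaptation (illustrative stand-in).

    Interpolates between the baseline mean and the speaker's data.
    tau is a hypothetical prior weight: larger values trust the baseline more.
    """
    n = len(speaker_frames)
    return (tau * baseline_mean + sum(speaker_frames)) / (tau + n)


# Hypothetical one-dimensional feature values, not taken from the paper.
baseline = 0.0                   # mean learned from a large able-bodied corpus
few_frames = [2.0, 2.4, 1.6]     # very little data from the target speaker
many_frames = few_frames * 100   # the same speaker with far more data

print(sd_mean(few_frames))
print(map_adapted_mean(baseline, few_frames))
print(map_adapted_mean(baseline, many_frames))
```

With only a few frames the adapted mean stays close to the baseline; as speaker data accumulates it converges towards the speaker-dependent estimate. This is one way to read the claim that an adapted model can borrow structure that a small dysarthric corpus cannot supply on its own.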

Key findings

Increasing the amount of training data from 20 to 132 sentences per speaker did not produce reliable gains: accuracy fluctuated by about 3%. This suggests that Nemours does not contain enough material to capture each speaker's intra-speaker variability, and that earlier studies with few subjects probably needed more data too. Accuracy did increase monotonically with the number of Gaussian mixture components for both the mild and severe groups. The headline result is that, contrary to earlier reports, the speaker-adaptive model beat the speaker-dependent model in every group except the most severe, with relative error reductions of 23.1% in the mild group, 4.9% in the moderate group, and 30.7% for the non-dysarthric control. For severe speakers, SD only narrowly beat the WSJ baseline.

The most common phonemic errors in the baseline were the substitutions /ng/→/n/ (125 occurrences), /t/→/uw/ (87), and /ey/→/ih/ (84), together with deletions of the consonants /b/, /s/, /w/, /f/, and /l/. This suggests that better robustness to consonant variation would generalise across dysarthric speakers. Planned future work included adding electromagnetic articulography data, n-gram language modelling, and discriminative classifiers, including recurrent neural networks.
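The relative error reductions quoted above can be read with the following minimal sketch. It assumes that error is simply one minus accuracy and that the SD model is the reference system; the summary does not state the exact error measure, and the accuracies below are hypothetical rather than taken from the paper.

```python
def relative_error_reduction(reference_accuracy, improved_accuracy):
    """Fraction of the reference system's errors removed by the improved system.

    Both arguments are accuracies in [0, 1]; error is taken as 1 - accuracy.
    """
    reference_error = 1.0 - reference_accuracy
    improved_error = 1.0 - improved_accuracy
    if reference_error <= 0.0:
        raise ValueError("reference system has no errors to reduce")
    return (reference_error - improved_error) / reference_error


# Hypothetical accuracies chosen for illustration; they are not from the paper.
sd_accuracy = 0.60   # speaker-dependent model (reference)
sa_accuracy = 0.70   # speaker-adaptive model
print(f"{relative_error_reduction(sd_accuracy, sa_accuracy):.1%}")
```

In this made-up case a 10-point absolute gain in accuracy removes a quarter of the reference model's errors. The same absolute gain counts for more when the reference is already accurate, which is worth keeping in mind when comparing the figures for the control speaker with those for the dysarthric groups.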

Relevance

For accessibility practitioners and AT researchers, this paper is a useful early data point in a research programme that ultimately reshaped how the field thinks about adapting mainstream speech models for atypical speech; Rudzicz went on to build the much larger TORGO database and a substantial body of work on dysarthric ASR. The practical takeaway is counter-intuitive: starting from a large able-bodied baseline and adapting it usually beats training from scratch on the disabled speaker, even when the baseline is a poor initial fit, because the baseline carries phonetic structure that small dysarthric corpora cannot supply. This argues against the once-common assumption that disabled users need bespoke models and supports today's practice of fine-tuning large general-purpose speech models on small disability-specific datasets (cf. Google's Project Euphonia; Apple's Personal Voice applies a similar small-data personalisation idea to speech synthesis rather than recognition). Limitations are substantial: only 11 male speakers, a single small corpus, syntactically constrained sentences, no live user evaluation, and no comparison against discriminative or neural models. All of these were addressed in subsequent work by the author and others.

Tags: dysarthria · automatic speech recognition · acoustic model · speaker adaptation · hidden Markov model · phoneme · speech intelligibility · motor speech disorders · machine learning · augmentative and alternative communication · speech accessibility · voice interface