
Predicting the Understandability of Imperfect English Captions for People Who Are Deaf or Hard of Hearing

Sushant Kafle, Matt Huenerfauth · 2019 · ACM Transactions on Accessible Computing (TACCESS) · doi:10.1145/3325862

Summary

This paper tackles a fundamental measurement problem in ASR-based captioning for Deaf and Hard-of-Hearing (DHH) users: the standard Word Error Rate (WER) metric correlates only weakly with how DHH users actually perceive caption quality. WER treats all word errors as equally harmful, but the impact of an error depends on which word was misrecognized and how semantically different the recognized word is from the intended one. The authors develop and refine the Automatic Caption Evaluation (ACE) metric across four research phases. The metric combines two core components: a word-importance score (how critical the misrecognized word is to sentence meaning, computed using neural language models) and a semantic distance score (how different the recognized word is from the intended word, computed using word2vec embeddings). Phase 1 established the original ACE metric and demonstrated its superiority over WER in a user study of DHH participants' caption preferences. Phase 2 systematically evaluated improvements to both components, testing n-gram, neural network, and TF-IDF models for word importance, as well as different aggregation strategies for combining per-error scores into sentence-level quality scores. This produced ACE2, which uses a neural language model for word importance and a novel error-spread aggregation method that accounts for how the impact of one error radiates to affect comprehension of surrounding words.
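To make the two components concrete, here is a minimal Python sketch of how a per-error impact score might combine word importance with semantic distance. It is not the published ACE or ACE2 formula: multiplying the two scores is an assumption made for illustration, the importance values are hand-picked rather than produced by a language model, and the three-dimensional vectors in `TOY_VECTORS` are invented stand-ins for real word2vec embeddings. The names `cosine_distance`, `error_impact`, and `TOY_VECTORS` are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def error_impact(importance, intended_vec, recognized_vec):
    """Illustrative impact of one substitution error.

    ASSUMPTION: impact = importance * semantic distance. The paper
    combines the same two ingredients, but its exact formula may differ.
    """
    return importance * cosine_distance(intended_vec, recognized_vec)


# Invented toy embeddings (NOT real word2vec vectors).
TOY_VECTORS = {
    "kitchen": [0.9, 0.1, 0.0],
    "kitten": [0.1, 0.9, 0.2],
    "the": [0.3, 0.3, 0.3],
    "a": [0.3, 0.3, 0.25],
}

# An important content word replaced by a semantically distant word...
high = error_impact(0.8, TOY_VECTORS["kitchen"], TOY_VECTORS["kitten"])
# ...versus an unimportant function word replaced by a near-synonym.
low = error_impact(0.05, TOY_VECTORS["the"], TOY_VECTORS["a"])

print(f"kitchen -> kitten: {high:.3f}")
print(f"the -> a:          {low:.5f}")
```

WER would count both substitutions as one error each. A score of this shape separates them, which is the intuition behind ACE.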

Key findings

ACE2 achieved a Spearman correlation of 0.866 with DHH users' subjective quality judgements in the PREFERENCE-2017 dataset, far above WER's correlation of just 0.108 on the same dataset. In Phase 3, ACE2 also outperformed six other published ASR evaluation metrics, including Human Perceived Accuracy (rho = 0.730), Word Information Lost (rho = 0.789), and Weighted WER (rho = 0.742). Phase 4 validated the metric in a new summative study (PREFERENCE-2018), in which 12 DHH participants evaluated 60 caption texts generated by three ASR systems (Google Cloud Speech, IBM Watson, and CMUSphinx). ACE2 again showed significantly higher correlation with DHH judgements than the original ACE metric (r = 0.5519 vs. 0.3927, p < 0.05). The error-spread aggregation method was the key innovation: it models how a single misrecognized word degrades comprehension of its neighbours. For instance, misrecognizing "kitchen" as "kitten" makes the subsequent word "area" even harder to interpret because a contextual cue has been lost. Two factors proved most predictive of error impact: the importance of the misrecognized word within the sentence, and the semantic distance between the error word and the intended word.
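The correlations above measure how well a metric's ranking of captions agrees with the ranking implied by human judgements. As a rough sketch of that comparison, the Python below computes Spearman's rho as the Pearson correlation of average ranks. The function names and the example scores are hypothetical and are not data from the paper. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: predicted caption quality vs. mean human rating.
predicted_quality = [0.91, 0.40, 0.75, 0.12, 0.66]
human_rating = [4.5, 2.0, 4.0, 1.5, 3.0]
print(f"rho = {spearman(predicted_quality, human_rating):.3f}")
```

Because only ranks matter, a metric can reach a high rho without its scores being on the same scale as the human ratings. That makes it a natural choice for comparing metrics as different as WER and ACE2.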

Relevance

This research has direct implications for anyone deploying ASR-based captioning for DHH users in classrooms, meetings, conferences, or media. The finding that WER correlates at only 0.108 with DHH users' subjective quality judgements is a stark warning: optimizing ASR systems for WER may produce systems that score well on paper but serve DHH users poorly. ACE2 provides a validated alternative metric that developers and researchers can use to evaluate and compare captioning systems from the DHH user perspective. For organizations choosing between ASR providers for captioning services, ACE2 could inform procurement decisions based on predicted understandability for DHH users rather than raw word accuracy. The research also reinforces a broader principle from the Kafle et al. AI fairness paper: standard evaluation metrics designed for general populations may fundamentally misalign with disabled users' needs, and developing disability-specific metrics is essential for building truly accessible AI systems.

Tags: automatic speech recognition · captioning · deaf and hard of hearing · evaluation metrics · word error rate · caption quality · word importance · semantic distance · NLP