Evaluating the Usability of Automatically Generated Captions for People who are Deaf or Hard of Hearing
Sushant Kafle, Matt Huenerfauth · 2017 · Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '17) · doi:10.1145/3132525.3132542
Summary
This paper addresses a fundamental problem in automatic captioning for Deaf and Hard of Hearing (DHH) users: Word Error Rate (WER), the standard metric used to evaluate automatic speech recognition (ASR) systems, poorly predicts how usable the resulting captions actually are for DHH readers. WER counts word-level errors (substitutions, deletions, and insertions) and normalizes the total by the number of words in the reference transcript, treating all errors equally. However, prior research has shown that different errors have vastly different impacts on comprehension: some are easily recoverable from context, while others completely destroy meaning. The authors propose a new metric called Automated-Caption Evaluation (ACE) that considers two factors WER ignores: word predictability (how easily a reader can infer the correct word from surrounding context, measured using n-gram language models and entropy) and semantic distance (how far the error word's meaning deviates from the intended word, measured using word2vec). The metric weights these factors to produce an impact score for each error, with the maximum error impact determining the overall sentence score. The rationale draws on literacy research showing that many deaf adults use a "keyword" reading strategy, focusing on high-content words to derive sentence meaning. Under that strategy, errors on unpredictable, semantically important words are far more damaging than errors on easily guessed function words.
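To make the limitation of WER concrete, here is a minimal sketch of the standard WER computation (word-level edit distance divided by reference length). The function name `word_error_rate` and the example sentences are illustrative assumptions, not material from the paper. The two hypotheses each contain a single substitution, so they get the same WER, even though one corrupts a function word and the other corrupts a keyword.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (invented for this sketch, not from the study).
reference = "the meeting is moved to friday"
hyp_function_word = "a meeting is moved to friday"   # error on a function word
hyp_keyword = "the meeting is moved to fried"        # error on a keyword

wer_a = word_error_rate(reference, hyp_function_word)
wer_b = word_error_rate(reference, hyp_keyword)
print(wer_a == wer_b)  # both hypotheses have one substitution in six words
```

Because WER only counts edits, it cannot distinguish these two cases; this is the gap that a predictability- and meaning-aware metric such as ACE is meant to close.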
Key findings
In a user study with 30 DHH participants (14 deaf, 8 Deaf, 8 hard-of-hearing, mean age 23.5) from Rochester Institute of Technology, participants evaluated 45 pairs of ASR-generated caption texts. Each pair had identical WER scores but different ACE scores. Results strongly supported all hypotheses. For H1, DHH users significantly preferred the caption texts favored by ACE over those not preferred by ACE (Wilcoxon signed-rank W=643394, p<0.0001), with a median difference of 2.5 points on a 10-point usability scale. For H2a, ACE scores correlated significantly with participants' subjective usability ratings (Spearman rho=0.74, p<0.0001). For H2b, the correlation between ACE and human judgments was significantly higher than the correlation between WER and human judgments (rho=0.74 vs. rho=0.11, z=5.771, p<0.0001). The alpha parameter weighting predictability versus semantic distance was tuned to 0.65, indicating that word predictability contributed more to the overall impact score than semantic distance. The ACE metric was designed for real-time captioning of one-on-one meetings, a context where professional transcriptionists are typically unavailable.
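The role of the alpha weight can be illustrated with a small sketch. The exact ACE formula is not given in this summary, so the linear combination below is an assumption made for illustration only: it takes an "unpredictability" value (1 minus predictability) and a semantic distance, both assumed to be pre-normalized to the range 0 to 1, and combines them with alpha = 0.65. The names `error_impact` and `sentence_impact` are hypothetical. The max aggregation follows the summary's description that the maximum error impact determines the sentence score.

```python
ALPHA = 0.65  # weight reported in the summary; the combination below is assumed


def error_impact(unpredictability: float, semantic_distance: float,
                 alpha: float = ALPHA) -> float:
    """Hypothetical linear blend of the two factors, each assumed in [0, 1].

    This is NOT the paper's published formula; it only shows how a weight of
    0.65 gives predictability more influence than semantic distance.
    """
    for value in (unpredictability, semantic_distance, alpha):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs are assumed to lie in [0, 1]")
    return alpha * unpredictability + (1.0 - alpha) * semantic_distance


def sentence_impact(errors: list[tuple[float, float]]) -> float:
    """Score a sentence by its single most harmful error (0.0 if no errors)."""
    return max((error_impact(u, d) for u, d in errors), default=0.0)


# Two made-up errors: a predictable, near-synonym slip and an unpredictable,
# semantically distant one. The second dominates the sentence score.
errors = [(0.2, 0.1), (0.9, 0.8)]
print(sentence_impact(errors))
```

Under this assumed form, an error that is maximally unpredictable but semantically close scores 0.65, while one that is fully predictable but semantically distant scores 0.35, which mirrors the reported finding that predictability carried more weight.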
Relevance
This research has significant practical implications for anyone involved in selecting or developing ASR-based captioning systems for DHH users. The finding that WER — the metric ASR researchers overwhelmingly optimize for — has almost no correlation (rho=0.11) with actual caption usability for DHH readers is striking and suggests that many ASR improvements may not translate to better captioning experiences. The ACE metric offers a more meaningful evaluation tool that could be used to select among competing ASR systems for captioning applications, serve as a loss function during ASR training to produce more caption-friendly output, or flag particularly harmful errors in real-time caption streams. For accessibility practitioners, this paper underscores that technical performance metrics and real-world usability can diverge dramatically, especially when the end users have different reading strategies than the general population. The work also highlights the growing potential of ASR for real-time captioning in informal settings like workplace meetings where professional CART services are impractical.
Tags: captioning · automatic speech recognition · deaf and hard of hearing · evaluation methods · natural language processing · real-time captioning · metrics