Deaf and Hard-of-Hearing Perspectives on Imperfect Automatic Speech Recognition for Captioning One-on-One Meetings

Larwan Berke, Christopher Caulfield, Matt Huenerfauth · 2017 · Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '17) · doi:10.1145/3132525.3132541

Summary

This paper investigates whether and how to display word-level confidence information from Automatic Speech Recognition (ASR) systems in real-time captions for Deaf and Hard-of-Hearing (DHH) users during one-on-one meetings with hearing people. ASR engines assign a confidence score to each word they transcribe, but this information is typically hidden from users. The researchers hypothesised that showing DHH users which words the ASR system was uncertain about could help them better interpret imperfect captions.

The research consisted of two studies using videos of simulated one-on-one business meetings with ASR-generated captions (average Word Error Rate of 23.2%). A pilot study with 21 DHH participants compared 12 visual markup styles for displaying confidence, including bold, colour changes, font size variations, underlining, italics, deletion of uncertain words, and grayscale gradients, applied to either confident or uncertain words. Based on the pilot results, four conditions were selected for a larger study with 107 DHH participants: No Change (baseline), Italics on Uncertain, Underline on Uncertain, and Yellow+Bold on Uncertain. The larger study combined quantitative measures (binary preference questions, Likert-scale helpfulness ratings, forced rankings) with qualitative analysis: open coding of 364 comments, totalling 6,112 words, by a Deaf native ASL signer and a hearing researcher.
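
For readers unfamiliar with the two technical quantities mentioned above, the following is a minimal Python sketch, not the authors' implementation. It computes Word Error Rate using the standard definition (word-level edit distance divided by the number of reference words) and marks words whose confidence falls below a threshold. The function names, the 0.5 threshold, the example sentences and confidence values, and the use of underscores as a plain-text stand-in for italics are all assumptions made for illustration; the summary does not state which threshold or rendering code the study used.

```python
from typing import List, Tuple


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


def mark_uncertain(words: List[Tuple[str, float]], threshold: float = 0.5) -> str:
    """Wrap low-confidence words in underscores (hypothetical stand-in for italics).
    The threshold value is an assumption, not taken from the paper."""
    return " ".join(f"_{word}_" if conf < threshold else word
                    for word, conf in words)


# Made-up example: what was said versus what the ASR produced.
spoken = "please send the report by friday".split()
recognised = "please send a report friday".split()
print(round(word_error_rate(spoken, recognised), 3))  # 0.333 (1 substitution + 1 deletion over 6 words)

# Made-up per-word confidence scores for the recognised caption.
scored = [("please", 0.95), ("send", 0.91), ("a", 0.32),
          ("report", 0.88), ("friday", 0.47)]
print(mark_uncertain(scored))  # please send _a_ report _friday_
```

Note that the marked words in such a scheme are only the ones the engine is unsure about, which need not coincide with the words that are actually wrong.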

Key findings

The most striking finding was a paradox: participants expressed initial interest in the concept of confidence markup, but after actually experiencing it they significantly preferred captions with no confidence markup at all. In the larger study, the No Change baseline received significantly higher preference scores than Underline (p=0.00085), Yellow+Bold (p=0.00025), and Italics (p=0.02307). When asked to rank all four styles, 86.4% of participants ranked No Change first.

Qualitative analysis revealed several reasons. Distraction was the most common concern (N=32): participants reported that markup made it harder to focus on content while simultaneously reading captions and watching their conversational partner. Crucially, markup also increased the perceived inaccuracy of captions: participants believed marked-up captions were less accurate than unmarked ones, even though error rates were identical across conditions. Some participants (N=13) expressed resistance to unfamiliar captioning styles, preferring what they knew from TV closed captions.

Participants did identify benefits, however: markup increased their awareness of errors, helped them understand how ASR works, and increased their confidence in the technology as a tool. They also raised broader concerns about overall ASR accuracy, the potential for ASR to replace human ASL interpreters, the lack of bidirectional communication support, and reliability for impromptu use. Suggested applications included public places, airports, appointments, and cultural events.

Relevance

This research provides important evidence for anyone developing automated captioning systems for DHH users. The central tension, that users want to know about ASR uncertainty in theory but find confidence markup distracting in practice, has significant design implications. It suggests that simply exposing system confidence visually is not the right approach; alternatives such as on-demand confidence checking, post-meeting review with markup, or confidence-based audio/haptic alerts may be more effective. The finding that confidence markup increases perceived inaccuracy is a cautionary tale about transparency features that can inadvertently reduce trust in assistive technology. For accessibility practitioners, the study also underscores that DHH users have diverse needs and experiences with captioning: approximately 90% of deaf children are born to hearing parents, and DHH users may have varying English literacy levels. The paper's analysis of text complexity (Flesch-Kincaid Grade Level of 8.6 for the stimuli) points to the importance of considering literacy when designing caption systems. Limitations include the simulated rather than live meeting context and the single WER level tested.
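
The Flesch-Kincaid Grade Level cited above is a standard readability formula that depends only on average sentence length and average syllables per word. A minimal Python sketch of the formula follows, for readers who want to check the complexity of their own caption text. The function name and the word, sentence, and syllable counts in the example are hypothetical; they are not the paper's stimuli, and the summary does not say which tool the authors used to obtain 8.6.

```python
def flesch_kincaid_grade(total_words: int, total_sentences: int,
                         total_syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical transcript: 100 words in 5 sentences with 150 syllables.
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```

The counts are passed in directly because syllable counting is itself heuristic; a real pipeline would need a tokeniser and a syllable estimator, and different tools can give slightly different grades for the same text.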

Tags: deaf accessibility · automatic speech recognition · captioning · communication · user research · deaf and hard of hearing · confidence display