The Effects of Automatic Speech Recognition Quality on Human Transcription Latency

Yashesh Gaur, Walter S. Lasecki, Florian Metze, Jeffrey P. Bigham · 2016 · Proceedings of the 13th International Web for All Conference (W4A) · doi:10.1145/2899475.2899478

Summary

This paper from Carnegie Mellon University and the University of Michigan empirically investigates when automatic speech recognition (ASR) output helps or hinders human transcriptionists producing captions for deaf and hard of hearing people. Manual transcription remains necessary because ASR quality is insufficient in many real-world settings, but it can take more than five times the duration of the original audio, introducing significant latency. Giving human captionists ASR output to edit, rather than having them type from scratch, seems promising, but the authors hypothesize a crossover point beyond which editing low-quality ASR becomes slower than starting fresh. Their reasoning is that editing demands more cognitive effort than typing: captionists must simultaneously listen to the audio, read the ASR text, identify discrepancies, and make corrections, whereas typing from scratch primarily involves motor transcription with minimal cognitive mapping. Additionally, phoneme-based ASR systems produce homophone errors that can trick editors into accepting incorrect words. The researchers used the TEDLIUM dataset of TED talks with the Kaldi speech recognition toolkit, generating ecologically valid ASR transcripts at different error rates by varying the decoder's beam width rather than by artificially introducing errors.

Key findings

Two studies on Amazon Mechanical Turk (one between-subjects with 160 participants, one within-subjects with 16 participants) converged on a critical threshold: ASR output is beneficial as a starting point only when the Word Error Rate (WER) is below approximately 30%. Above 30% WER, editing ASR output takes longer than typing from scratch. An unexpected finding emerged at very high error rates (above 45%): latency actually decreased, because workers recognized that the ASR was too poor to salvage and deleted large chunks to retype from scratch. At WER above 50%, 42.52% of workers cleared the text, compared to just 7.12% at WER below 30%. Interaction logs revealed a three-phase editing pattern: an initial edit phase (in-place corrections), a rewrite phase (adding new words without removing old ones), and a delete phase (removing remaining incorrect content). The within-subjects study showed that exposure to good ASR transcripts made participants less willing to abandon poor ones, suggesting that mixed-quality ASR may create a persistence bias. A one-way ANOVA on the normalized within-subjects data confirmed that the trend of increasing latency was significant (p = 0.01).
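For readers unfamiliar with the metric behind the 30% figure: WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the ASR hypothesis into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not the scoring pipeline the authors used; the function name and the lowercase-and-split-on-whitespace normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are compared after lowercasing and splitting on
    whitespace. Real scoring tools usually normalize text more carefully.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.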

Relevance

This research has direct implications for the accessibility of real-time captioning, the primary means by which deaf and hard of hearing people access live spoken content in lectures, meetings, and events. The 30% WER threshold provides a practical benchmark: captioning systems should present ASR output for human editing only when its WER is below roughly 30%; otherwise they should simply provide a blank slate. This finding is particularly relevant as hybrid human-AI captioning systems become more common. The work also speaks to the broader design of human-AI collaboration in accessibility: poor AI assistance can be worse than no assistance, a counterintuitive insight that applies beyond captioning to any domain where humans are asked to correct automated output. The observation that workers struggle to recognize when ASR quality has crossed the usefulness threshold (they did not abandon the ASR text until about 45% WER, although the actual crossover is at about 30%) suggests that captioning interfaces should incorporate automatic quality detection to decide when to show ASR output. This research predates the dramatic improvements in ASR accuracy since 2016, but the underlying principle remains broadly applicable: there is a quality threshold below which human-AI collaboration becomes counterproductive.
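The benchmark described above amounts to a simple gating rule, which could be sketched as follows. This is an illustrative assumption about how an interface might apply the finding, not something the paper implements: the function name is hypothetical, and `estimated_wer` is assumed to come from some external quality estimator, since no reference transcript exists for live audio.

```python
WER_THRESHOLD = 0.30  # approximate crossover point reported in the paper


def initial_editor_text(asr_hypothesis: str,
                        estimated_wer: float,
                        threshold: float = WER_THRESHOLD) -> str:
    """Return the text a captionist should start from.

    Hypothetical helper: shows the ASR hypothesis only when its estimated
    WER is below the threshold, and otherwise returns an empty string so
    the captionist types from scratch.
    """
    if estimated_wer < 0.0:
        raise ValueError("estimated_wer must be non-negative")
    return asr_hypothesis if estimated_wer < threshold else ""
```

Making the decision automatically, rather than leaving it to the captionist, reflects the finding that workers kept editing poor ASR output well past the point where it stopped helping.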

Tags: speech recognition · captioning · deaf and hard of hearing · crowdsourcing · human computation · automatic speech recognition