Scribe: Deep Integration of Human and Machine Intelligence to Caption Speech in Real Time

Walter S. Lasecki, Christopher D. Miller, Iftekhar Naim, Raja Kushalnagar, Adam Sadilek, Daniel Gildea, Jeffrey P. Bigham · 2017 · Communications of the ACM · doi:10.1145/3068663

Summary

Scribe is a system that provides on-demand, real-time captioning of live speech for deaf and hard of hearing (DHH) people by combining groups of non-expert human captionists with machine intelligence. The system addresses a critical accessibility gap: professional CART (Communication Access Realtime Translation) captionists are expensive ($120–$200/hour), require advance booking, and are scarce, while automatic speech recognition (ASR) produces unacceptable error rates in real-world conditions (dropping below 50% accuracy with untrained speakers or poor audio). Scribe recruits multiple non-expert typists from Mechanical Turk or volunteer pools, each earning $8–$12/hour, and has them collectively caption an audio stream in real time. The system integrates automated assistance in two key ways. First, its worker interface (TimeWarp) directs each captionist to different portions of the audio, slows playback to half speed during their active captioning period, and adaptively adjusts segment length based on typing speed. Second, it uses a custom weighted A* multiple-sequence alignment (MSA) algorithm to merge the partial, overlapping captions from multiple workers into a single coherent transcript. The system also includes automated speaker segmentation for handling dialogues, saliency adjustments that vary audio volume to encourage coverage of specific segments, and a collaborative editing framework that presents merged captions in a natural reading flow.
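The segment-direction idea behind TimeWarp can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Segment`, `assign_segments`, and `catch_up_speed` are hypothetical, segments are fixed-length and assigned round-robin (the summary says Scribe adapts segment length to typing speed, which is omitted here), and the catch-up speed is an assumption derived from requiring each worker to stay in step with the live stream on average.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    index: int     # position of the segment in the stream
    start: float   # start time in seconds
    end: float     # end time in seconds
    worker: int    # hypothetical id of the worker asked to caption it


def assign_segments(duration, segment_len, n_workers):
    """Split the audio into fixed-length segments and assign them round-robin."""
    segments = []
    start = 0.0
    index = 0
    while start < duration:
        end = min(start + segment_len, duration)
        segments.append(Segment(index, start, end, index % n_workers))
        start = end
        index += 1
    return segments


def catch_up_speed(n_workers, active_speed=0.5):
    """Playback speed a worker needs outside their active segment.

    Assumption: in one cycle of n_workers segments, a worker hears one
    segment at active_speed and must hear the other n_workers - 1 segments
    in the wall-clock time that remains, so that they do not drift behind.
    """
    slack = n_workers - 1.0 / active_speed
    if slack <= 0:
        raise ValueError("too few workers to catch up at this active speed")
    return (n_workers - 1) / slack


if __name__ == "__main__":
    for seg in assign_segments(duration=25.0, segment_len=10.0, n_workers=3):
        print(seg)
    print(catch_up_speed(3))  # half-speed active period, three workers
```

Under these assumptions, half-speed playback needs at least three workers per stream, since with two workers there is no spare time left to catch up.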

Key findings

With just three non-expert workers, Scribe achieved 59.7% coverage of spoken content, and with 10 workers it reached 74% coverage out of a possible 93.2%, well above the 29.0% coverage of ASR on the same audio. Scribe achieved an average latency of 2.89 seconds, well under the 5-second target and lower than CART's 4.38-second latency. The weighted A* MSA algorithm achieved 57.4% accuracy (measured as 1 − WER, where WER is word error rate), a 29.6% improvement over the graph-based method and a 35.4% improvement over MUSCLE-based alignment. TimeWarp improved Mechanical Turk workers' performance: coverage increased by 11.39%, precision increased by 12.61%, and latency fell by 16.77% (all statistically significant). Saliency adjustments more than doubled coverage for words in highlighted periods (54.7% vs. 23.3%). ASR alone scored only 36.6% accuracy on the same clips, worse than all crowd-powered approaches, which demonstrates the value of human intelligence in real-world captioning scenarios.
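The accuracy figures above are reported as one minus the word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard word-level edit-distance definition of WER. The function names and the lowercase, whitespace-based tokenization are assumptions for illustration; the summary does not say how the authors normalized text before scoring.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy(reference, hypothesis):
    """Accuracy in the 1 - WER sense used in the findings above."""
    return 1.0 - word_error_rate(reference, hypothesis)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Because insertions count as errors, WER can exceed 1, so accuracy in this sense can be negative for a very noisy transcript.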

Relevance

This research directly addresses one of the most significant barriers faced by deaf and hard of hearing people: access to live spoken content in educational, professional, and social settings. The WHO estimates 360 million people have disabling hearing loss, and many cannot access sign language interpretation. Scribe demonstrates that reliable, affordable, on-demand captioning is achievable by combining crowd workers with intelligent algorithms, which makes it possible to caption previously inaccessible events such as impromptu conversations, last-minute lectures, and informal meetings. The inclusion of Raja Kushalnagar from Gallaudet University grounds the work in the DHH community's actual needs. For accessibility practitioners, Scribe offers a model for hybrid human-AI services where neither component is sufficient alone, and shows that thoughtful interface design (TimeWarp, saliency cues) can substantially improve non-expert performance on accessibility-critical tasks. The work also highlights how advances in ASR could eventually complement human captionists rather than replace them.

Tags: real-time captioning · deaf and hard of hearing · crowdsourcing · human computation · speech recognition · assistive technology · captioning