Real-Time Captioning by Groups of Non-Experts

Walter Lasecki, Christopher Miller, Adam Sadilek, Andrew Abumoussa, Donato Borrello, Raja Kushalnagar, Jeffrey Bigham · 2012 · UIST '12: Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology · doi:10.1145/2380116.2380122

Summary

This paper presents Legion:Scribe, an end-to-end system that enables groups of non-expert typists to collectively produce real-time captions for deaf and hard of hearing (DHH) people, offering a cheaper and more available alternative to professional stenographers (CART, costing $120-200/hour) and more accurate results than automatic speech recognition (ASR). The key insight is that while no single non-expert can type fast enough to keep up with natural speech (averaging 141 words per minute), multiple workers typing simultaneously can collectively cover the entire audio stream when their partial contributions are intelligently merged. The system streams audio to multiple crowd workers via a Flash Media Server, with each worker typing as much as they can hear. To encourage coverage of different portions of the speech, Scribe artificially varies audio saliency — playing audio at higher volume during assigned "on" periods (4 seconds) and lower volume during "off" periods (6 seconds), with different workers offset so the entire stream is covered. Workers are incentivized with points for correct words, with bonuses during on-periods. The partial captions from all workers are then merged using two algorithms: a multiple sequence alignment (MSA) approach adapted from the MUSCLE bioinformatics package (replacing the nucleotide mutation model with a QWERTY keyboard-based spelling error model and augmenting with Wikipedia-derived corrections), and a novel online dynamic sequence alignment using a graphical model with linked lists and greedy graph traversal weighted by n-gram language model probabilities. The system also includes an end-user interface that presents merged captions as flowing text with confidence indicators, supports collaborative editing, and allows users to adjust the coverage-precision tradeoff via a slider.
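
The saliency schedule described above can be illustrated with a minimal sketch. The 4-second "on" and 6-second "off" durations come from the summary; the even spacing of worker offsets across the 10-second cycle and the function names (`worker_offset`, `is_on`, `active_workers`) are assumptions made for illustration, not the paper's implementation.

```python
# Durations taken from the summary; everything else here is a hypothetical sketch.
ON_SECONDS = 4.0
OFF_SECONDS = 6.0
PERIOD = ON_SECONDS + OFF_SECONDS


def worker_offset(worker_index, n_workers):
    """Start time of a worker's first on-period (assumed: evenly spaced offsets)."""
    return worker_index * PERIOD / n_workers


def is_on(t, worker_index, n_workers):
    """True if the audio at time t (seconds) is played loud for this worker."""
    phase = (t - worker_offset(worker_index, n_workers)) % PERIOD
    return phase < ON_SECONDS


def active_workers(t, n_workers):
    """Indices of the workers whose on-period contains time t."""
    return [w for w in range(n_workers) if is_on(t, w, n_workers)]


if __name__ == "__main__":
    for t in range(0, 10):
        print(t, active_workers(float(t), 3))
```

Under this assumed spacing, three workers are the minimum for which every moment falls inside someone's on-period, since each worker is "on" for only 4 of every 10 seconds; with two workers some stretches of audio are highlighted for nobody.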

Key findings

Evaluation with 20 local participants and 18 Mechanical Turk workers captioning MIT OpenCourseWare lectures showed that Scribe outperforms ASR and individual workers on coverage and word error rate, and outperforms professional CART on coverage and latency, though not on accuracy. With 10 workers, Scribe achieved 93.2% coverage of the audio stream (the fraction of spoken words that appeared in the merged output within a 10-second window), compared to 88.5% for professional CART, 32.3% for ASR (Dragon NaturallySpeaking 11.5), and 29.0% for an average individual worker. Scribe achieved an average per-word latency of 2.9 seconds, significantly better than CART's 4.38 seconds, which is critical for enabling DHH users to participate in conversations where pairing speech with visual cues matters. Word error rate was 45.1% for the combiner versus 10.9% for CART, 63.4% for ASR, and 60.9% for individual workers: the combiner substantially outperformed both ASR and individual workers but remained well behind CART. Precision was 80.3% for the combiner versus 94.7% for CART, 48% for ASR, and 87.4% for individuals. The saliency adjustment mechanism effectively directed workers to type specific portions: workers typed 50-55% of words during highlighted on-periods versus only 15-23% during off-periods. With saliency adjustments, only 2 workers were needed to achieve 50% coverage, compared to 6 workers without adjustments. The Mechanical Turk deployment (18 workers, $36.10 total for 20 minutes of speech) achieved 78% collective coverage with an average of 59.7% per worker, demonstrating feasibility with remote crowd workers. Importantly, human errors and ASR errors were complementary (humans substituted semantically similar words while ASR substituted phonetically similar ones), suggesting hybrid human-ASR systems could outperform either alone.
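
To make the two headline metrics concrete, here is a minimal sketch of how word error rate and windowed coverage might be computed. The word error rate uses the standard word-level edit distance divided by reference length. The coverage function follows the definition given above (a spoken word counts if it appears in the output within a 10-second window), but the one-to-one greedy matching, the exact-string comparison, and the `(word, time)` input format are assumptions for illustration and may differ from the paper's evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def coverage(spoken, typed, window=10.0):
    """Fraction of spoken (word, time) pairs matched by a typed word in the window.

    Assumption: each typed word may be matched to at most one spoken word,
    and the first unused match is taken greedily.
    """
    if not spoken:
        raise ValueError("spoken must contain at least one word")
    used = set()
    hits = 0
    for word, t_spoken in spoken:
        for k, (w, t_typed) in enumerate(typed):
            if k not in used and w == word and abs(t_typed - t_spoken) <= window:
                used.add(k)
                hits += 1
                break
    return hits / len(spoken)


if __name__ == "__main__":
    spoken = [("the", 0.0), ("cat", 0.5), ("sat", 1.0), ("down", 1.5)]
    typed = [("the", 2.0), ("sat", 3.0), ("down", 20.0)]
    print(coverage(spoken, typed))                         # 0.5
    print(word_error_rate("the cat sat down", "the cat sit"))  # 0.5
```

In the toy example, "down" is typed correctly but 18.5 seconds late, so it does not count toward coverage; this is how a caption stream can score well on accuracy yet poorly on a latency-sensitive coverage measure.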

Relevance

Legion:Scribe represents a foundational contribution to crowd-powered accessibility, demonstrating that the collective effort of non-experts can match or exceed professional services on coverage and latency, though not yet on accuracy, for a task previously thought to require years of specialized training. The practical implications are significant: professional CART must be booked in advance, costs $120-200 per hour, and is unavailable for spontaneous situations like after-class conversations, impromptu meetings, or social events. Scribe offers on-demand captioning at a fraction of the cost, potentially democratizing real-time caption access for millions of DHH people. For accessibility practitioners, the paper provides several transferable insights. The saliency-based task allocation, which gives workers natural cues about when to contribute rather than rigid assignments, is an elegant solution for coordinating parallel human effort. The coverage-precision tradeoff slider demonstrates user-configurable accessibility, recognizing that different situations require different balances (a casual conversation may favor coverage, while a legal proceeding demands precision). The finding that human and ASR errors are complementary foreshadows modern hybrid approaches used by services like Otter.ai and Google Live Transcribe. More broadly, the paper established the paradigm that groups of workers contributing partial, imperfect inputs that are automatically merged can collectively outperform both individuals and automated systems, a model with applications far beyond captioning.

Tags: real-time captioning · crowdsourcing · deaf and hard of hearing · human computation · text alignment · CART alternatives · assistive technology · speech recognition