Online Quality Control for Real-Time Crowd Captioning

Walter S. Lasecki, Jeffrey P. Bigham · 2012 · Proceedings of the 14th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2012) · doi:10.1145/2384916.2384942

Summary

This paper addresses quality control in Legion:Scribe, a system that provides real-time captioning by having multiple non-expert crowd workers simultaneously type what they hear and then automatically merging their partial transcriptions into a single caption stream. Real-time captioning is essential for deaf and hard of hearing (DHH) people to participate in classrooms, meetings, and live events, but current options are either prohibitively expensive (professional CART stenographers at $60+/hour, who need 2-3 years of training) or error-prone (automatic speech recognition). Legion:Scribe offers a middle ground by leveraging crowds of untrained typists. When workers are recruited on demand from platforms like Amazon Mechanical Turk, however, input quality varies dramatically: approximately a quarter of workers in the study produced clearly bad input because they misunderstood the task, could not hear the audio, or were simply inattentive. The paper introduces two methods for estimating quality in real time: per-worker quality scoring (rating each worker by how much of their overall input overlaps with that of other workers) and word-by-word quality filtering (accepting an individual word only if multiple workers agree on it within a 10-second window). Both methods use inter-worker agreement as a proxy for correctness.
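The word-by-word filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `TypedWord` record, the `filter_by_agreement` function, and the exact-match comparison of lowercased words are all assumptions made for the example, and the real system presumably aligns streams more carefully.

```python
from collections import defaultdict
from typing import NamedTuple


class TypedWord(NamedTuple):
    """One word typed by one worker (hypothetical record layout)."""
    worker: str   # worker identifier
    word: str     # the word as typed
    time: float   # seconds since the audio started


def filter_by_agreement(words, min_workers=2, window=10.0):
    """Keep a word only if at least `min_workers` distinct workers
    typed it within `window` seconds of one another."""
    # Group occurrences by lowercased word (assumed matching rule).
    by_word = defaultdict(list)
    for w in words:
        by_word[w.word.lower()].append(w)

    accepted = []
    for token, hits in by_word.items():
        hits.sort(key=lambda h: h.time)
        i = 0
        while i < len(hits):
            # Gather every occurrence within `window` seconds of hits[i].
            j = i
            workers = set()
            while j < len(hits) and hits[j].time - hits[i].time <= window:
                workers.add(hits[j].worker)
                j += 1
            if len(workers) >= min_workers:
                # Emit the word once, at the earliest time it was typed.
                accepted.append((hits[i].time, token))
                i = j  # consume the whole cluster
            else:
                i += 1
    # Return accepted words in time order.
    return [token for _, token in sorted(accepted)]
```

With `min_workers=2`, a word typed by only one worker is dropped, which is how the filter trades coverage for precision.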

Key findings

In experiments with 42 Mechanical Turk workers captioning 20 minutes of MIT OpenCourseWare lecture audio, the word-by-word quality method raised precision from 57.8% to 81.2% by requiring just 2 workers to agree on a word, though at the cost of reduced coverage. With the per-worker quality method, requiring even modest agreement (10%) increased precision from 82.9% to 93.2% with no change in coverage, because the threshold primarily filtered out workers who had completely misunderstood the task. At higher agreement thresholds, per-worker filtering became unstable because too few workers were selected. Average latency decreased from 4.4 seconds with a single worker to 2.6 seconds with the full group, since each additional worker increased the chance that someone typed a given word quickly. Turkers had a significantly higher initial delay than student volunteers (5091 ms vs 2477 ms) but a similar progressive delay per consecutive word (~325 ms vs ~268 ms). The total cost for 20 minutes of captioning was $9.55, less than half of the $20 or more that the same duration would cost at the $60+/hour CART rate. Workers were consistent in quality throughout the session: those who started well stayed accurate, and those who started poorly stayed poor.
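The per-worker threshold can be illustrated with a similar sketch. It is a simplification under stated assumptions: the function names `agreement_scores` and `select_workers` are hypothetical, a worker's score is taken to be the fraction of their words that any other worker also typed, and timing is ignored entirely, which the paper's method may not do.

```python
def agreement_scores(transcripts):
    """Score each worker by the fraction of their words that at least
    one other worker also typed (assumed definition of agreement).

    transcripts: dict mapping worker id -> list of typed words.
    """
    scores = {}
    for worker, words in transcripts.items():
        # Vocabulary typed by everyone except this worker.
        others = set()
        for other, other_words in transcripts.items():
            if other != worker:
                others.update(w.lower() for w in other_words)
        if not words:
            scores[worker] = 0.0
            continue
        matched = sum(1 for w in words if w.lower() in others)
        scores[worker] = matched / len(words)
    return scores


def select_workers(transcripts, threshold=0.10):
    """Return the workers whose agreement score meets the threshold."""
    scores = agreement_scores(transcripts)
    return sorted(w for w, s in scores.items() if s >= threshold)
```

A low threshold such as 0.10 only removes workers whose input shares almost nothing with the rest of the group, which matches the finding that modest agreement mainly screens out workers who misunderstood the task.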

Relevance

This research demonstrates a scalable, affordable alternative to professional real-time captioning that could dramatically expand access for DHH people in everyday situations where CART services are unavailable or unaffordable. The quality control methods are practical and lightweight, requiring no pre-screening of workers. For accessibility practitioners, the key insight is that crowd agreement is a powerful quality signal even in real-time contexts: bad workers tend to produce unique errors, while good workers converge on the correct transcription. The tradeoff between precision and coverage (accuracy versus completeness) is directly relevant to captioning design decisions, since users may prefer accurate but incomplete captions over complete but error-filled ones. The work also suggests that combining crowd captioning with post-hoc editing interfaces could achieve professional-grade results at a fraction of the cost, making real-time captioning accessible in informal settings such as study groups, casual conversations, or small meetings where professional services are impractical.

Tags: real-time captioning · crowdsourcing · deaf and hard of hearing · automatic speech recognition · human computation · quality control · CART