Real-Time Captioning by Non-Experts with Legion Scribe

Walter S. Lasecki, Christopher D. Miller, Raja Kushalnagar, Jeffrey P. Bigham · 2013 · Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2013) · doi:10.1145/2513383.2513401

Summary

This short paper introduces Legion Scribe (Scribe), a system that enables 3-5 non-expert typists to collectively caption speech in real time, achieving accuracy approaching that of a professional stenographer at 20-30% of the cost. The system addresses a critical accessibility gap: real-time captioning for deaf and hard of hearing people currently depends on professional stenographers who charge $100-300/hour, require scheduling days in advance, and can only be booked in one-hour blocks. Meanwhile, automatic speech recognition captures only about 40% of speech in real settings and produces confusing errors. Scribe has several ordinary people who can hear and type each caption a part of the speech at the same time. The worker interface encourages real-time typing by locking in words shortly after they are typed. Visual and audio cues direct each captionist to specific segments: the audio volume is reduced during a worker's off-periods, and rewards are increased during on-periods. The system then automatically stitches the partial captions together using a Multiple Sequence Alignment algorithm to form a complete caption stream. Workers can be recruited on demand from crowdsourcing marketplaces such as Amazon Mechanical Turk, or from local pools such as work-study students paid around $10/hour.
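The stitching step can be illustrated with a minimal sketch. This is not the paper's Multiple Sequence Alignment algorithm: it swaps in a simpler progressive pairwise alignment built on Python's standard difflib module, ignores timestamps, and resolves disagreements by keeping the earlier worker's words. The function names merge_pair and merge_all and the sample captions are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher
from functools import reduce


def merge_pair(a, b):
    """Merge two word lists by aligning the words they share.

    Simplification (an assumption, not the paper's method): where the two
    lists disagree, the words from `a` are kept and those from `b` dropped.
    """
    merged = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            # Words only the second worker typed.
            merged.extend(b[j1:j2])
        else:
            # "equal", "delete", or "replace": take the first worker's words.
            merged.extend(a[i1:i2])
    return merged


def merge_all(partials):
    """Fold a list of partial captions (word lists) into one caption."""
    return reduce(merge_pair, partials)


# Hypothetical partial captions from three workers with overlapping coverage.
partials = [
    "the quick brown".split(),
    "brown fox jumps over".split(),
    "jumps over the lazy dog".split(),
]
print(" ".join(merge_all(partials)))
```

Each worker covers only part of the sentence, and the overlapping words ("brown", "jumps over") are what let the alignment place the pieces in order. A real system would also have to handle typos, conflicting words, and workers with no overlap, which this sketch does not attempt.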

Key findings

As few as 3 average typists can match the performance of an expert stenographer when their partial captions are merged by Scribe's alignment algorithm. The system meets its target latency of less than 5 seconds, which is comparable to or better than professional captioning. The cost is dramatically lower: 20-30% of the cost of hiring an expert stenographer. Non-expert captionists can be recruited on demand without advance scheduling, which opens up captioning for previously uncovered scenarios such as impromptu meetings, casual conversations, and last-minute events. The interface design choices (word locking, saliency-based audio cues, and point-based rewards) effectively encourage workers to type quickly and to cover different portions of the audio stream.
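The cost comparison can be made concrete with a rough calculation using only the rates quoted in the Summary: 3-5 workers at about $10/hour against a stenographer at $100-300/hour. This is a sketch under those assumptions. It leaves out marketplace fees and recruiting overhead, and the helper name cost_ratio is hypothetical.

```python
def cost_ratio(workers, worker_rate, stenographer_rate):
    """Hourly crowd cost as a fraction of one stenographer's hourly rate."""
    return (workers * worker_rate) / stenographer_rate


WORKER_RATE = 10  # dollars/hour, the work-study figure quoted in the Summary

# Sweep the endpoints of the ranges quoted in the Summary.
for workers in (3, 5):
    for steno_rate in (100, 300):
        ratio = cost_ratio(workers, WORKER_RATE, steno_rate)
        print(f"{workers} workers vs ${steno_rate}/hour stenographer: {ratio:.0%}")
```

Under these assumptions the ratio ranges from about 10% to 50% depending on which endpoints are paired. The 20-30% figure reported above falls inside that range, though the exact assumptions behind it are not stated in this summary.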

Relevance

This ASSETS 2013 paper is the initial conference presentation of the Legion Scribe system, which was later expanded into a full journal article in Communications of the ACM (2017). It represents a foundational contribution to accessible real-time communication, demonstrating that a group of non-expert humans working together can substitute for a scarce, expensive expert resource. For accessibility practitioners, the key insight is that on-demand captioning does not require waiting for perfect ASR: human computation can bridge the gap today. The inclusion of Raja Kushalnagar from Rochester Institute of Technology (which houses the National Technical Institute for the Deaf) helps ground the work in real DHH community needs. The system model, which decomposes a task too difficult for one non-expert into overlapping subtasks and then algorithmically recombines the results, has broad applicability to other accessibility challenges requiring real-time human intelligence.

Tags: real-time captioning · deaf and hard of hearing · crowdsourcing · human computation · assistive technology · captioning