Legion Scribe: Real-Time Captioning by Non-Experts

Walter S. Lasecki, Raja Kushalnagar, Jeffrey P. Bigham · 2014 · ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/2661334.2661352

Summary

This demonstration paper presents Legion:Scribe, a crowd-powered captioning system in which groups of 3-5 non-expert typists collectively produce real-time captions with less than 5 seconds of latency. The system addresses the prohibitive cost of professional stenographers (over $100/hour) and the unreliability of automatic speech recognition (which at the time captured only about 40% of speech in real settings). Each worker types at a normal rate and individually captures 20-30% of the spoken words. These partial, overlapping captions are computationally merged into a single output that is more accurate and complete than any individual worker could produce. The paper details the worker interface design. Workers see a text field with the instruction to type words as quickly as possible. Words lock in once typed, which encourages real-time input rather than buffering. Visual and audio cues indicate when workers should be typing specific segments, and audio volume is reduced during off-periods and increased during on-periods. A point-based reward system incentivizes fast and accurate typing. The merging algorithm uses multiple sequence alignment (MSA) to optimally combine the partial captions, comparing different versions of each word across workers to identify and remove errors.
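The merging idea can be illustrated with a small sketch. It is not the MSA algorithm used in Legion:Scribe. It is a simplified stand-in that assumes every typed word carries a timestamp, groups words that are close in time and similar in spelling, and lets a majority vote pick each group's spelling. The function names, the 1-second window, and the 0.75 similarity threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a, b, threshold=0.75):
    """True if two words are close enough to be variants of one spoken word.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return SequenceMatcher(None, a, b).ratio() >= threshold


def merge_captions(worker_words, window=1.0):
    """Merge partial captions from several workers into one word list.

    worker_words: one list per worker of (timestamp_seconds, word) pairs.
    window: maximum time gap (seconds) for two entries to count as the
            same spoken word (hypothetical parameter).
    """
    # Pool every worker's words and order them by time.
    events = sorted((t, w.lower()) for words in worker_words for t, w in words)

    clusters = []  # each cluster: {"time": first timestamp, "words": [variants]}
    for t, w in events:
        match = None
        # Search recent clusters, newest first, for a similar word.
        for c in reversed(clusters):
            if t - c["time"] > window:
                break  # older clusters are even further away in time
            if similar(w, c["words"][0]):
                match = c
                break
        if match is None:
            clusters.append({"time": t, "words": [w]})
        else:
            match["words"].append(w)

    # Majority vote inside each cluster removes isolated typos.
    return [Counter(c["words"]).most_common(1)[0][0] for c in clusters]


if __name__ == "__main__":
    a = [(0.0, "the"), (0.4, "quick"), (0.8, "brown")]
    b = [(0.5, "quik"), (0.9, "brown"), (1.3, "fox"), (1.7, "jumps")]
    c = [(0.45, "quick"), (1.35, "fox"), (1.75, "jumps"), (2.1, "over")]
    print(" ".join(merge_captions([a, b, c])))
```

In this toy input no single worker typed the whole phrase, and one worker misspelled "quick", yet the merged output is "the quick brown fox jumps over". The sketch has clear limits: a word genuinely spoken twice within the window is collapsed into one, and similar short words such as "the" and "then" can be confused. A real alignment method has to handle cases like these.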

Key findings

The demonstration showed that Legion:Scribe could produce reliable real-time captions with under 5 seconds of latency using untrained workers. The system achieved this by having each worker caption roughly 3 seconds of audio, followed by 9-15 seconds for typing, with visual and audio cues coordinating their efforts. The word-locking mechanism in the interface, in which typed words are immediately locked and cannot be edited, encouraged workers to type in real time rather than waiting to hear a full segment before starting to type. The point system rewarded workers with base points per word plus bonuses for fast input, creating incentives aligned with the real-time requirement. The collective output exceeded what any individual non-expert could achieve, demonstrating that computational merging of partial contributions can overcome the fundamental speed limitation of individual typing.
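The coordination and incentive scheme described here can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the system's actual implementation. It assumes workers are staggered evenly across a cycle of one on-period plus one off-period. The volume levels, point values, and bonus window are hypothetical, and so are the function names.

```python
def active_workers(t, n_workers, on_s=3.0, off_s=9.0):
    """Return the indices of workers whose on-period covers time t (seconds).

    Assumption: worker i starts its cycle i * (cycle / n_workers) seconds
    after worker 0, so on-periods are spread evenly over the cycle.
    """
    cycle = on_s + off_s
    stagger = cycle / n_workers
    return [i for i in range(n_workers) if (t - i * stagger) % cycle < on_s]


def playback_volume(is_on, loud=1.0, quiet=0.3):
    """Audio gain for a worker: louder during on-periods, quieter otherwise.

    The two gain levels are placeholders, not values from the paper.
    """
    return loud if is_on else quiet


def word_points(delay_s, base=10.0, max_bonus=5.0, bonus_window_s=3.0):
    """Base points per word plus a bonus that shrinks as typing delay grows.

    delay_s: seconds between a word being spoken and being typed.
    All point values here are hypothetical.
    """
    speed = min(1.0, max(0.0, 1.0 - delay_s / bonus_window_s))
    return base + max_bonus * speed


if __name__ == "__main__":
    for t in (0.0, 3.5, 7.0, 11.9):
        on = active_workers(t, n_workers=4)
        print(f"t={t:>4}s  on-duty workers: {on}")
    print(word_points(0.0), word_points(1.5), word_points(5.0))
```

Under these assumptions, a 3-second on-period and a 9-second off-period give a 12-second cycle, so four evenly staggered workers cover the audio with no gaps. A fifth worker or a shorter stagger produces the overlap between neighbouring segments that merging relies on.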

Relevance

This demo paper complements the longer "Real-Time Captioning with the Crowd" article (Interactions 2014) by providing specific implementation details of the worker interface and incentive design. For accessibility practitioners interested in crowd-powered captioning, the interface design choices are instructive: word-locking prevents editing overhead, audio volume modulation naturally segments worker attention, and point-based rewards align worker incentives with system requirements. The work demonstrates that careful interface design can turn ordinary typists into effective real-time captionists, an idea that extends beyond captioning to any accessibility service that must split a difficult real-time task across multiple non-expert contributors.

Tags: real-time captioning · crowdsourcing · deaf and hard of hearing · speech-to-text · human computation · accessibility