Warping Time for More Effective Real-Time Crowdsourcing

Walter S. Lasecki, Christopher D. Miller, Jeffrey P. Bigham · 2013 · Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2013) · doi:10.1145/2470654.2466269

Summary

This paper introduces TimeWarp, a technique that manipulates audio playback speed to improve crowd workers' performance on real-time speech captioning. The core problem is that non-expert typists cannot keep up with natural speaking rates of 150-225 words per minute, so they must buffer what they hear and type it later, which increases errors and cognitive load. TimeWarp addresses this by dividing the audio stream into alternating "in" periods (where playback is slowed to half speed for captioning) and "out" periods (where playback is sped up to 1.5x to compensate). Each worker hears a differently time-shifted version of the audio so that, collectively, the crowd covers the entire speech in real time. The system builds on Legion:Scribe, which had already demonstrated that multiple non-expert workers could collectively produce real-time captions by merging their partial contributions. TimeWarp modifies the underlying audio using the WSOLA (Waveform Similarity Based Overlap and Add) algorithm to change playback speed without altering pitch. Workers see a captioning interface with a text entry box, visual and audio cues indicating when to type versus listen, and a score tracker showing their earned points.
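The alternating schedule can be illustrated with a small sketch. It is not the authors' implementation: the worker count (4), the equal-length periods, and all function names are assumptions made for illustration. The compensating speed is derived here from the requirement that one full cycle of audio takes as long to play as it took to speak; it is not quoted from the paper. Under these assumptions, a half-speed "in" period is balanced by 1.5x "out" periods when each worker captions one period in four.

```python
from fractions import Fraction

def compensating_speed(n_workers, in_speed):
    """Playback rate for "out" periods so that a cycle of n_workers periods
    plays back in exactly n_workers periods of wall-clock time.
    (Derived from that balance condition; an assumption, not the paper's code.)"""
    slow_wall = 1 / in_speed                # wall-clock periods used by one "in" period
    remaining_wall = n_workers - slow_wall  # wall-clock periods left to catch up
    if remaining_wall <= 0:
        raise ValueError("too few workers to catch up at this in-speed")
    return (n_workers - 1) / remaining_wall

def worker_schedule(worker, n_workers, period, n_cycles, in_speed):
    """Hypothetical schedule for one worker as (label, start, end, speed) tuples.
    start/end are positions in the original audio, in seconds. Each cycle begins
    with the worker's own "in" period, offset by the worker's index."""
    out_speed = compensating_speed(n_workers, in_speed)
    segments = []
    for cycle in range(n_cycles):
        start = (cycle * n_workers + worker) * period
        segments.append(("in", start, start + period, in_speed))
        segments.append(("out", start + period, start + n_workers * period, out_speed))
    return segments

def wall_clock_duration(segments):
    """Total listening time for a schedule: audio length divided by playback speed."""
    return sum((end - start) / speed for _, start, end, speed in segments)

def in_periods_cover_stream(n_workers, period, n_cycles, in_speed):
    """True if the workers' "in" periods tile the audio with no gaps or overlaps."""
    spans = sorted(
        (start, end)
        for w in range(n_workers)
        for label, start, end, _ in worker_schedule(w, n_workers, period, n_cycles, in_speed)
        if label == "in"
    )
    return spans[0][0] == 0 and all(a[1] == b[0] for a, b in zip(spans, spans[1:]))

half = Fraction(1, 2)
print(compensating_speed(4, half))  # 3/2, i.e. the 1.5x "out" speed
```

With these assumed values, each worker falls one period behind the live audio by the end of an "in" period and is back in sync by the end of the cycle, while the staggered "in" periods of the four workers together cover the whole stream.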

Key findings

In a study with 139 remote Mechanical Turk workers over 257 trials, TimeWarp improved mean coverage by 11.4% (workers captured more of the spoken content) and precision by 12.6% (fewer errors in what was typed), and reduced per-word latency by 16.8%; all three improvements were statistically significant. The latency reduction is counterintuitive, since slowing playback should introduce delay, but it occurred because workers at normal speed must first listen and memorize before typing (introducing 3.25+ seconds of cognitive buffering delay), while slowed playback allowed them to type words as they heard them. A second study with 24 local participants (more skilled typists) showed a significant latency improvement of 22.5% (from 4.34s to 3.36s per word), though coverage and precision gains were smaller and not significant, as these workers already performed well without TimeWarp. Post-trial interviews revealed that less-skilled workers valued time warping most, while more-skilled workers found it unnecessary. The main complaint across both groups was audio quality degradation from the warping process.

Relevance

TimeWarp demonstrates a creative approach to a fundamental accessibility challenge: providing real-time captions for deaf and hard of hearing people when professional captionists (whose hourly rates are high) are unavailable and automatic speech recognition is insufficiently accurate. The technique of modifying the task itself rather than expecting workers to perform beyond their abilities is a broadly applicable design principle. For accessibility practitioners, this research is relevant to situations requiring on-demand captioning (meetings, lectures, events) where professional CART services cannot be arranged. While ASR has improved dramatically since 2013, the underlying insight remains valuable: when human assistance is needed for real-time accessibility tasks, intelligent task decomposition and manipulation of the input can substantially improve non-expert performance. The approach also has implications for any real-time crowdsourced accessibility service where workers face cognitive bottlenecks.

Tags: real-time captioning · crowdsourcing · deaf and hard of hearing · human computation · speech accessibility