Leveraging Complementary Contributions of Different Workers for Efficient Crowdsourcing of Video Captions

Yun Huang, Yifeng Huang, Na Xue, Jeffrey P. Bigham · 2017 · CHI Conference on Human Factors in Computing Systems · doi:10.1145/3025453.3026032

Summary

This paper presents BandCaption, a crowdsourcing system that combines automatic speech recognition (ASR) with input from diverse crowd workers to efficiently correct video captions. The key insight is that four groups of people (hearing-impaired users, second-language speakers with low proficiency, second-language speakers with high proficiency, and native speakers) have different abilities and motivations, and that these differences complement one another in caption correction. Rather than requiring a single skilled captionist or relying solely on error-prone ASR, BandCaption implements a Mark-Edit-Approve workflow in which the work is decomposed into micro-tasks matched to workers' strengths. Workers who rely on captions (hearing-impaired users, second-language learners) can mark errors they encounter while watching, even if they cannot correct them, while more proficient speakers can edit the marked captions. The system bootstraps from YouTube's ASR-generated captions and provides an interface where workers can replay specific caption segments, mark errors with minimal effort, directly edit caption text, leave comments for other workers, and filter contributions. The design was developed through a participatory process over 9 months with iterative testing by 7 participants.
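
To make the Mark-Edit-Approve workflow concrete, the following is a minimal sketch of how a caption might move through the three stages. It is not BandCaption's implementation: the names `Status`, `Caption`, `mark`, `edit`, `approve`, and `edit_queue` are hypothetical, and the rule that only edited captions can be approved is an assumption, since this summary does not describe what the Approve step checks.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    UNREVIEWED = "unreviewed"  # raw ASR output that nobody has flagged
    MARKED = "marked"          # a viewer flagged a suspected error
    EDITED = "edited"          # a worker proposed corrected text
    APPROVED = "approved"      # a correction was accepted (assumed final)


@dataclass
class Caption:
    index: int
    text: str
    status: Status = Status.UNREVIEWED
    comments: List[str] = field(default_factory=list)


def mark(caption: Caption, comment: str = "") -> None:
    """Flag a caption as suspicious; the marker need not know the fix."""
    if caption.status is Status.UNREVIEWED:
        caption.status = Status.MARKED
    if comment:
        caption.comments.append(comment)


def edit(caption: Caption, new_text: str) -> None:
    """Replace the caption text; approved captions are treated as final."""
    if caption.status is Status.APPROVED:
        raise ValueError("approved captions are final in this sketch")
    caption.text = new_text
    caption.status = Status.EDITED


def approve(caption: Caption) -> None:
    """Accept a proposed correction (assumed rule: only edited captions)."""
    if caption.status is not Status.EDITED:
        raise ValueError("only edited captions can be approved")
    caption.status = Status.APPROVED


def edit_queue(captions: List[Caption]) -> List[Caption]:
    """Captions that were marked but not yet corrected."""
    return [c for c in captions if c.status is Status.MARKED]


# Usage: a caption-reliant viewer marks, a proficient speaker edits,
# and a third worker approves. The caption texts are made up.
captions = [
    Caption(0, "welcome to the lecture"),
    Caption(1, "today we cover grammer"),
]
mark(captions[1], "possible spelling error")
queued = [c.index for c in edit_queue(captions)]
edit(captions[1], "Today we cover grammar.")
approve(captions[1])
```

Separating `mark` from `edit` mirrors the point made above: a worker who can spot an error but cannot fix it still contributes, and `edit_queue` hands exactly those flagged captions to someone who can.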

Key findings

A study with 34 participants across the four groups revealed distinct and complementary contribution patterns. Hearing-impaired (HI) participants were uniquely sensitive to missing punctuation: all 4 HI participants expressed frustration about it, while native speakers considered punctuation errors negligible (one admitted "I did not mark the punctuations and repeated words"). In the test video, 30% of captions had punctuation problems, validating the HI participants' concerns. Second-language speakers, both low-proficiency (SLL) and high-proficiency (SLH), were particularly good at identifying missing words and irrelevant words, while native speakers marked the most captions overall (M=78.8 vs. 28.5-39.6 for other groups). In the editing phase, SLH participants corrected most errors but each left some captions uncorrected (M=25.6 failures); native speakers then successfully corrected most of those remaining captions (M=7.25 of 10 extracted failures). The ASR baseline saved participants 93.28% of typing effort on average. Crucially, the captions that one group marked and failed to correct were largely different from those that other groups marked, confirming the value of complementary contributions.

Relevance

This paper offers a practical and inclusive model for scaling video captioning that recognizes the people who most need captions (deaf and hard of hearing individuals and second-language learners) as valuable contributors rather than passive consumers. For organizations struggling with the cost and scale of video captioning (especially educational institutions with MOOC content), BandCaption demonstrates that combining ASR with targeted human correction from diverse workers can be both cost-effective and higher in quality than either approach alone. The finding that hearing-impaired users prioritize punctuation while hearing users dismiss it is a critical insight for caption quality standards: punctuation significantly affects comprehension for people who cannot hear the speaker's intonation and pauses. For accessibility practitioners, this research challenges the assumption that captioning requires native-speaker proficiency and shows how inclusive crowdsourcing workflows can transform accessibility consumers into accessibility producers.

Tags: captioning · crowdsourcing · video accessibility · speech recognition · deaf and hard of hearing · second language learners · automatic speech recognition