A Readability Evaluation of Real-Time Crowd Captions in the Classroom

Raja S. Kushalnagar, Walter S. Lasecki, Jeffrey P. Bigham · 2012 · Proceedings of the 14th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2012) · doi:10.1145/2384916.2384930

Summary

This paper evaluates the readability of real-time captions produced by three approaches in a higher education classroom setting: professional CART (Communication Access Realtime Translation) captionists, automatic speech recognition (ASR), and a novel crowd captioning system in which multiple non-expert classmates simultaneously type partial captions that are automatically aligned and merged into a single transcript. The study addresses a critical accessibility gap: deaf and hard of hearing (DHH) students need real-time captions to access lecture content, but professional captionists cost over $100/hour, are scarce (especially in technical fields), require advance scheduling, and often lack domain-specific vocabulary. ASR is cheaper but produces unreadable output in realistic classroom conditions: accuracy drops below 50% with untrained speakers and standard microphones, and its errors change word meanings rather than simply omitting words. The crowd captioning approach uses the Legion:Scribe system, which accepts partial caption streams from multiple non-expert typists and merges them using an online multiple sequence alignment algorithm. For the evaluation, transcripts were generated from a 50-minute MIT OpenCourseWare lecture containing both technical and non-technical vocabulary (9,137 words, 182.7 WPM). The professional captionist typed at 180 WPM with 4.2-second latency, the crowd of 20 students achieved 130 WPM with 3.87-second latency, and ASR produced 71 WPM with 7.9-second latency.
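
To make the idea of merging partial caption streams concrete, here is a minimal sketch in Python. It is not the Legion:Scribe algorithm: it replaces online multiple sequence alignment with a much simpler rule (order words by timestamp, drop a word if the same word was already emitted within a short time window). The function name merge_partial_captions, the (timestamp, word) input format, and the window parameter are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical input format: (seconds since lecture start, word typed).
TimedWord = Tuple[float, str]


def merge_partial_captions(
    streams: List[List[TimedWord]], window: float = 1.0
) -> List[str]:
    """Merge several partial caption streams into one word sequence.

    Simplified stand-in for sequence alignment: words from all typists are
    sorted by timestamp, and a word is treated as a duplicate if the same
    word (case-insensitive) was already kept within `window` seconds.
    """
    all_words = sorted(w for stream in streams for w in stream)
    merged: List[TimedWord] = []
    for t, word in all_words:
        key = word.lower()
        duplicate = False
        # Scan recently kept words, newest first, until outside the window.
        for prev_t, prev_word in reversed(merged):
            if t - prev_t > window:
                break
            if prev_word.lower() == key:
                duplicate = True
                break
        if not duplicate:
            merged.append((t, word))
    return [word for _, word in merged]


# Two typists each capture part of the same sentence.
typist_a = [(0.0, "the"), (0.5, "cell"), (2.0, "divides")]
typist_b = [(0.6, "cell"), (1.1, "membrane"), (2.1, "divides")]
print(merge_partial_captions([typist_a, typist_b]))
# prints ['the', 'cell', 'membrane', 'divides']
```

Neither typist captured the full sentence, yet the merged output contains every word once. The time-window rule is deliberately crude: it would also drop a word the speaker genuinely repeated within the window, which is one reason an alignment-based merge is the stronger approach.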

Key findings

In a study with 48 participants (21 deaf, 4 hard of hearing, 24 hearing), crowd captions received slightly higher mean readability ratings (3.15 on a 5-point Likert scale) than professional CART captions (3.08), though the difference was not statistically significant. Both were rated significantly higher than ASR (median rating of 1, "very hard"). Hearing students showed a significant preference for crowd captions over professional captions, while deaf students showed no significant preference between the two. Qualitative feedback revealed that transcript flow (the smoothness and pace at which text appeared) was as important as accuracy metrics like coverage and precision. Students found crowd captions easier to read because the vocabulary and phrasing made more sense, the word order was more logical, and errors (typically omitted words or spelling mistakes) were easier to recover from than ASR errors (which substituted wrong words that changed meaning). The crowd captioning approach also produced more accurate technical vocabulary because classmates were familiar with the subject matter. Professional captions sometimes confused students because the stenographer's summarization reflected the stenographer's own understanding of the material rather than the speaker's exact words.
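
The difference between omission errors and substitution errors can be illustrated with the coverage and precision metrics mentioned above. The sketch below uses simplified bag-of-words definitions chosen for illustration (coverage: share of reference words that appear in the caption; precision: share of caption words that appear in the reference). These definitions, the function name, and the example sentences are assumptions, not the paper's exact formulation.

```python
from collections import Counter
from typing import List, Tuple


def coverage_and_precision(
    reference: List[str], caption: List[str]
) -> Tuple[float, float]:
    """Return (coverage, precision) using word counts, ignoring word order.

    Assumed definitions for this sketch:
      coverage  = matched words / words in the reference
      precision = matched words / words in the caption
    """
    ref = Counter(w.lower() for w in reference)
    cap = Counter(w.lower() for w in caption)
    matched = sum((ref & cap).values())  # multiset intersection
    coverage = matched / sum(ref.values()) if ref else 0.0
    precision = matched / sum(cap.values()) if cap else 0.0
    return coverage, precision


reference = "the enzyme binds the substrate".split()

# Typical human error: a word is left out.
omission = "the enzyme binds substrate".split()
# Typical ASR error: a word is replaced by a wrong one.
substitution = "the engine binds the substrate".split()

print(coverage_and_precision(reference, omission))      # prints (0.8, 1.0)
print(coverage_and_precision(reference, substitution))  # prints (0.8, 0.8)
```

Both captions miss one reference word, so coverage is the same, but only the substitution puts an incorrect word in front of the reader. Neither metric captures transcript flow, which the participants' feedback identified as equally important.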

Relevance

This study provides evidence that crowd captioning by non-expert classmates can match the readability of professional CART captioning in educational settings, and for hearing students exceed it, a finding with significant implications for classroom accessibility. The practical advantages are substantial: crowd captioning is much cheaper, immediately available without advance scheduling, scalable across institutions, and better adapted to specialized technical vocabulary. For accessibility practitioners and disability services offices, this research suggests that peer-based captioning systems could supplement or replace professional CART in situations where captionists are unavailable, too expensive, or lack domain expertise. The finding that caption flow matters as much as word-for-word accuracy challenges the assumption that verbatim transcription is always the gold standard. The error analysis is particularly valuable: human typists make "graceful" errors (missing words) that readers can work around, while ASR makes "catastrophic" errors (wrong words) that change meaning and require backtracking. This distinction remains relevant when evaluating modern ASR systems for accessibility accommodations.

Tags: real-time captioning · deaf and hard of hearing · crowdsourcing · classroom accessibility · higher education · speech recognition