Evaluation of Real-time Captioning by Machine Recognition with Human Support

Hironobu Takagi, Takashi Itoh, Kaoru Shinkawa · 2015 · Proceedings of the 12th International Web for All Conference (W4A) · doi:10.1145/2745555.2746648

Summary

This paper from IBM Research Tokyo investigates a hybrid approach to real-time captioning that combines automated speech recognition (ASR) with human correction to make workplace meetings accessible to deaf and hard of hearing (DHH) employees. Professional stenography services (CART) cost $120-200 per hour, which makes them impractical for routine workplace meetings. The challenge is particularly acute for Japanese, where approximately 4,000 characters are in everyday use and normal speaking speeds of 400-600 characters per minute (cpm) far exceed typical typing speeds of 100-200 cpm. The researchers built a web-based captioning system that uses Etherpad as a collaborative text editor: ASR automatically segments and transcribes the speech, and the segments are then distributed to multiple non-expert human captioners, who listen to the audio and correct recognition errors. The system was tested in four real-world Japanese meeting sessions lasting from 3.7 to 109.9 minutes, with 3-8 captioners per session. The study addresses the practical gap between fully automated ASR (too inaccurate) and fully manual stenography (too expensive) by relying on non-expert workers who only need to correct errors rather than transcribe from scratch.
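The gap between speaking and typing speed can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function name `typists_needed` is hypothetical, and it assumes typists work in perfect parallel with no coordination overhead, which real captioning teams do not achieve.

```python
import math


def typists_needed(speaking_cpm: float, typing_cpm: float) -> int:
    """Smallest number of typists whose combined speed covers the speaking rate.

    Assumption (not from the paper): work splits perfectly across typists.
    """
    if speaking_cpm < 0 or typing_cpm <= 0:
        raise ValueError("speaking_cpm must be >= 0 and typing_cpm must be > 0")
    return math.ceil(speaking_cpm / typing_cpm)


# Ranges quoted in the summary: speech at 400-600 cpm, typing at 100-200 cpm.
best_case = typists_needed(speaking_cpm=400, typing_cpm=200)   # slow speech, fast typist
worst_case = typists_needed(speaking_cpm=600, typing_cpm=100)  # fast speech, slow typist
print(best_case, worst_case)
```

Under these idealized assumptions, typing everything from scratch would take between two and six typists, which helps explain why the study has captioners correct ASR output instead.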

Key findings

ASR error ratios ranged from 29.3% to 42.8% across sessions and were heavily influenced by microphone quality and setup: handheld microphones held close to the speaker produced significantly better results than standing microphones, which picked up ambient noise. Average caption latency ranged from 22.8 to 58.8 seconds, with human correction accounting for 84.8-93.1% of the total. This is substantially higher than the 2.89-second latency of crowd-based systems such as Legion:Scribe. Captioner skill varied considerably, and the fastest captioner consistently outperformed the others across sessions. The researchers estimated that as few as two skilled captioners could provide real-time captions, whereas four would be needed if the captioners were less skilled. DHH participants valued the verbatim output over precis writing (which captures only about 20% of conversation content), but found the continuously updating caption display difficult to follow because corrections appeared in real time across multiple lines.
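One common way to quantify an error ratio like those above is the character-level edit distance between the ASR output and a reference transcript, divided by the reference length. The sketch below shows that general technique; this summary does not state how the paper computed its figures, so the definition here is an assumption, and the function names are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1       # drop a reference character
            insertion = curr[j - 1] + 1  # add a hypothesis character
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def char_error_ratio(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of four gives a ratio of 0.25.
print(char_error_ratio("abcd", "abxd"))
```

Because Python strings are sequences of Unicode code points, the same functions apply to Japanese text character by character, which suits a language that is not normally segmented into space-delimited words.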

Relevance

This research demonstrates a middle ground between expensive professional captioning and unreliable fully automated solutions for workplace accessibility. The hybrid ASR-plus-human model is particularly relevant to organizations that want to make routine meetings accessible without the cost of CART services. Although a latency of roughly 23-59 seconds is too high for real-time participation in a conversation, the approach shows promise for providing the verbatim records that DHH employees value over summarized alternatives. The finding that microphone setup significantly affects accuracy is immediately actionable for any organization implementing captioning based on speech recognition. ASR accuracy has improved dramatically since 2015, which should reduce the human correction burden and the latency and make this hybrid approach increasingly viable for daily workplace use.

Tags: real-time captioning · deaf and hard of hearing · automated speech recognition · workplace accessibility · Japanese · speech recognition · captioning