Tracked Speech-To-Text Display: Enhancing Accessibility and Readability of Real-Time Speech-To-Text
Raja S. Kushalnagar, Gary W. Behm, Aaron W. Kelstone, Shareef Ali · 2015 · ASSETS '15: Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility · doi:10.1145/2700648.2809843
Summary
This research addresses a subtle but significant barrier facing deaf and hard of hearing (DHH) students in educational settings: visual dispersion. Hearing students can watch lecture visuals (slides, demonstrations, whiteboard) while listening to the speaker's explanation, but DHH students must constantly shift their gaze between the speech-to-text display and other visual information sources. Because this access is sequential rather than simultaneous, DHH students spend less time on lecture visuals and frequently lose their place in the rolling caption text when returning from other content. The researchers developed a Tracked Speech-to-Text Display (TSD) that uses a Microsoft Kinect 2 to track the presenter's location and project captions at a fixed distance above the presenter's head. This minimizes visual dispersion by keeping both the caption display and the presenter within the student's peripheral vision. The system integrates with C-Print, a widely used captioning system developed at NTID/RIT in which trained typists use abbreviation expansion to transcribe speech in near real time. The TSD system is designed to be portable (laptop, Kinect 2, optional projector), affordable (approximately $1,000), and quick to set up (under one minute). A key design feature is the ability to configure the number of displayed caption lines (1 to 6), recognizing that readers have different needs based on reading speed and English fluency. The system also includes smoothing algorithms to reduce jarring movement as the display tracks the presenter.
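The tracking behavior described above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say which smoothing algorithm TSD uses, so exponential smoothing is swapped in here as one common choice. The class name, the `alpha` smoothing factor, the `offset_above_head` value, and the coordinate convention (y increasing upward, arbitrary units) are all hypothetical. Only the 1 to 6 line limit and the idea of a fixed offset above the presenter's head come from the summary.

```python
from collections import deque


class TrackedCaptionDisplay:
    """Hypothetical sketch of a tracked caption display (not the TSD code)."""

    def __init__(self, max_lines=3, alpha=0.2, offset_above_head=0.3):
        # The summary states the line count is configurable from 1 to 6.
        if not 1 <= max_lines <= 6:
            raise ValueError("max_lines must be between 1 and 6")
        # alpha is an assumed smoothing factor: smaller means steadier captions.
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.offset_above_head = offset_above_head  # assumed fixed offset
        self._smoothed = None                       # last smoothed head position
        self._lines = deque(maxlen=max_lines)       # rolling caption buffer

    def update_head(self, x, y):
        """Take a raw tracked head position and return the caption anchor."""
        if self._smoothed is None:
            # First sample: nothing to smooth against yet.
            self._smoothed = (x, y)
        else:
            # Exponential smoothing: move a fraction alpha toward the new sample.
            px, py = self._smoothed
            self._smoothed = (px + self.alpha * (x - px),
                              py + self.alpha * (y - py))
        sx, sy = self._smoothed
        # Place captions a fixed distance above the smoothed head position.
        return (sx, sy + self.offset_above_head)

    def add_line(self, text):
        """Append a caption line; the oldest line drops off when full."""
        self._lines.append(text)

    def visible_lines(self):
        """Return the lines currently shown, oldest first."""
        return list(self._lines)
```

In this sketch a lower `alpha` trades responsiveness for steadier captions, which is the kind of trade-off a smoothing step has to balance when the presenter moves quickly. A reader who prefers a longer history would construct the display with `max_lines=6` instead of the default.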
Key findings
Two studies evaluated the system. The first study (44 hearing, 18 deaf students) compared a traditional fixed-location speech-to-text display (SD) with the tracked display (TSD). Both groups found TSD significantly easier for seeing the teacher, whiteboard, and captions simultaneously, and significantly more helpful overall. Students reported feeling more "connected" with the teacher when using TSD. There was no significant difference in ease of reading between SD and TSD, despite initial concerns that caption movement would be distracting. The second study (21 hearing, 11 deaf students) compared TSD with 3 lines of text against TSD with 6 lines. A striking finding emerged: DHH students significantly preferred 3 lines, while hearing students significantly preferred 6 lines. DHH students, for whom speech-to-text is their only access to spoken content, wanted fewer lines to minimize search time when returning their gaze to the display. Hearing students, who use speech-to-text as a backup to review missed content, preferred 6 lines because the 4-5 second transcription delay means older text at the top of a 6-line display better matches what they just heard. In qualitative feedback, students said they appreciated the smoothed movement tracking (refined after initially jerky movement), the ability to see the teacher's expressions and body language, and the customizable line count. Both deaf and hearing students would recommend TSD to others.
Relevance
This research highlights an often-overlooked accessibility challenge: even when accommodations like speech-to-text are provided, DHH students may still face subtle barriers that reduce their access compared to hearing peers. The visual dispersion problem transforms the simultaneous audio-visual experience of lectures into a sequential reading-viewing experience, with measurable impacts on comprehension and time spent on instructional visuals. For practitioners, the key insight is that caption placement matters as much as caption quality. Positioning captions near the primary visual focus area (the speaker) reduces cognitive load and gaze-switching overhead. The finding that DHH and hearing users have opposite preferences for caption line count demonstrates the importance of user customization rather than one-size-fits-all solutions. The research also challenges the assumption that current accommodations provide "full access." DHH graduation rates remain significantly lower than those of hearing peers (16% vs 30%), and the authors argue that improving the accessibility of existing accommodations, rather than merely providing them, is essential for educational equity. Future work includes extending TSD to conference and theater settings and exploring personal viewing devices that support individualized preferences.
Tags: deaf and hard of hearing · speech-to-text · CART · captioning · education · visual dispersion · classroom accessibility · real-time captioning · C-Print