Exploration of Automatic Speech Recognition for Deaf and Hard of Hearing Students in Higher Education Classes

Janine Butler, Brian Trager, Byron Behm · 2019 · Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2019) · doi:10.1145/3308561.3353772

Summary

This paper presents a qualitative study of how deaf and hard of hearing (DHH) students at the National Technical Institute for the Deaf (Rochester Institute of Technology) experienced automatic speech recognition (ASR) as a supplemental access service in mainstream higher education classes. ASR-generated real-time captions were provided alongside existing access services (sign language interpreters and TypeWell/C-Print human captioning) in biology and statistics courses. The system used Microsoft Custom Speech Service with custom language models built from course-specific vocabulary, glossaries, and prior transcripts from human captionists. These models improved accuracy on domain terms that generic ASR would otherwise mistranscribe (e.g., "atom" as "Adam," "SQL" as "sequel"). Captions appeared on-screen next to the PowerPoint slides and on students’ personal devices. Twenty-six DHH students completed a questionnaire, and eight were interviewed in depth. Participants had diverse communication preferences: 9 preferred ASL or Signed English, 10 preferred spoken English, and 7 preferred simultaneous communication. A nationally certified sign language interpreter conducted all interviews; four participants signed their responses and four spoke them. The study is distinctive in examining ASR as a supplemental rather than primary access service, recognizing that ASR accuracy in these courses (85-90%) was not yet high enough to serve as the sole means of access.

Key findings

Participants found ASR beneficial despite persistent errors, primarily as a supplement to their existing access services. Key benefits: ASR provided access to the instructor’s exact words (unlike sign language interpretation, which conveys meaning rather than verbatim content); students used ASR to catch what they missed from the interpreter or to refer back to after class; ASR appeared alongside the course slides, reducing the need to shift attention between multiple locations; and three signing participants described ASR as less straining than continuously watching an interpreter. However, seven of the eight interviewees identified inaccuracy as ASR’s primary limitation, with errors affecting readability, comprehension, and confidence. When shown passages with varying word error rates (WERs), participants' acceptance fell sharply as accuracy declined: 96% accepted passages at 90% accuracy or higher, 69% accepted 85-89% accuracy, 33% accepted 80-84%, and only 28% accepted passages below 80%. Critically, when domain-specific keywords were mistranscribed ("mitosis" as "my toast," "meiosis" as "my hostess," "cells" as "tells"), even a single error could make a passage incomprehensible for learning purposes. When forced to choose between accuracy and speed, the eight interviewees were split: three preferred speed (to participate in real-time discussion) and five preferred accuracy. Students were more likely to accept current accuracy rates for ASR as a supplementary service (21 of 26) than as a sole service (16 of 25). Confidence in ASR was moderate: only one student was "very confident," nine were "somewhat confident," and the rest were neutral or not confident.
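To make the link between WER, accuracy, and keyword errors concrete, here is a minimal sketch. It assumes, as is common practice but not confirmed by this summary, that "accuracy" means 1 - WER, where WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The function names and the example sentence are hypothetical; the sentence is simply built around the paper's "mitosis" → "my toast" error.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum substitutions + deletions + insertions turning reference into hypothesis."""
    # Simplified tokenization (assumption): lowercase and split on whitespace.
    # Real scoring pipelines usually also normalize punctuation and numbers.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Classic Levenshtein dynamic programming over words, one row at a time:
    # prev[j] = edits needed to turn the reference words seen so far into hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion: reference word missing from output
                curr[j - 1] + 1,         # insertion: extra word in output
                prev[j - 1] + (r != h),  # substitution, or free match when words agree
            )
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word errors / number of reference words."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference transcript must contain at least one word")
    return word_errors(reference, hypothesis) / n


def accuracy(reference: str, hypothesis: str) -> float:
    """Assumed convention: accuracy = 1 - WER, clamped at 0 because WER can exceed 1."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


if __name__ == "__main__":
    # Hypothetical lecture sentence containing one mistranscribed keyword.
    ref = "during mitosis the cell divides into two identical daughter cells"
    hyp = "during my toast the cell divides into two identical daughter cells"
    print(word_errors(ref, hyp))        # prints 2 (one substitution + one insertion)
    print(f"{accuracy(ref, hyp):.0%}")  # prints 80%
```

In this example a single misheard keyword counts as two word errors and drops a ten-word sentence to 80% accuracy, the band that only 33% of participants accepted. The WER figure alone still does not reveal that the lost word is the one term a biology student most needed, which is consistent with the study's finding that keyword errors matter more than the aggregate rate suggests.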

Relevance

This paper provides essential evidence for institutions considering ASR captioning for DHH students. The finding that domain-specific vocabulary errors are far more damaging to comprehension than general transcription errors has direct implications for implementation: custom language models with course-specific terminology are not optional enhancements but ethical necessities. The acceptance-threshold data, which show that accuracy of 90% or higher is needed for broad acceptance, give ASR developers a concrete target for educational deployment. For accessibility practitioners, the study highlights four actionable recommendations: (1) ASR currently works best as a supplement to, not a replacement for, human access services; (2) custom language models must be developed for each course; (3) ASR captions should be placed near other information sources (slides, interpreter) to minimize attention-switching fatigue; and (4) instructors should be coached to speak at a pace ASR can follow. The study also implicitly addresses a fear expressed in prior research, that ASR could eventually replace human captionists and interpreters: at current accuracy levels, ASR plays a valuable complementary role but cannot substitute for human services.

Tags: speech recognition · Deaf and hard of hearing · captioning · higher education · real-time captions · CART · sign language interpreter · word error rate · STEM accessibility