Towards More Robust Speech Interactions for Deaf and Hard of Hearing Users
Raymond Fok, Harmanpreet Kaur, Skanda Palani, Martez E. Mott, Walter S. Lasecki · 2018 · Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2018) · doi:10.1145/3234695.3236343
Summary
This University of Michigan study addresses a largely overlooked accessibility gap. Much research has focused on giving deaf users access to spoken output (via captioning or sign language), but almost no work has addressed improving their ability to provide speech input to voice-controlled devices. Many deaf and hard-of-hearing (DHH) individuals can and do speak, but their speech differs from hearing speech because they receive incomplete acoustic feedback from their own voices. The result is phonological errors (substitutions, omissions, consonant clustering), vowel prolongation, extraneous pauses, and rhythmic inconsistency. Because automatic speech recognition (ASR) systems are trained primarily on hearing speech, they perform poorly on deaf speech. The researchers systematically evaluated both automated (Google Speech Recognition API) and human-powered (Amazon Mechanical Turk crowd workers) approaches to transcribing deaf speech, using the Clarke sentences dataset: audio recordings from 650 DHH individuals, each with an assigned intelligibility score on a 0-50 scale. They tested clips at three intelligibility levels (30, 40, and 50), a range that encompasses approximately 75% of the DHH population, and also evaluated a real-world dataset of Alexa voice commands spoken by a DHH individual.
Key findings
Baseline results showed that both approaches performed poorly on deaf speech. ASR had an average word error rate (WER) of 0.70, while individual crowd workers averaged a WER of 0.54: significantly better than ASR, but still far too high for practical use (usable transcripts require a WER below 0.25). The crowd workers' advantage over ASR grew with intelligibility, from only 11% lower WER at level 30 to 43% lower at level 50. Several techniques for improving individual worker performance were tested. Speed modification (slowing down or speeding up clips) had no significant effect, suggesting that the core challenge is intelligibility, not pace. Decomposing audio into single-word segments actually hurt performance by removing linguistic context. Providing thematic context ("these are Alexa commands") improved transcription quality by 26%. The most promising approach was an iterative crowdsourcing workflow in which five workers transcribed clips at each step, and all five transcriptions were passed to the next step's workers. After 10 iterations, this workflow reduced WER by 52% at intelligibility level 40 and by 74% at level 50 compared to individual workers, and achieved a WER of 0.17 for Alexa commands. However, the iterative approach failed to improve quality at intelligibility level 30, where a "priming" effect led workers to converge on the same incorrect transcription. Transcriptions converged quickly: the biggest gains came between steps 1 and 2, and quality plateaued by step 5.
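To make the WER figures above concrete, here is a minimal sketch in Python. It assumes the standard definition of WER: the word-level edit distance (substitutions, deletions, and insertions) between a reference and a hypothesis transcript, divided by the number of reference words. The summary does not say how the study normalized text before scoring, so the lowercasing and whitespace tokenization used here are assumptions, and the function names and example phrases are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which improved_wer is lower than baseline_wer."""
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical example: two of five reference words are transcribed wrongly.
print(word_error_rate("turn on the kitchen lights", "turn on the chicken light"))
# Relative gap between the average ASR and crowd-worker WERs quoted above.
print(relative_reduction(0.70, 0.54))
```

Under this definition WER can exceed 1.0 when the hypothesis contains many extra words. Applied to the averages quoted above, `relative_reduction(0.70, 0.54)` comes out to roughly 0.23, which falls between the 11% and 43% per-level gaps reported for levels 30 and 50.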
Relevance
This paper addresses a fundamental equity issue: as computing increasingly moves toward voice-first interfaces (smart speakers, IoT devices, voice assistants), DHH individuals who speak are being excluded from input, not just output. The practical implication is that current ASR systems need significant improvement to handle the diversity of human speech, including deaf speech, dysarthric speech, and heavily accented speech. For system designers, the key finding is that domain-specific context substantially improved human transcription of deaf speech (by 26% in this study) and may help automated recognition as well, since smart home devices and personal assistants operate in bounded domains that should be leveraged. The iterative crowdsourcing approach demonstrates that collective human intelligence can substantially outperform individuals, but the priming and convergence problem at low intelligibility levels shows the limits of this approach and suggests that hybrid systems may be needed. The paper also implies that personalized ASR models trained on individual deaf users' speech patterns could be transformative, though this remains computationally challenging for consumer devices.
Tags: deaf and hard of hearing · automatic speech recognition · deaf speech · crowdsourcing · speech intelligibility · voice interface · human computation · smart speaker