Challenges in Automatic Speech Recognition for Adults with Cognitive Impairment

Michelle Cohn, Alyssa Lanzi, Yui Ishihara, Chen-Nee Chuah, Georgia Zellou, Alyssa Weakley · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791581

Summary

This CHI 2026 paper quantifies how well state-of-the-art automatic speech recognition (ASR) handles voice commands produced by older adults with cognitive impairment, and asks which acoustic features actually predict transcription accuracy. The authors draw on the Voice Assistant System (VAS) corpus from DementiaBank — recordings of 101 older adults reading 30 scripted Amazon Alexa commands (IoT controls, questions, and reminders) either in person or virtually. After exclusions for task compliance and cognitive severity, 83 speakers remained: 30 cognitively normal (CN), 28 with mild cognitive impairment (MCI, MoCA 20-25), and 25 with a clinical diagnosis of dementia. Each utterance was passed once through the Whisper 'small' (244M) and 'medium' (769M) English ASR models, which can run locally for HIPAA-compatible processing, and word error rate (WER) was computed against human-annotated transcripts. Acoustic features — speech rate, articulation rate, pause ratio, intensity, jitter, shimmer, mean pitch, pitch variation — were extracted in Praat and modeled with mixed-effects regression, controlling for age, gender, race, education, and prior voice-technology experience. The paper is framed explicitly through Ability-Based Design: rather than asking users to adapt to fragile ASR, voice systems must adapt to the speech production patterns of people living with dementia.
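The WER metric described above can be sketched as a word-level edit distance divided by the number of reference words. This is a minimal illustration, not the authors' scoring pipeline: the function name `word_error_rate` is hypothetical, and the text normalization (lowercasing and whitespace splitting only) is an assumption, since the summary does not say how transcripts were normalized.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: normalization is only lowercasing and whitespace splitting;
    punctuation is not stripped.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) transcript: one substitution plus one deletion
# against a six-word reference.
example_wer = word_error_rate(
    "alexa turn the bedroom light on",
    "alexa turn the bedroom right",
)
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.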

Key findings

Speech from participants with dementia produced a significantly higher WER than cognitively normal speech (coef=0.14, t=5.17, p<0.001), more than double the CN rate (0.21 vs. ~0.08), while MCI was statistically indistinguishable from CN on these scripted commands. The gap was starkest for IoT commands such as 'Alexa, turn the bedroom light on' (precisely the utterances aging-in-place systems depend on), where the Whisper 'small' model reached nearly 0.5 WER for dementia speakers; the larger 'medium' model closed the gap substantially but did not eliminate it. Acoustic analysis confirmed that speakers with dementia had slower speech rates and lower articulation rates, produced quieter speech (lower intensity), and showed more voice-quality perturbation (higher shimmer) and lower mean pitch. Counter to prior spontaneous-speech findings, they also had a lower pause ratio, likely because they paused less after the wake word 'Alexa.' In the combined acoustic model, three features reliably predicted WER: higher shimmer, lower intensity, and lower pause ratio were each associated with higher WER. Two dementia participants produced so few recognizable wake words that they were excluded from the accuracy analysis entirely, pointing to an additional barrier upstream of transcription. The paper argues these patterns are not edge cases but predictable consequences of well-documented dementia-related speech changes that ASR systems, trained largely on neurotypical adult speech, fail to accommodate.
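To make the timing measures concrete, the sketch below computes speech rate, articulation rate, and pause ratio from a list of labeled segments. The definitions used here are common conventions and are assumptions, not taken from the paper: speech rate as syllables per second of total duration, articulation rate as syllables per second excluding pauses, and pause ratio as pause time divided by total time. The `Segment` type, the function name, and the example timings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float    # seconds
    end: float      # seconds
    syllables: int  # 0 marks a silent pause


def timing_features(segments: list[Segment]) -> dict[str, float]:
    """Compute timing measures under the assumed definitions above."""
    total = segments[-1].end - segments[0].start
    pause = sum(s.end - s.start for s in segments if s.syllables == 0)
    syllables = sum(s.syllables for s in segments)
    speaking = total - pause
    if total <= 0 or speaking <= 0:
        raise ValueError("utterance must contain some speech")
    return {
        "speech_rate": syllables / total,           # syllables per second
        "articulation_rate": syllables / speaking,  # pauses excluded
        "pause_ratio": pause / total,
    }


# Made-up timings: wake word, a 0.4 s pause, then the command.
with_pause = timing_features([
    Segment(0.0, 0.6, 3),  # "Alexa"
    Segment(0.6, 1.0, 0),  # pause after the wake word
    Segment(1.0, 3.0, 6),  # "turn the bedroom light on"
])

# Same speech with no pause after the wake word: pause ratio drops to 0.
without_pause = timing_features([
    Segment(0.0, 0.6, 3),
    Segment(0.6, 2.6, 6),
])
```

Under these assumed definitions, skipping the pause after the wake word lowers the pause ratio without changing the articulation rate, which is the pattern the findings attribute to speakers with dementia.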

Relevance

For anyone designing voice-based AgeTech, smart-home accessibility, or clinical conversational agents, this paper offers a rare empirical link between specific acoustic dimensions (intensity, shimmer, pause ratio) and downstream recognition failures — the kind of evidence that can drive concrete ASR training-data and interaction-design decisions rather than vague 'support older adults' language. Its four HCI design directions are actionable: speaker-personalized ASR via brief calibration tasks and periodic fine-tuning; human-in-the-loop transcript review by users or authorized caregivers with privacy controls; interaction-level adaptation such as dynamic microphone gain, longer end-of-utterance windows, and delayed turn-taking for slower speakers; and redundant non-voice engagement paths (e.g., physical buttons) for users who struggle with wake words. Limitations are explicit: the study analyzes read rather than spontaneous speech (likely underestimating real-world WER), uses one language and U.S. cultural context, relies on MoCA rather than clinical diagnosis for the MCI group, and does not distinguish dementia etiologies (Alzheimer's, vascular, Parkinson's) that may produce different linguistic patterns. The authors also note that larger Whisper models reduce but do not erase disparities, reinforcing that scaling alone is not an accessibility strategy for cognitive impairment.
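One of the interaction-level adaptations named above, a longer end-of-utterance window, can be illustrated with a toy endpointing rule: stop listening once a run of silent frames exceeds a timeout. This is a sketch under stated assumptions, not a description of how Alexa or any particular ASR front end works. The function name, the 20 ms frame size, the two timeout values, and the per-frame voiced/unvoiced input are all hypothetical.

```python
def end_of_utterance(frames_voiced, frame_ms=20, silence_timeout_ms=500):
    """Return the index of the frame at which listening would stop.

    Listening stops once `silence_timeout_ms` of consecutive silence
    follows some speech. Returns None if that never happens.
    """
    frames_needed = silence_timeout_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for i, voiced in enumerate(frames_voiced):
        if voiced:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# Hypothetical slow utterance: wake word (600 ms), an 800 ms pause,
# the command (1000 ms), then trailing silence (1200 ms).
frames = [True] * 30 + [False] * 40 + [True] * 50 + [False] * 60

cut_short = end_of_utterance(frames, silence_timeout_ms=500)
cut_long = end_of_utterance(frames, silence_timeout_ms=1000)
```

With the shorter window, listening ends during the pause after the wake word (frame 54), before the command begins at frame 70. With the longer window, it ends only in the trailing silence (frame 169). The trade-off is added response latency for every user, which is one reason such a window might be adapted per speaker rather than lengthened globally.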

Tags: automatic speech recognition · ASR · dementia · Alzheimer's disease · mild cognitive impairment · AgeTech · voice assistant · Amazon Alexa · Whisper · aging in place · ability-based design · acoustic analysis · word error rate · smart home · human-in-the-loop

Standards referenced: HIPAA