Analysis of Speech Properties of Neurotypicals and Individuals Diagnosed with Autism and Down Syndrome
Mohammed E. Hoque · 2008 · Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '08) · doi:10.1145/1414471.1414554
Summary
This MIT Media Lab study systematically compares speech properties across three groups: neurotypicals (NT), individuals with autism spectrum disorder (ASD), and individuals with Down syndrome (DS). The data comprise 100 minutes of audio from 10 natural one-to-one conversations. Six participants took part: two neurotypicals, three diagnosed with mild to moderate autism, and one with Down syndrome, all recorded at the Groden Center in Providence, Rhode Island. The recording setup used a MacBook connected to an analog camera via an analog-to-digital converter, and each participant had their own recording system. Conversations followed a question-and-answer format led by the NT partner. Over 50 speech features related to segmental and suprasegmental properties were extracted with the Praat speech processing software, including utterance-level statistics for fundamental frequency (F0), duration, rhythm, voice quality, intensity, and formants. Feature mining was then performed with the WEKA machine learning toolkit to identify which features best distinguish the three groups. The search techniques included best-first, greedy stepwise, and ranker methods, combined with the Consistency Subset Evaluator and the Chi-Squared Attribute Evaluator.
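The chi-squared ranking step described above can be illustrated with a small, self-contained sketch. This is not the study's WEKA pipeline: it assumes simple equal-width binning of each numeric feature (WEKA applies its own discretization), and the feature names, function names, and values below are invented toy data chosen only to show the mechanics.

```python
from collections import Counter


def equal_width_bins(values, n_bins=3):
    """Map each numeric value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # min() keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def chi_squared(binned, labels):
    """Pearson chi-squared statistic of the bin-by-class contingency table."""
    n = len(labels)
    bin_totals = Counter(binned)
    class_totals = Counter(labels)
    observed = Counter(zip(binned, labels))
    stat = 0.0
    for b, b_count in bin_totals.items():
        for c, c_count in class_totals.items():
            expected = b_count * c_count / n
            stat += (observed[(b, c)] - expected) ** 2 / expected
    return stat


def rank_features(features, labels, n_bins=3):
    """Return (feature name, score) pairs, highest chi-squared score first."""
    scores = {
        name: chi_squared(equal_width_bins(vals, n_bins), labels)
        for name, vals in features.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy values, NOT data from the study: one row per utterance.
labels = ["NT", "NT", "ASD", "ASD", "DS", "DS"]
features = {
    "min_pitch_hz": [80, 85, 120, 125, 160, 165],           # separates groups
    "jitter": [0.010, 0.020, 0.011, 0.019, 0.012, 0.021],   # overlaps groups
}

ranked = rank_features(features, labels)
for name, score in ranked:
    print(f"{name}: chi2 = {score:.2f}")
```

In this toy setup the pitch feature lines up with the group labels and receives a high score, while the jitter feature is spread evenly across groups and scores zero. A ranker-style search simply orders features by such a score.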
Key findings
The average duration per turn was longer for NTs than for ASD or DS participants, consistent with the experimental design in which NTs led the conversations. The energy parameter yielded much higher values for DS than for NT and ASD, possibly reflecting a tendency toward being easily excited. NTs distributed pauses more proportionately within utterances than ASD and DS participants did. The magnitudes of the maximum rising and falling edges in an utterance or turn were highest for NTs, followed by DS and then ASD. The number of rising and falling edges, however, was comparable between ASD and NT, suggesting that individuals with ASD may be capable of responsive intonation patterns but often fail to articulate them with appropriate parameters. Feature mining revealed that the features similar across all three groups included voice quality features (jitter, shimmer), speaking rate, pause parameters, and second formant values. The features that most distinguished the groups (in order of significance) were minimum pitch, mean pitch, maximum pitch, mean intensity, values of the first and third formants, minimum intensity, energy, and bandwidths of the first and third formants.
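The distinction between the number of rising and falling edges and their magnitudes can be made concrete with a short sketch. The summary does not give the paper's exact definition of an edge, so this assumes an edge is a maximal run of consecutive F0 samples moving in one direction, with magnitude equal to the F0 change across that run. The function names and both contours are hypothetical and for illustration only.

```python
def pitch_edges(f0):
    """Split a voiced F0 contour (Hz) into maximal rising/falling runs.

    Returns a list of (direction, magnitude_hz) tuples, where direction is
    +1 for a rising edge and -1 for a falling edge.
    """
    edges = []
    start = 0        # index where the current run began
    direction = 0    # +1 rising, -1 falling, 0 not yet determined
    for i in range(1, len(f0)):
        delta = f0[i] - f0[i - 1]
        step = (delta > 0) - (delta < 0)
        if step == 0:
            continue  # flat step: keep extending the current run
        if direction == 0:
            direction = step
        elif step != direction:
            # Direction flipped: close the run at the previous sample.
            edges.append((direction, abs(f0[i - 1] - f0[start])))
            start = i - 1
            direction = step
    if direction != 0:
        edges.append((direction, abs(f0[-1] - f0[start])))
    return edges


def summarize(f0):
    """Count edges and report the largest rise and fall in a contour."""
    edges = pitch_edges(f0)
    rising = [m for d, m in edges if d > 0]
    falling = [m for d, m in edges if d < 0]
    return {
        "n_rising": len(rising),
        "n_falling": len(falling),
        "max_rise_hz": max(rising, default=0.0),
        "max_fall_hz": max(falling, default=0.0),
    }


# Hypothetical contours: same shape, different pitch excursions.
wide = [100, 140, 180, 150, 110, 130, 170, 120]
narrow = [100, 105, 110, 106, 102, 104, 108, 103]

wide_stats = summarize(wide)
narrow_stats = summarize(narrow)
print(wide_stats)
print(narrow_stats)
```

Both toy contours contain two rising and two falling edges, but the maximum rise and fall differ by nearly an order of magnitude. This mirrors the pattern reported above: comparable edge counts with different edge magnitudes.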
Relevance
This study provides foundational data for building speech visualization and feedback technologies for individuals with ASD and Down syndrome. The finding that people with ASD produce a comparable number of intonation changes but with inappropriate parameters suggests that real-time visual feedback on specific speech features, such as pitch range and intensity, could help them calibrate their speech production. That is a more tractable problem than teaching entirely new speech patterns. For accessibility practitioners and assistive technology developers, knowing which speech parameters are similar and which differ across groups is directly useful for designing targeted interventions. The work also challenges the common misconception that speech difficulties in ASD and DS reflect a lack of intelligence or social disinterest, highlighting instead that these are specific production challenges that technology could help address.
Tags: speech processing · autism · Down syndrome · speech production · prosody · assistive technology · affective computing · speech analysis