Individuality-Preserving Voice Conversion for Articulation Disorders Using Phoneme-Categorized Exemplars

Ryo Aihara, Tetsuya Takiguchi, Yasuo Ariki · 2015 · ACM Transactions on Accessible Computing · doi:10.1145/2738048

Summary

This paper presents a voice conversion system designed to improve speech intelligibility for people with articulation disorders resulting from athetoid cerebral palsy while preserving the speaker's voice individuality. Cerebral palsy affects about 2 in 1,000 births, and 10-15% of those affected develop athetoid symptoms: involuntary movements that disrupt speech, particularly consonant production, which becomes unstable and unclear. Speech recognition with speaker-independent models achieves only 3.5% accuracy for speakers with such articulation disorders, which highlights how severe the communication barrier is. The researchers developed a method based on Nonnegative Matrix Factorization (NMF) that builds a "combined dictionary" in which the source speaker's vowels (which remain relatively stable and carry voice identity) are paired with consonants from a target speaker without articulation disorders. This approach converts disordered speech into clearer speech while keeping the original speaker's voice characteristics. That matters because people with articulation disorders want to communicate in their own voice, not sound like someone else.
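The combined-dictionary idea can be illustrated with a minimal NumPy sketch. It is one reading of the summary above, not the authors' implementation: it assumes the source and target exemplars are already time-aligned column by column, that a vowel/consonant label is available for every exemplar, and that activations are estimated with standard KL-divergence multiplicative updates. The function names (`estimate_activations`, `build_combined_dictionary`, `convert`) and the `sparsity` parameter are hypothetical.

```python
import numpy as np


def estimate_activations(X, D, n_iter=200, sparsity=0.0, eps=1e-12):
    """Estimate nonnegative activations H with X ~ D @ H, keeping D fixed.

    X: (n_bins, n_frames) nonnegative magnitude spectra of the input speech.
    D: (n_bins, n_exemplars) nonnegative dictionary of source-speaker exemplars.
    Uses multiplicative updates for the generalized KL divergence; `sparsity`
    is an assumed L1 penalty weight on H (0 disables it).
    """
    rng = np.random.default_rng(0)
    H = rng.random((D.shape[1], X.shape[1])) + eps
    col_sums = D.sum(axis=0)[:, None]  # shape (n_exemplars, 1)
    for _ in range(n_iter):
        ratio = X / (D @ H + eps)
        H *= (D.T @ ratio) / (col_sums + sparsity + eps)
    return H


def build_combined_dictionary(src_exemplars, tgt_exemplars, is_vowel):
    """Keep the source speaker's columns for vowels, the target's for consonants.

    src_exemplars, tgt_exemplars: (n_bins, n_exemplars), aligned column by column.
    is_vowel: boolean array of length n_exemplars (assumed to be given).
    """
    return np.where(is_vowel[None, :], src_exemplars, tgt_exemplars)


def convert(X, src_dict, out_dict):
    """Decompose X over the source dictionary, then resynthesize with out_dict."""
    H = estimate_activations(X, src_dict)
    return out_dict @ H
```

Calling `convert(X, src_dict, build_combined_dictionary(src_dict, tgt_dict, is_vowel))` would then reuse the activations estimated on the disordered speech, so vowel frames are rebuilt mostly from the speaker's own exemplars and consonant frames from the target speaker's. Sharing activations between paired dictionaries is the usual design in exemplar-based conversion, and it only works if the two dictionaries are aligned frame by frame.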

Key findings

The proposed phoneme-categorized subdictionary method outperformed both conventional Gaussian Mixture Model (GMM)-based voice conversion and standard NMF-based approaches. The system was tested with one speaker, a 60-year-old Japanese man with severe athetoid-type cerebral palsy, and evaluated on 432 utterances using both objective measures and subjective listening tests with 10 participants. All voice conversion methods improved listening intelligibility for more than 50% of samples compared to unconverted speech. The proposed method also scored significantly higher than the alternatives on two key measures: similarity to the source speaker's original voice and naturalness of the converted speech. Phoneme classification with the categorizing dictionaries reached 47% accuracy in its best configuration, which was sufficient for effective conversion despite the challenging input. Spectrograms showed that the converted speech retained vowel characteristics while clarifying previously indistinct consonants.

Relevance

This research addresses a fundamental tension in assistive speech technology: the need to improve intelligibility while respecting the speaker's identity and autonomy. Many people with speech disabilities reject synthesized voices that sound artificial or generic, and this individuality-preserving approach offers a more respectful model for speech assistance. The work is particularly relevant for people with athetoid cerebral palsy, who often cannot use sign language or writing because the same motor symptoms that affect their speech also affect their hands. Limitations include testing with only one speaker with articulation disorders, a higher computational cost than GMM-based methods, and difficulties with certain phoneme categories (nasals, semivowels, liquids). Future applications could extend to other motor speech disorders and potentially to real-time communication aids.

Tags: voice conversion · articulation disorders · cerebral palsy · speech technology · assistive technology · motor speech disorders · signal processing