An Enhanced Electrolarynx with Automatic Fundamental Frequency Control based on Statistical Prediction

Kou Tanaka, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura · 2015 · ASSETS '15: Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility · doi:10.1145/2700648.2811340

Summary

This demonstration paper presents a prototype system that enhances electrolarynx speech by automatically controlling pitch (fundamental frequency, or F0) using statistical prediction. An electrolarynx is a speaking aid used by laryngectomees, people whose larynx has been surgically removed, typically because of laryngeal cancer. The device is held against the neck and generates mechanical vibrations that are conducted into the oral cavity, where they are articulated into speech. While electrolaryngeal (EL) speech is quite intelligible, it sounds distinctly unnatural because the mechanical excitation produces a flat, monotonous pitch rather than the natural pitch variations that convey emotion, emphasis, and linguistic meaning. The researchers developed a method that predicts natural F0 patterns from the EL speech signal in real time, using a statistical model trained on parallel recordings of EL speech and natural speech. The system runs two processes simultaneously: prediction (analyzing the incoming EL speech and predicting an appropriate F0 value frame by frame) and articulation (the user speaking with the electrolarynx while its vibration frequency is modulated automatically according to the predictions). This allows laryngectomees to produce more natural-sounding speech while using the device in the same familiar manner as a conventional electrolarynx. The prototype uses a laptop computer, a standard close-talk microphone, a digital-to-analog (D/A) converter, and a commercial electrolarynx (Yourtone). The system introduces 150 ms of total latency, 50 ms for the prediction algorithm and 100 ms for D/A conversion, though the authors note that specialized hardware could reduce this.
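The latency budget and the frame-by-frame structure described above can be illustrated with a small sketch. It is not the authors' implementation: the 5 ms frame shift, the idea that the 50 ms prediction delay corresponds to waiting for later frames, and the names `delayed_f0_stream` and `placeholder_predict` are all assumptions made for illustration, and the placeholder predictor stands in for the trained statistical model, which the summary does not specify.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List

PREDICTION_DELAY_MS = 50  # stated in the summary
DA_DELAY_MS = 100         # stated in the summary
FRAME_SHIFT_MS = 5        # ASSUMPTION: the summary gives no frame shift


def total_latency_ms() -> int:
    """Sum the two latency components reported for the prototype."""
    return PREDICTION_DELAY_MS + DA_DELAY_MS


def placeholder_predict(window: List[float]) -> float:
    """HYPOTHETICAL stand-in for the trained statistical model.

    Maps the mean of the buffered per-frame features to an F0 value in Hz.
    """
    return 100.0 + 10.0 * sum(window) / len(window)


def delayed_f0_stream(
    frames: Iterable[float],
    predict: Callable[[List[float]], float],
    delay_frames: int,
) -> Iterator[float]:
    """Yield one F0 value per frame, delayed by `delay_frames` frames.

    A value is emitted only once the buffer holds the current frame plus
    `delay_frames` later ones, which is one way a fixed prediction delay
    can arise (an assumption, not a detail taken from the paper).
    """
    window: Deque[float] = deque(maxlen=delay_frames + 1)
    for frame in frames:
        window.append(frame)
        if len(window) == window.maxlen:
            yield predict(list(window))


if __name__ == "__main__":
    delay = PREDICTION_DELAY_MS // FRAME_SHIFT_MS
    f0_values = list(delayed_f0_stream([0.0] * 15, placeholder_predict, delay))
    print(total_latency_ms(), delay, len(f0_values))
```

Under the assumed 5 ms frame shift, the 50 ms prediction delay corresponds to 10 frames, so a 15-frame input yields 5 F0 values. The 100 ms D/A delay is added after prediction and is independent of the frame shift.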

Key findings

Objective evaluation showed that the F0 patterns produced by the prototype have a correlation coefficient of 0.91 with those of the earlier simulation-based system, indicating that the physical implementation closely matches the performance validated in prior simulation studies. Visual comparison of F0 contours shows that the prototype produces speech with naturally varying pitch, in contrast to the flat, monotonous contours of conventional EL speech. Previous simulation-based evaluation (referenced in the paper) had shown that the statistical F0 prediction method significantly improves perceived naturalness without degrading listenability or intelligibility relative to the original EL speech. The 50 ms processing delay required for prediction creates some misalignment between the articulated sounds and the F0 patterns, but prior perceptual studies found its impact to be minimal. The system was trained and tested on approximately 50 sentences from the ATR phonetically balanced sentence set, using 5-fold cross-validation with 40 utterance pairs for training and 10 for evaluation in each fold. The source speech was EL speech produced by a non-disabled male speaker, and the target F0 patterns came from natural speech by a professional female speaker.
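As a rough illustration of the two evaluation ingredients mentioned above, the sketch below computes a Pearson correlation coefficient between two F0 contours and builds a 5-fold split of 50 utterance indices into 40 for training and 10 for evaluation. The restriction to frames where both contours are voiced (F0 > 0), the contiguous fold layout, and the function names `f0_correlation` and `five_fold_splits` are assumptions for illustration; the summary does not say how the authors computed the coefficient or assigned sentences to folds.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def f0_correlation(f0_a: Sequence[float], f0_b: Sequence[float]) -> float:
    """Pearson correlation between two frame-aligned F0 contours.

    ASSUMPTION: frames where either contour is unvoiced (F0 <= 0) are skipped.
    """
    pairs = [(a, b) for a, b in zip(f0_a, f0_b) if a > 0 and b > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two frames voiced in both contours")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant contour")
    return cov / math.sqrt(var_a * var_b)


def five_fold_splits(
    n_items: int, n_folds: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists using contiguous folds.

    ASSUMPTION: folds are contiguous blocks; any remainder stays in training.
    """
    fold_size = n_items // n_folds
    for k in range(n_folds):
        test = list(range(k * fold_size, (k + 1) * fold_size))
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test


if __name__ == "__main__":
    prototype_f0 = [100.0, 110.0, 120.0, 0.0]   # made-up values, Hz
    simulated_f0 = [200.0, 220.0, 240.0, 50.0]  # made-up values, Hz
    print(f0_correlation(prototype_f0, simulated_f0))
    for train, test in five_fold_splits(50):
        print(len(train), len(test))
```

The contours in the usage example are invented and only exercise the function. With 50 items, each fold holds out 10 utterances and trains on the remaining 40, matching the split sizes reported above.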

Relevance

This work addresses a significant quality-of-life issue for laryngectomees. While electrolarynx devices restore the ability to communicate verbally after laryngectomy, the robotic, monotonous quality of EL speech can be socially stigmatizing and emotionally distressing for users. By making EL speech sound more natural without requiring changes to how the device is used, this technology could improve social acceptance and psychological wellbeing for laryngectomees. For assistive technology practitioners, it demonstrates how machine learning can enhance existing assistive devices rather than replace them, an important design principle that respects users' existing skills and habits. The approach of predicting natural speech patterns from impaired speech signals has potential applications beyond electrolarynx enhancement, such as improving synthetic speech for other voice disorders. As a demonstration paper, this work presents a proof of concept rather than a comprehensive user evaluation with laryngectomees. The 150 ms latency of the current prototype, while acceptable, would benefit from dedicated hardware development. Future work should include evaluation with actual laryngectomees and assessment of long-term usability and user satisfaction.

Tags: electrolarynx · laryngectomy · speech synthesis · assistive technology · voice prosthesis · fundamental frequency · machine learning · speech rehabilitation