Development of a Real-time Bionic Voice Generation System based on Statistical Excitation Prediction

Farzaneh Ahmadi, Kazuhiro Kobayashi, Tomoki Toda · 2019 · Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2019) · doi:10.1145/3308561.3354591

Summary

This demonstration paper presents the first real-time implementation of the Pneumatic Bionic Voice (PBV) system, a voice prosthesis for people who have undergone laryngectomy: surgical removal of the larynx, typically because of advanced throat cancer. Without a larynx, a person loses both the vocal folds (the excitation source for speech) and the airway connection between lungs and mouth, and breathes instead through a stoma (opening) in the neck. Existing options for voice restoration include tracheoesophageal puncture (surgical, invasive), esophageal speech (difficult to learn), and electrolarynx devices (an external vibration source held against the neck, which produces a robotic-sounding voice). The PBV is an electronic adaptation of the Pneumatic Artificial Larynx (PAL), a mechanical device placed externally between the stoma and mouth that acts as a fixed pair of vocal folds driven exclusively by the user’s respiration. Among non-invasive voice prostheses, the PAL produces exceptionally high-quality, natural-sounding voice, but as a physical mechanical device it has practical limitations. The PBV aims to replicate the PAL’s voice generation mechanism electronically using statistical voice conversion, converting respiration pressure signals into synthesised voice in real time. The system uses Arduino-based pressure sensors to capture respiration at the stoma and mouth at 1 kHz, extracts features (mel-cepstral coefficients, fundamental frequency, aperiodicity), maps them through Gaussian Mixture Models trained on PAL recordings, and synthesises voice using the WORLD vocoder.
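To make the Gaussian Mixture Model mapping step more concrete, the following is a minimal sketch of frame-wise GMM regression: given a joint GMM over stacked source and target features, it returns the minimum mean-square-error estimate of the target for one source frame. This is an illustration under stated assumptions, not the authors' implementation. The paper uses maximum likelihood parameter generation, whereas this sketch uses the simpler per-frame conditional mean; the function name `gmm_convert` and all parameter names are hypothetical.

```python
import numpy as np


def gmm_convert(x, weights, means, covs, dx):
    """Per-frame MMSE estimate E[y | x] under a joint GMM over z = [x; y].

    Hypothetical helper for illustration only.
    x       : (dx,) source feature vector for one frame
    weights : (M,) mixture weights
    means   : (M, dx + dy) joint mean vectors
    covs    : (M, dx + dy, dx + dy) joint full covariance matrices
    dx      : number of source dimensions at the start of each joint vector
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)

    n_mix = weights.shape[0]
    log_post = np.empty(n_mix)
    cond_means = []
    for m in range(n_mix):
        mu_x, mu_y = means[m, :dx], means[m, dx:]
        s_xx = covs[m, :dx, :dx]
        s_yx = covs[m, dx:, :dx]
        diff = x - mu_x
        solved = np.linalg.solve(s_xx, diff)  # s_xx^{-1} (x - mu_x)
        _, logdet = np.linalg.slogdet(s_xx)
        # Unnormalised log posterior: log weight + Gaussian log-likelihood of x.
        log_post[m] = np.log(weights[m]) - 0.5 * (
            diff @ solved + logdet + dx * np.log(2.0 * np.pi)
        )
        # Conditional mean of y given x for this mixture component.
        cond_means.append(mu_y + s_yx @ solved)

    # Normalise posteriors in a numerically stable way.
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ np.array(cond_means)
```

With a single component whose joint mean is [0, 1] and whose cross-covariance is 0.5 (unit variances), a source value of 2 maps to 1 + 0.5 × 2 = 2. A real-time system would apply such a mapping to every incoming frame, which is why the per-frame cost matters.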

Key findings

The real-time PBV system achieved an average processing delay of only 23 ms between respiration input and synthesised audio output, well within the threshold for natural-feeling speech production. The system closely matched its offline counterpart: fundamental frequency estimation agreed with the offline version at 98% accuracy, and the mel-cepstral distortion between the real-time and offline outputs was only 3.9 dB. The difference in extracted respiration features between the real-time and offline systems over 45-second recordings was on the order of 10⁻⁷, indicating near-identical processing. The system was implemented in C++ on a standard iMac (32 GB RAM, 4.2 GHz Intel Core i7) with a dual-threaded architecture: one thread for respiration-to-voice conversion and another for the audio callback, synchronised via shared memory rather than an external clock. Pressure sensor data is received over UDP, chosen because its connectionless nature eliminates waiting times. A key technical challenge was adapting the offline batch-processing approach (which benefits from cross-frame correlations) to a frame-by-frame real-time system. The real-time version uses low-delay maximum likelihood parameter generation (MLPG), a recursive algorithm that requires only n = 2 look-ahead frames (10 ms of additional delay). The next step is live trials with laryngectomees.
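Two of the figures above can be illustrated with a short sketch. The first function computes mel-cepstral distortion in dB using the commonly used definition (10 / ln 10) · sqrt(2 · Σ (c_d − c'_d)²), averaged over frames and excluding the 0th (energy) coefficient; whether the paper uses exactly this variant is an assumption. The second function reproduces the look-ahead arithmetic, assuming a 5 ms frame shift, which is inferred from the stated 10 ms for n = 2 frames rather than given directly. Both function names are hypothetical.

```python
import numpy as np


def mel_cepstral_distortion(mc_a, mc_b):
    """Mean mel-cepstral distortion in dB between two aligned sequences.

    mc_a, mc_b : (frames, order + 1) mel-cepstral coefficient arrays.
    The 0th coefficient (energy) is excluded, a common convention that is
    assumed here rather than taken from the paper.
    """
    mc_a = np.asarray(mc_a, dtype=float)
    mc_b = np.asarray(mc_b, dtype=float)
    if mc_a.shape != mc_b.shape:
        raise ValueError("sequences must be time-aligned and equally sized")
    diff = mc_a[:, 1:] - mc_b[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())


def lookahead_delay_ms(n_frames, frame_shift_ms=5.0):
    """Extra latency from waiting for n look-ahead frames.

    frame_shift_ms = 5.0 is an assumption inferred from the summary's
    figures (2 frames -> 10 ms).
    """
    return n_frames * frame_shift_ms
```

Identical sequences give a distortion of 0 dB, and `lookahead_delay_ms(2)` returns 10.0 under the assumed frame shift, matching the additional delay quoted for low-delay MLPG.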

Relevance

This work addresses an underserved population in assistive technology: the approximately 60,000 people who undergo laryngectomy annually worldwide. Current voice prostheses involve significant trade-offs: tracheoesophageal puncture requires surgery and ongoing maintenance, esophageal speech takes months to learn with limited success rates, and electrolarynx devices produce a distinctly robotic voice that carries social stigma. The PAL produces the most natural-sounding non-invasive voice but is a physical mechanical device with practical limitations. By electronically modelling the PAL’s voice generation from respiration signals alone, the PBV opens the possibility of a compact, wearable device that produces natural-sounding speech controlled by the user’s own breathing patterns, preserving the intuitive respiratory control of speech that laryngectomees already understand. For accessibility practitioners, this work demonstrates how statistical voice conversion techniques originally developed for voice morphing and text-to-speech can be repurposed for assistive applications. The 23 ms latency is particularly significant, because real-time responsiveness is critical for natural conversational speech.

Tags: voice prosthesis · laryngectomy · speech accessibility · voice conversion · assistive technology · bionics · signal processing