Reconstruction of Phonated Speech from Whispers Using Formant-Derived Plausible Pitch Modulation

Ian V. McLoughlin, Hamid Reza Sharifzadeh, Su Lim Tan, Jingjie Li, Yan Song · 2015 · ACM Transactions on Accessible Computing · doi:10.1145/2737724

Summary

This paper addresses a fundamental communication barrier for people who can only whisper due to voice impairments. While whispering is an occasional choice for most people, it is the primary (sometimes only) communication method for partial laryngectomees, those on prescribed voice rest following laryngeal surgery, and people with conditions such as dysphonia or dysarthria. Current speech technology systems largely assume phonated (voiced) speech, leaving whisper-only speakers unable to use voice-based communication devices or be easily understood on phone calls.

The research presents an improved whisper-to-speech reconstruction system based on sine wave synthesis with a novel formant-derived pitch modulation technique. Unlike training-based approaches that require parallel recordings of whispered and voiced speech from each user, this parametric system requires no prior information and can work immediately for any speaker. The key innovation is deriving an artificial pitch (fundamental frequency, f0) from the differences between formant frequencies (F3-F2 and F2-F1), so that pitch varies with speech content instead of following a flat or arbitrary contour. The system processes whispers through linear predictive coding (LPC) analysis to extract formant frequencies and magnitudes, applies frequency shifts to compensate for whisper-specific formant differences, synthesizes formants as pure sine waves, and modulates the result with an artificially generated pitch signal. This approach avoids hard voiced/unvoiced switching decisions that are error-prone with whisper input.
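The last two stages of the pipeline described above (deriving a pitch value from formant spacing, then modulating a sum of formant sine waves with it) can be illustrated with a minimal sketch. This is not the authors' algorithm: the review does not give their mapping from F3-F2 and F2-F1 to f0, so the linear mapping in `plausible_f0`, its `base_hz` and `span_hz` parameters, and the raised-cosine envelope in `synthesize_frame` are all assumptions chosen for illustration. LPC analysis and the whisper-specific formant shifts are omitted; the formant values are simply supplied by hand.

```python
import math


def plausible_f0(f1, f2, f3, base_hz=120.0, span_hz=40.0):
    """Map formant spacing to an artificial pitch value.

    Hypothetical mapping for illustration only; it assumes f1 < f2 < f3.
    """
    upper_gap = f3 - f2  # F3-F2, one of the two differences named in the summary
    lower_gap = f2 - f1  # F2-F1, the other difference
    ratio = upper_gap / (upper_gap + lower_gap)  # lies in (0, 1)
    # Centre the pitch on base_hz and let it swing by at most span_hz / 2.
    return base_hz + span_hz * (ratio - 0.5)


def synthesize_frame(formants, amplitudes, f0, sample_rate=16000, n_samples=320):
    """Sum one sine wave per formant, then amplitude-modulate at rate f0."""
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        # Each formant is rendered as a pure sine wave at its own magnitude.
        carrier = sum(
            a * math.sin(2.0 * math.pi * f * t)
            for f, a in zip(formants, amplitudes)
        )
        # Assumed pitch envelope: a raised cosine in [0, 1] repeating at f0.
        envelope = 0.5 * (1.0 + math.cos(2.0 * math.pi * f0 * t))
        frame.append(carrier * envelope)
    return frame


# Example with hand-picked, evenly spaced formants (not measured data).
formants = [500.0, 1500.0, 2500.0]
amplitudes = [1.0, 0.5, 0.25]
f0 = plausible_f0(*formants)
frame = synthesize_frame(formants, amplitudes, f0)
```

Because the pitch value is recomputed from the formants of each frame, it changes whenever the formant spacing changes, which is the property the summary attributes to the formant-derived approach. No voiced/unvoiced decision appears anywhere in the sketch.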

Key findings

The enhanced system ("New SWS") was evaluated against four alternatives: original whispers, electrolarynx (EL) speech, a CELP-based reconstruction method, and the authors' earlier sine wave system. Testing used recordings from seven speakers producing 12 vowel-framing words in spoken, whispered, and EL conditions.

Objective evaluation using five distance measures (Cepstral, W-Cep, MFCC, Itakura-Saito, Log-Likelihood Ratio) showed the New SWS method significantly outperformed alternatives. Cepstral distance improved from 0.414 (whispers) to 0.334 (New SWS), MFCC distance dropped from 62.96 to 56.59, and Itakura-Saito distance decreased from 109.60 to 2.11. ANOVA confirmed these differences were statistically significant (p < 0.01 for most measures).

Subjective evaluation with 16 listeners using Mean Opinion Score (MOS) showed the New SWS method (MOS 2.19) outperformed the electrolarynx (1.77), though all reconstruction methods scored below 3.0, indicating the speech still sounds artificial. Listeners described EL speech as "robotic and annoying" while New SWS was "robotic but slightly easier to listen to." The formant-derived pitch produces variation that tracks phoneme boundaries, creating more natural-sounding prosody than flat-pitch alternatives.
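The three distance pairs quoted above sit on very different scales, so it can help to express each as a relative reduction. The short calculation below does that. The percentages are derived here from the figures in this review and are not values reported by the authors. It also assumes that the "from" value in each pair is the unprocessed-whisper score, which the text states explicitly only for the Cepstral measure.

```python
# Distances quoted in this review as (before, New SWS); lower is better.
reported = {
    "Cepstral": (0.414, 0.334),
    "MFCC": (62.96, 56.59),
    "Itakura-Saito": (109.60, 2.11),
}


def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


reductions = {
    name: relative_reduction(before, after)
    for name, (before, after) in reported.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower")
```

This works out to roughly 19% for Cepstral, 10% for MFCC, and 98% for Itakura-Saito. The measures respond differently to the same spectral change, so the percentages are best read per measure, not compared with one another.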

Relevance

This research offers a noninvasive prosthetic solution for people with voice production impairments who retain the ability to whisper. Unlike surgical interventions such as tracheoesophageal puncture or external devices such as an electrolarynx, whisper-to-speech conversion requires no physical modification or visible apparatus, only software processing of the user's whispered input. The training-free nature of this approach is particularly valuable for accessibility. Users can begin immediately without lengthy enrollment sessions, and the system works across different speakers without personalization. This matters for voice rest patients who need temporary assistance, and for healthcare settings where quick deployment is essential.

However, the MOS scores below 3.0 indicate significant room for improvement before such systems could serve as primary communication aids. The authors note that reconstruction quality remains "insufficient" for natural-sounding output. For practitioners, this research demonstrates the feasibility of whisper reconstruction while highlighting the gap between current technology and the quality threshold needed for comfortable everyday use. The open availability of MATLAB source code enables further research and potential real-world applications as the technology matures.

Tags: speech technology · voice reconstruction · laryngectomy · voice disorders · whisper-to-speech · assistive technology · speech prosthesis