Audio-Visual Speech Understanding in Simulated Telephony Applications by Individuals with Hearing Loss

Linda Kozma-Spytek, Paula Tucker, Christian Vogler · 2013 · Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS) · doi:10.1145/2513383.2517032

Summary

This paper presents two within-subjects experiments, conducted in 2009 with 24 participants and in 2012 with 22, that investigate how video frame rate and audio-video synchrony affect speech understanding by people with hearing loss during simulated video telephone calls. Participants were hard-of-hearing adults who use voice to communicate and wear hearing devices (hearing aids, cochlear implants, or both). The experiments used standardized sentence sets from the CASPER speech perception evaluation system, presented through a simulated wireless phone setup: a flat-panel monitor masked with a cardboard cutout to mimic a phone screen. The 2009 experiment tested low frame rates (7.5 and 15 fps) at QCIF resolution, typical of 3G networks, with varying audio-video delays. The 2012 experiment replicated key conditions and extended testing to 30 fps at near-CIF resolution, reflecting improvements brought by LTE networks and devices such as the iPhone with FaceTime. Participants repeated each sentence they heard and saw, and staff scored the words repeated correctly. This verbatim-repetition method is standard in audiology and provides a finer-grained assessment than comprehension questions.

Key findings

Adding video to audio significantly improved speech understanding across nearly all conditions, confirming that the benefit of lipreading carries over to telephony. Higher frame rates produced better performance: raising the frame rate from 7.5 to 15 fps yielded a significant improvement, as did raising it from 7.5 to 30 fps, though the difference between 15 and 30 fps was not significant in the 2012 data (the only experiment that tested 30 fps). The most striking finding was an asymmetry in audio-video synchrony: when audio was perceived 100 ms ahead of video, speech understanding dropped significantly; when audio was perceived 100 ms behind video, understanding did not degrade compared to perfect synchrony. This asymmetry is explained by the natural 200 ms lead of mouth movements over speech sounds. At low frame rates, early audio eliminates the predictive value of visual cues, while slightly delayed audio preserves it. A validation step also revealed that playback hardware and software introduce unpredictable audio-video delays: Windows systems consistently played audio 100 ms early, a MacBook Air played audio 30 ms late, and Samsung phones played audio between 50 and 90 ms late. Encoded synchrony therefore does not guarantee perceived synchrony, which poses a serious threat to video accessibility for people with hearing loss.
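The gap between encoded and perceived synchrony can be illustrated with a little arithmetic. The following is a minimal sketch, not the authors' measurement procedure: it assumes that the offset encoded in the video file and the offset introduced at playback simply add, and it uses a sign convention (positive means audio ahead of video) chosen for this example. The device labels and function names are hypothetical; the offset values are those reported above.

```python
# Sign convention (an assumption of this sketch): positive = audio ahead of
# video, negative = audio behind video. All values are in milliseconds.

# Playback offsets as reported in the summary above, stored as (min, max).
# The dictionary keys are hypothetical labels, not identifiers from the paper.
PLAYBACK_AUDIO_LEAD_MS = {
    "windows": (100, 100),        # audio consistently 100 ms early
    "macbook_air": (-30, -30),    # audio 30 ms late
    "samsung_phone": (-90, -50),  # audio 50 to 90 ms late
}


def perceived_audio_lead_range(encoded_lead_ms, device):
    """Return (min, max) perceived audio lead, assuming offsets simply add."""
    low, high = PLAYBACK_AUDIO_LEAD_MS[device]
    return (encoded_lead_ms + low, encoded_lead_ms + high)


def audio_may_lead(encoded_lead_ms, device):
    """True if audio could be perceived ahead of video on this device."""
    _, high = perceived_audio_lead_range(encoded_lead_ms, device)
    return high > 0


if __name__ == "__main__":
    # A file encoded in perfect synchrony (encoded lead of 0 ms).
    for device in PLAYBACK_AUDIO_LEAD_MS:
        print(device, perceived_audio_lead_range(0, device), audio_may_lead(0, device))
```

Under these assumptions, a perfectly synchronized file played on the Windows systems lands in the audio-ahead case that the study found harmful, while the same file on the other devices lands in the audio-behind case.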

Relevance

This research has direct implications for video calling platforms, streaming services, and any technology that combines audio and video for communication. Because a timing difference as small as 100 ms can significantly impair speech understanding for people with hearing loss, developers of video telephony applications need to pay careful attention to audio-video synchronization, a technical detail often considered negligible for typical users. The practical recommendation is clear: if perfect synchrony cannot be guaranteed, it is better to delay audio slightly relative to video than the reverse. For accessibility practitioners, the study highlights that telecommunications accessibility for hard-of-hearing users extends well beyond captioning and hearing aid compatibility: video quality parameters such as frame rate and synchrony are accessibility concerns in their own right. The discovery that hardware and software introduce unpredictable timing shifts across platforms underscores the need for technical standards governing perceived audio-video synchrony in video communication tools.
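The recommendation to err on the side of late audio can be sketched as a small planning function. This is an illustrative sketch only, not a method from the paper: it assumes a playback pipeline whose audio lead is known only as a range, and it picks the smallest extra audio delay that keeps audio from ever arriving ahead of video. The names `SyncPlan` and `plan_audio_delay` are hypothetical. The 100 ms threshold is the only audio lag this summary reports as tested without degradation, so larger lags are flagged rather than assumed safe.

```python
from dataclasses import dataclass

# Largest audio lag (ms) that the summary reports as not degrading
# understanding. Larger lags are untested here, so they are flagged.
TESTED_LAG_MS = 100.0


@dataclass(frozen=True)
class SyncPlan:
    audio_delay_ms: float           # extra delay to add to the audio path
    worst_case_audio_lag_ms: float  # largest lag of audio behind video afterwards
    within_tested_lag: bool         # True if that lag is at most TESTED_LAG_MS


def plan_audio_delay(min_audio_lead_ms, max_audio_lead_ms):
    """Choose an audio delay so audio is never ahead of video.

    Inputs are the smallest and largest audio lead (positive = audio ahead
    of video, negative = audio behind) the playback pipeline may introduce.
    """
    if min_audio_lead_ms > max_audio_lead_ms:
        raise ValueError("min_audio_lead_ms must not exceed max_audio_lead_ms")
    # Delay audio by the largest possible lead; no delay if audio never leads.
    audio_delay_ms = max(0.0, float(max_audio_lead_ms))
    # After the delay, the most-delayed case lags by (delay - smallest lead).
    worst_lag_ms = max(0.0, audio_delay_ms - min_audio_lead_ms)
    return SyncPlan(audio_delay_ms, worst_lag_ms, worst_lag_ms <= TESTED_LAG_MS)


if __name__ == "__main__":
    print(plan_audio_delay(100, 100))  # pipeline that always plays audio 100 ms early
    print(plan_audio_delay(-30, 100))  # pipeline whose offset is uncertain
```

When the offset range is wide, as in the second call, the delay needed to rule out early audio pushes the worst-case lag past the tested 100 ms, which is one way to see why the variability across platforms matters as much as the average offset.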

Tags: hearing loss · hard of hearing · lipreading · speechreading · video telephony · telecommunications accessibility · audio-video synchronization · frame rate · cochlear implant

Standards referenced: 3GPP