Annotation-based Video Enrichment for Blind People: A Pilot Study on the Use of Earcons and Speech Synthesis

Benoît Encelle, Magali Ollagnier-Beldame, Stéphanie Pouchot, Yannick Prié · 2011 · Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2011) · doi:10.1145/2049536.2049560

Summary

This paper presents exploratory work from the ACAV (Collaborative Annotation for Video Accessibility) project, investigating how combining earcons (nonverbal audio messages) with speech synthesis can improve video accessibility for blind people. Traditional audio description has limitations: it uses only the verbal modality, is expensive to produce (over 5,000 euros for a 90-minute film in France), and follows a one-size-fits-all approach. The ACAV system instead relies on video annotations that are rendered as enrichments during playback. Annotations are associated with temporal video fragments and can be rendered through visual, audio, or tactile modalities. The system separates annotation content from its rendering, so that different presentation models can be applied to the same annotations and end users can customise the result. The researchers developed an annotation schema called VisualBase with types for Actions, Sets (locations/scenes), and TextOnScreen, then created multiple presentation models combining earcons and speech synthesis in different ways. Two pilot studies were conducted with 21 legally blind volunteers (aged 23-72) using two short humorous videos, testing four experimental conditions that differed in how earcons and speech synthesis were combined to convey set changes.
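
The separation of annotation content from rendering described above can be illustrated with a minimal Python sketch. It is not the ACAV implementation: the class names, field names, the two presentation models, and the sample annotations are all hypothetical, and only the annotation type names (Action, Set, TextOnScreen) come from the summary. The sketch shows the same annotations being turned into different audio cues depending on which presentation model is applied.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    """Content attached to a temporal video fragment (hypothetical fields)."""
    kind: str      # "Action", "Set" or "TextOnScreen"
    start: float   # fragment start, in seconds
    end: float     # fragment end, in seconds
    text: str      # textual description of the fragment


@dataclass(frozen=True)
class Cue:
    """One rendered enrichment, scheduled at a playback time."""
    time: float
    modality: str  # "earcon" or "speech"
    payload: str   # earcon name, or text to synthesise


# A presentation model maps one annotation to zero or more cues.
PresentationModel = Callable[[Annotation], List[Cue]]


def speech_only(a: Annotation) -> List[Cue]:
    """Hypothetical model: every annotation is spoken."""
    return [Cue(a.start, "speech", a.text)]


def earcon_then_speech(a: Annotation) -> List[Cue]:
    """Hypothetical model: set changes get an earcon followed by speech."""
    if a.kind == "Set":
        return [Cue(a.start, "earcon", "set-change"),
                Cue(a.start, "speech", a.text)]
    return [Cue(a.start, "speech", a.text)]


def render(annotations: List[Annotation], model: PresentationModel) -> List[Cue]:
    """Apply a presentation model and order the cues by playback time."""
    cues = [cue for a in annotations for cue in model(a)]
    # sorted() is stable, so an earcon stays ahead of speech at the same time.
    return sorted(cues, key=lambda c: c.time)


# Illustrative data only; not taken from the study's videos.
demo_annotations = [
    Annotation("Set", 0.0, 10.0, "A kitchen"),
    Annotation("Action", 2.5, 4.0, "A man opens the door"),
    Annotation("Set", 10.0, 20.0, "A street"),
]
```

Calling `render(demo_annotations, speech_only)` and `render(demo_annotations, earcon_then_speech)` yields two different cue lists from identical annotation content, which is the property that makes per-user customisation possible.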

Key findings

The studies produced three main findings. First, earcons are readily perceptible by blind users: 85% of participants heard them in both videos, with the best results when a lexicon explaining the earcon meanings was presented before the video (condition S1). Second, earcons combined with speech synthesis enhance understanding of video content, particularly for conveying set/location changes. Participants achieved 69% and 60% correct answers on set-related story comprehension questions for the two videos respectively, compared to control participants who could only "reconstruct" about 50% of information. Simplified speech synthesis descriptions (concise rather than detailed) were preferred, with participants emphasising that conciseness should take priority over exhaustiveness. Third, a notable side effect emerged: earcons can distort the perception of video rhythm. Participants perceived the slower-paced video as fast when earcons marked set changes, because earcons draw attention to discontinuities. This illustrates what the researchers call the "Jaskanen paradox": enrichments must be perceived but must not perturb the viewing experience. Participants could handle up to 6 different earcons, and strongly valued having a spoken prologue/synopsis before the video began. The presentation model in which each unique earcon was accompanied by speech synthesis explaining its meaning (S4) showed the best results for perceiving the number of set changes.

Relevance

This research offers an alternative to traditional audio description that could make video accessibility more scalable and customisable. For accessibility practitioners, the key insight is that nonverbal audio cues (earcons) can effectively complement speech synthesis, reducing the verbal cognitive load that comes with describing everything through speech alone. The annotation-based approach, where content is separated from rendering, enables personalisation: users could adjust verbosity levels, earcon preferences, and presentation timing. This is particularly relevant as video content continues to dominate the web and the demand for accessible video far outstrips the supply of professional audio description. The collaborative annotation model, where community members can contribute descriptions that are then rendered through customisable presentation models, could democratise video accessibility production. The finding about rhythm perception is an important design consideration: audio enrichments must balance informativeness with non-intrusiveness, and automatic set-change detection should be applied carefully to avoid disrupting the viewing experience.

Tags: video accessibility · blindness · audio description · earcons · speech technology · multimedia accessibility · sonification

Standards referenced: WCAG 2.0