Towards collaborative annotation for video accessibility

Pierre-Antoine Champin, Benoît Encelle, Nicholas W. D. Evans, Magali O.-Beldame, Yannick Prié, Raphaël Troncy · 2010 · Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/1805986.1806010

Summary

This paper presents the ACAV (Collaborative Annotation for Video Accessibility) project, a French research initiative involving Dailymotion, the University of Lyon (LIRIS), and EURECOM, aimed at making web video accessible to blind and deaf users through rich, collaborative annotations. The project's approach differs from traditional video accessibility by separating annotations from their rendering modality: the same annotation can be displayed as a subtitle, sent to a Braille device, or read by speech synthesis, depending on the user's disability and preferences. The architecture combines automatic speech processing (speech-to-text transcription and speaker diarization) with human collaborative annotation, in which community members manually correct automated transcriptions and add descriptions of visual content. A preliminary study with blind users informed the design by revealing that existing audio descriptions are sometimes too verbose and that users need personalized control over the level of detail. The system builds on several W3C standards that were emerging at the time, including Media Fragments URI for addressing temporal and spatial sub-parts of a video and HTML5 video for standards-based rendering.
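To make the Media Fragments URI idea concrete, here is a minimal sketch of how an annotation's target could be addressed with the `t=` (temporal) and `xywh=` (spatial, in pixels) fragment dimensions defined by the W3C specification. The helper name `media_fragment`, its parameters, and the example URL are assumptions for illustration; they are not taken from the ACAV project.

```python
from urllib.parse import urldefrag


def _seconds(value):
    """Format seconds without trailing zeros (10 -> '10', 12.5 -> '12.5')."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def media_fragment(base_url, start=None, end=None, xywh=None):
    """Build a Media Fragments URI (hypothetical helper, not ACAV code).

    start/end are seconds on the media timeline; xywh is an optional
    (x, y, width, height) pixel rectangle within the video frame.
    """
    parts = []
    if start is not None or end is not None:
        begin = "" if start is None else _seconds(start)
        finish = "" if end is None else "," + _seconds(end)
        parts.append(f"t={begin}{finish}")
    if xywh is not None:
        x, y, w, h = xywh
        parts.append(f"xywh={x},{y},{w},{h}")
    # Drop any fragment already present on the base URL before adding ours.
    base, _ = urldefrag(base_url)
    return base + ("#" + "&".join(parts) if parts else "")


# Seconds 10 to 20 of the video, restricted to a 320x240 region.
uri = media_fragment("http://example.com/video.mp4", 10, 20, (160, 120, 320, 240))
print(uri)
```

An annotation that points at such a URI stays independent of how the fragment is later rendered, which is the separation the project relies on.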

Key findings

The preliminary study with blind participants produced several requirements for accessible video. Blind users frequently watch programs without audio descriptions and rely on nearby sighted people for ad hoc explanations, an approach that only works when someone is available and that can disturb others. Existing audio descriptions were found to be sometimes too verbose. Participants wanted control over both the type of information described (character information, actions, places, time/periods, and visual scenes, ranked in that order of importance) and the level of verbosity (minimal, normal, or complete). The study confirmed that multimodal presentation, combining speech synthesis with a Braille display, was greatly appreciated, but the length of Braille descriptions must be matched to individual reading speeds. The project's metadata model separates annotations, schemas (categorization and structure), and views (rendering specifications), so that the same content description can be rendered differently for different users. The collaborative model envisions parents, teachers, and community members contributing annotations that build upon each other: for example, one person adds captions for deaf users while another layers visual descriptions for blind users on the same video.
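The separation of annotations from views, together with the verbosity and Braille reading-speed requirements, can be sketched as follows. This is an illustrative model only: the class names, field names, category labels, and the truncation policy for Braille are assumptions, not the ACAV metadata schema.

```python
from dataclasses import dataclass

# The three verbosity levels reported in the study, ordered from least to most detail.
VERBOSITY = {"minimal": 0, "normal": 1, "complete": 2}


@dataclass
class Annotation:
    start: float      # seconds
    end: float        # seconds
    category: str     # assumed labels: "character", "action", "place", "time", "scene"
    verbosity: str    # lowest level at which this annotation should appear
    text: str


@dataclass
class View:
    modality: str                          # "subtitle", "speech", or "braille"
    categories: frozenset                  # information types the user wants
    verbosity: str = "normal"
    braille_chars_per_second: float = 0.0  # user's reading speed (braille only)


def select(annotations, view):
    """Keep annotations matching the view's categories and verbosity level."""
    limit = VERBOSITY[view.verbosity]
    return [
        a for a in annotations
        if a.category in view.categories and VERBOSITY[a.verbosity] <= limit
    ]


def render(annotation, view):
    """Return (modality, text) for one annotation under one view."""
    text = annotation.text
    if view.modality == "braille":
        # Assumed policy: cut the text to what the user can read in the time slot.
        duration = annotation.end - annotation.start
        text = text[: int(duration * view.braille_chars_per_second)]
    return (view.modality, text)


annotations = [
    Annotation(0, 4, "character", "minimal", "Anna enters."),
    Annotation(4, 8, "scene", "complete", "A rainy street at dusk."),
]
speech = View("speech", frozenset({"character", "scene"}), "complete")
braille = View("braille", frozenset({"character", "scene"}), "minimal", 2.0)

for view in (speech, braille):
    print([render(a, view) for a in select(annotations, view)])
```

The same two annotations yield different output per view, without the annotator having to know which modality the reader uses. A real system would likely shorten Braille text more gracefully than by truncation, for instance by falling back to a lower verbosity level.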

Relevance

This research anticipated many of the challenges and approaches that now define video accessibility on the web. The core idea of modality-independent annotations — where the same metadata can be rendered as captions, audio, or Braille depending on user needs — remains an aspirational model for how video accessibility should work. The finding that users want granular control over description verbosity and type challenges the one-size-fits-all approach still common in audio description practice. For practitioners today, the collaborative annotation model offers a scalable alternative to professional-only captioning and description services, particularly relevant given the explosion of user-generated video content. The project's integration of automatic speech recognition with human correction foreshadowed the hybrid AI-human workflows now common in captioning platforms. The emphasis on Braille display integration for web video is notable as this modality remains underserved in most video accessibility implementations.

Tags: video accessibility · audio description · captioning · crowdsourcing · speech recognition · Braille · multimodal interaction · semantic web · collaborative annotation

Standards referenced: WCAG 2.0 · W3C Media Fragments URI · HTML5