Describing online videos with text-to-speech narration

Masatomo Kobayashi, Tohru Nagano, Kentarou Fukuda, Hironobu Takagi · 2010 · Proceedings of the 2010 International Cross Disciplinary Conference on Web Accessibility (W4A) · doi:10.1145/1805986.1806025

Summary

This paper from IBM Research Tokyo presents a technology platform that uses text-to-speech (TTS) synthesis to add audio descriptions (AD) to online videos at minimal cost. The system addresses the two main barriers that prevent most online video creators from providing audio descriptions: the expertise needed to write AD scripts that fit within gaps between dialogue, and the requirement for professional narrators and recording studios. By replacing human narration with synthesized speech, the platform reduces AD creation to a text authoring task that non-experts can perform. The architecture consists of three components: a script editor with a visual timeline interface for writing and timing AD sentences, a video player (built as a plugin for the aiBrowser accessible web browser) that synchronizes synthesized narration with video playback, and a metadata repository for storing and sharing AD scripts as external metadata. The script editor displays the original audio waveform to help authors find appropriate insertion points and shows expected TTS duration, supporting an iterative trial-and-error workflow. AD scripts can be synthesized either server-side (for devices with limited computing power) or client-side (when bandwidth is limited).
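The editor's duration check can be illustrated with a minimal sketch: estimate how long a sentence will take to speak and test whether it fits in a gap between dialogue. The names `Gap`, `estimate_tts_seconds`, and `fits_in_gap`, and the default words-per-minute rate, are assumptions made for illustration and are not taken from the paper. A real editor would presumably ask the TTS engine for the actual duration rather than rely on a word-count heuristic.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    """A silent interval in the original audio, in seconds (hypothetical type)."""
    start: float
    end: float

    @property
    def length(self) -> float:
        return self.end - self.start


def estimate_tts_seconds(text: str, words_per_minute: float = 160.0) -> float:
    """Rough duration estimate from word count; the rate is an assumed default."""
    words = len(text.split())
    return words * 60.0 / words_per_minute


def fits_in_gap(text: str, gap: Gap, words_per_minute: float = 160.0) -> bool:
    """True if the estimated narration is no longer than the gap."""
    return estimate_tts_seconds(text, words_per_minute) <= gap.length
```

When a sentence does not fit, an author could shorten it or raise the speech rate and check again, which mirrors the trial-and-error workflow described above.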

Key findings

User studies with blind and visually impaired participants produced encouraging results for synthesized audio description. High-quality TTS was found to be comparable to professional human narration in both intelligibility and user preference. On a five-point acceptability scale, synthesized AD scored 3.96 versus 4.21 for human narration on cartoon videos, and 3.63 versus 4.63 on drama videos, so the gap was wider for drama (1.00 point) than for cartoons (0.25 points). Critically, even low-quality TTS was still considered acceptable and greatly improved the experience compared to having no audio description at all. Participants explicitly reported wanting many more audio descriptions regardless of speech quality, which suggests that availability matters more than production polish. The authors note that commercial TTS engines can serve as viable alternatives to human narrators, while open-source engines like eSpeak, which support many languages, could extend audio description to populations in developing regions where professional narration services are unavailable.

Relevance

This research anticipated the modern shift toward AI-generated audio descriptions that is now accelerating with advances in speech synthesis and large language models. The core finding, that users strongly prefer any audio description over none even with imperfect synthesis, provides an important practical argument for organizations hesitant to invest in professional AD production. The external metadata approach, which allows third parties to create and share AD scripts without requiring action from content owners, offers a scalable model for addressing the vast backlog of undescribed online video. For practitioners today, the platform's design principles remain instructive: visual timeline editing, waveform display for identifying dialogue gaps, real-time TTS preview, and configurable voice parameters (gender, speed) are all useful features for AD authoring tools. The observation that open-source TTS can extend accessibility to underserved languages is particularly relevant as audio description requirements expand globally through legislation like the European Accessibility Act.
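As an illustration of configurable voice parameters, the sketch below wraps one AD sentence in SSML, one of the standards the paper references, using the standard `voice` (gender) and `prosody` (rate) elements. It is not the platform's actual script format. The function name `build_ssml` and its defaults are hypothetical, and individual TTS engines may support only a subset of SSML.

```python
import xml.etree.ElementTree as ET


def build_ssml(text: str, gender: str = "female", rate: str = "medium",
               lang: str = "en-US") -> str:
    """Return an SSML document for one AD sentence (illustrative helper)."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xml:lang": lang,
        "xmlns": "http://www.w3.org/2001/10/synthesis",
    })
    # <voice gender="..."> requests a voice; <prosody rate="..."> sets speed.
    voice = ET.SubElement(speak, "voice", {"gender": gender})
    prosody = ET.SubElement(voice, "prosody", {"rate": rate})
    prosody.text = text  # special characters are escaped on serialization
    return ET.tostring(speak, encoding="unicode")
```

Raising `rate` (for example to `"fast"`) is one way an author might fit a longer description into a short gap, at some cost to listening comfort.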

Tags: audio description · text-to-speech · video accessibility · speech synthesis · external metadata · visual impairment · assistive technology

Standards referenced: WCAG 2.0 · SSML · SMIL 3.0 · Timed Text