Machine Generation of Audio Description for Blind and Visually Impaired People
Virgínia P. Campos, Tiago M. U. de Araújo, Guido L. de Souza Filho, Luiz M. G. Gonçalves · 2023 · ACM Transactions on Accessible Computing · doi:10.1145/3590955
Summary
This paper presents an extension to CineAD, a system for automatically generating audio descriptions (AD) for videos. The authors address a critical accessibility gap: most videos, films, and cultural programming lack audio descriptions, leaving blind and visually impaired (BVI) users unable to fully access visual media. The research combines information from video scripts with computer vision analysis to generate AD automatically, reducing dependency on expensive professional AD production. The system architecture includes several integrated components: a Controller that orchestrates the process, Gap Identification that finds dialogue pauses where AD can be inserted, a Script Analyzer that extracts action descriptions from screenplays, a Video Analyzer using YOLOv2 for object detection and GoogLeNet for scene classification, an AD Script Generator that combines textual and visual information, and a Speech Synthesizer using Amazon Polly or Espeak for Brazilian Portuguese output. The Video Analyzer processes frames to detect objects (achieving 22-30% correct classification per frame) and classify scenes, complementing the script-derived information. The research was conducted in Brazil with Brazilian Portuguese content, addressing a market where AD availability is particularly limited. The system was implemented as a prototype with both a user interface and web service API, allowing users and applications to generate AD tracks for their videos automatically.
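The Gap Identification step described above can be illustrated with a minimal sketch. It is not the CineAD implementation: it assumes dialogue timings are already available as (start, end) pairs in seconds (for example from subtitles), and the names `find_gaps`, `fits_in_gap`, the 2-second minimum gap, and the 2.5 words-per-second speech rate are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def find_gaps(dialogue: List[Interval], video_length: float,
              min_gap: float = 2.0) -> List[Interval]:
    """Return stretches without dialogue that last at least min_gap seconds.

    Assumption: dialogue intervals may overlap and arrive in any order.
    """
    gaps: List[Interval] = []
    cursor = 0.0  # end of the latest dialogue seen so far
    for start, end in sorted(dialogue):
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Trailing silence after the last line of dialogue
    if video_length - cursor >= min_gap:
        gaps.append((cursor, video_length))
    return gaps


def fits_in_gap(description: str, gap: Interval,
                words_per_second: float = 2.5) -> bool:
    """Rough check that a spoken description fits in a gap.

    Assumption: speech duration is estimated from word count alone.
    """
    estimated_duration = len(description.split()) / words_per_second
    return estimated_duration <= gap[1] - gap[0]


if __name__ == "__main__":
    dialogue = [(1.0, 4.0), (3.0, 5.0), (9.0, 12.0)]
    for gap in find_gaps(dialogue, video_length=20.0):
        print(gap, fits_in_gap("a man opens the door", gap))
```

A real system would take the speech duration from the synthesizer output rather than from a word count, and would likely shorten or drop descriptions that do not fit.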
Key findings
In a user evaluation with 11 blind participants in Brazil, videos with machine-generated AD (combining script and video analysis) achieved 71.8% correct answers on comprehension questions, compared with just 13.3% without any AD. This 58.47 percentage point improvement was statistically significant (Mann-Whitney U test, p < 0.001). Compared with script-only AD, the combined script-and-video AD was also more efficient: it used less video time for descriptions while reaching similar comprehension levels. By combining script information with detected objects and actions, the system generated more "objective and succinct" descriptions. In video-only scenarios (no script available), qualitative evaluation showed strong results: all 11 users correctly identified character counts and genders, 10/11 understood the overall story, and 8/11 found the AD easy to understand. However, understanding of spatial and temporal location was weaker (between 6 and 10 of the 11 users answered correctly), indicating areas for improvement. The technical analysis found that AD occupied 32-45% of available video time, with the system successfully identifying dialogue gaps and inserting contextually appropriate descriptions.
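The comparison between the AD and no-AD conditions rests on two simple computations: a difference in percentage points and a Mann-Whitney U statistic. The sketch below shows both in plain Python. The per-participant scores are invented placeholders, not data from the study, and the function names are hypothetical. Note that the rounded figures give 58.5 points; the reported 58.47 presumably reflects unrounded values, which is an assumption.

```python
from typing import Sequence


def percentage_point_gain(with_ad: float, without_ad: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(with_ad - without_ad, 2)


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """U statistic for sample a: the number of (x, y) pairs with x > y,
    counting ties as one half. Only the statistic is computed here;
    a p-value would come from tables or a statistics library."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


if __name__ == "__main__":
    # Rounded percentages as reported in the summary
    print(percentage_point_gain(71.8, 13.3))

    # PLACEHOLDER scores (correct answers per participant), not study data
    scores_with_ad = [3, 4, 5]
    scores_without_ad = [1, 2, 3]
    u_with = mann_whitney_u(scores_with_ad, scores_without_ad)
    u_without = mann_whitney_u(scores_without_ad, scores_with_ad)
    # The two statistics always sum to len(a) * len(b)
    print(u_with, u_without)
```

The Mann-Whitney test suits this setting because it compares ranks rather than means, so it does not require normally distributed scores, which matters with only 11 participants.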
Relevance
This research directly addresses a major accessibility barrier: the scarcity of audio-described content. In Brazil, as in many countries, most videos lack AD entirely, making machine generation a practical necessity rather than merely a convenience. The finding that automated AD significantly improves comprehension (from 13% to 72%) validates the approach even when the AD quality is imperfect. For accessibility practitioners, this work demonstrates that combining multiple data sources (scripts plus computer vision) produces more efficient AD without sacrificing comprehension. Organizations creating video content could use similar systems to provide baseline AD quickly, potentially as a foundation for professional refinement. The limitations are instructive: a per-frame object detection accuracy of 22-30% shows that the computer vision component still has gaps, and users reported wanting more detailed scene and character descriptions. Future implementations should balance automation efficiency with descriptive richness. The methodology, which compares against no-AD baselines rather than only professional AD, reflects the real-world situation most BVI users face.
Tags: audio description · blind and visually impaired · computer vision · machine learning · video accessibility · text-to-speech · object detection · automation