Making Accessible Movies Easily: An Intelligent Tool for Authoring and Integrating Audio Descriptions to Movies

Ming Shen, Gang Huang, Yuxuan Wu, Shuyi Song, Sheng Zhou, Liangcheng Li, Zhi Yu, Wei Wang, Jiajun Bu · 2024 · Proceedings of the 21st International Web for All Conference (W4A) · doi:10.1145/3677846.3677855

Summary

This paper introduces EasyAD, an intelligent tool that automates the process of authoring and integrating audio descriptions (AD) into movies for blind and visually impaired (BVI) users. The traditional AD production workflow is highly labor-intensive: authors must review the entire film, identify speech gaps where descriptions can be inserted, write AD scripts that fit within those gaps, record voiceovers, and edit them into the movie. EasyAD streamlines this pipeline through five automated steps: pinpointing subtitle positions using frame selection, recognizing subtitles via PaddleOCR (based on CNNs and RNNs), detecting speech gaps by analyzing subtitle timing, generating AD scripts using a multimodal large language model, and converting scripts to speech using PaddleSpeech for integration via FFmpeg. A key innovation is how EasyAD handles speech gap detection. Rather than relying on speech-to-text transcription, which often misidentifies background music as speech, EasyAD runs character recognition on on-screen subtitles, an approach that is particularly effective for Chinese movies, where subtitles are standard. For AD generation, EasyAD is the first tool to incorporate a multimodal large language model (Video-Chat) into the production pipeline, combining visual scene understanding with contextual dialogue information from subtitles to produce richer descriptions. The tool was built using PyQt5 for its interactive interface and has been operational at the China Braille Library for three months.
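The subtitle-timing step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Subtitle` record, the `find_speech_gaps` function, and the `min_gap` threshold are hypothetical names chosen for illustration, and the sketch assumes the OCR stage has already produced subtitle text with start and end times in seconds. Any stretch of the timeline with no subtitle on screen for at least `min_gap` seconds is treated as a candidate slot for an audio description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Subtitle:
    """One recognized subtitle (hypothetical record, times in seconds)."""
    start: float
    end: float
    text: str


def find_speech_gaps(
    subtitles: List[Subtitle],
    movie_duration: float,
    min_gap: float = 2.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) intervals with no subtitle on screen.

    Only intervals lasting at least `min_gap` seconds are kept, since
    shorter ones leave no room for a spoken description.
    """
    gaps = []
    cursor = 0.0  # end of the latest subtitle seen so far
    for sub in sorted(subtitles, key=lambda s: s.start):
        if sub.start - cursor >= min_gap:
            gaps.append((cursor, sub.start))
        # max() handles overlapping or nested subtitles
        cursor = max(cursor, sub.end)
    if movie_duration - cursor >= min_gap:
        gaps.append((cursor, movie_duration))
    return gaps


if __name__ == "__main__":
    subs = [
        Subtitle(3.0, 5.0, "first line"),
        Subtitle(5.5, 8.0, "second line"),
        Subtitle(12.0, 14.0, "third line"),
    ]
    for start, end in find_speech_gaps(subs, movie_duration=20.0):
        print(f"gap from {start:.1f}s to {end:.1f}s")
```

In this sketch the half-second pause between the first two subtitles is skipped because it falls below the threshold. A production tool would presumably also need to account for OCR errors and subtitles that linger after speech ends, which the sketch ignores.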

Key findings

A user study with six experienced AD authors from the China Braille Library (each with 3+ years of experience) demonstrated that EasyAD reduced total processing time for a medium-difficulty movie by nearly 50%, from 32 hours to 16.3 hours. The most dramatic improvements came in speech gap detection (85% reduction, from 5 hours to 0.7 hours) and dubbing/integration (90% reduction, from 7 hours to 0.6 hours). AD authoring time decreased by 25%, from 20 hours to 15 hours. Participants reported that the AI-generated descriptions accurately captured important visual content and served as a strong starting point, though they noted that automatically generated AD may lack fine details such as facial expressions. Authors appreciated that EasyAD gave them the option to publish directly without modifications when time is limited, enabling an accessible movie release within 3 hours. The tool addresses a significant gap in China, where the China Braille Library currently offers only 220 accessible movies despite nearly 20 million BVI individuals in the country, partly because existing tools lack Chinese language support and require expensive software such as Premiere Pro or CapCut.
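The reported reductions can be checked from the per-stage hours given above. The short sketch below only redoes that arithmetic; the dictionary keys and the `reduction_pct` helper are illustrative names, not terms from the paper.

```python
# Hours per stage for one medium-difficulty movie, as reported in the summary.
baseline = {
    "speech gap detection": 5.0,
    "AD authoring": 20.0,
    "dubbing and integration": 7.0,
}
with_easyad = {
    "speech gap detection": 0.7,
    "AD authoring": 15.0,
    "dubbing and integration": 0.6,
}


def reduction_pct(before: float, after: float) -> float:
    """Percentage of time saved relative to the baseline."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for stage, before in baseline.items():
        after = with_easyad[stage]
        print(f"{stage}: {before} h to {after} h, {reduction_pct(before, after):.1f}% saved")
    total_before = sum(baseline.values())
    total_after = sum(with_easyad.values())
    print(f"total: {total_before} h to {total_after:.1f} h, "
          f"{reduction_pct(total_before, total_after):.1f}% saved")
```

The stage times sum to the stated totals of 32 and 16.3 hours, and the computed savings (about 86%, 25%, 91%, and 49%) agree with the rounded figures quoted in the study.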

Relevance

This research demonstrates the practical impact of applying AI, particularly multimodal large language models, to accessibility content production at scale. The nearly 50% reduction in production time has real implications for the availability of accessible media, especially in regions where AD production infrastructure is limited. For accessibility practitioners, EasyAD illustrates a human-in-the-loop approach in which AI handles the tedious mechanical tasks (gap detection, dubbing, integration) while authors retain creative control over the descriptive content. The subtitle-based approach to speech gap detection sidesteps a common pitfall of audio analysis, namely background music being mistaken for speech. However, the tool currently depends on on-screen subtitles, limiting its applicability to films without them. Future work aims to add speech transcription for unsubtitled content and multilingual support. The Web Content Accessibility Guidelines (WCAG) require audio description for prerecorded video at Level AA (Success Criterion 1.2.5), making tools that reduce production barriers directly relevant to WCAG compliance efforts.

Tags: audio description · blind and low vision · media accessibility · multimodal AI · speech synthesis · video accessibility · content creation

Standards referenced: WCAG