The Efficacy of Collaborative Authoring of Video Scene Descriptions

Rosiana Natalie, Jolene Loh, Huei Suen Tan, Joshua Tseng, Ian Luke Yi-Ren Chan, Ebrima H Jarjue, Hernisa Kacorri, Kotaro Hara · 2021 · ASSETS '21: The 23rd International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3441852.3471201

Summary

The vast majority of online video content remains inaccessible to people with visual impairments because it lacks audio descriptions — verbal commentaries that depict visual information in scenes. Professional audio description services cost US$12 to US$75 per video minute and take days to weeks for turnaround, making them impractical for the hundreds of hours of video uploaded to YouTube every minute. Natalie et al. designed and developed ViScene, a web-based collaborative tool that enables sighted novice authors to write scene descriptions (textual descriptions converted to audio via Amazon Polly TTS) with feedback from either sighted or blind reviewers. The tool features a video pane with closed caption and scene description bars that visualize succinctness (turning red when descriptions overlap with dialogue), and a feedback column where reviewers can comment on quality. The researchers developed a nine-code quality codebook grounded in professional audio description guidelines: Descriptive, Objective, Succinct, Learning, Sufficient, Accurate, Referable, Interest, and Clarity. A mixed-design study with 60 sighted novice participants (none with prior audio description experience) evaluated the quality of scene descriptions across three conditions — without feedback, with sighted reviewer feedback, and with blind reviewer feedback — using three different video types: an explainer video about web accessibility, an instructional origami tutorial, and a car advertisement. Both sighted and blind evaluators independently assessed the resulting 360 scene descriptions.
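The succinctness bar described above amounts to an interval-overlap check between a scene description's playback span and the dialogue spans. The following is a minimal sketch of that idea, not ViScene's actual implementation: the `(start, end)` second-based intervals, the function names, and the "neutral" color label are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a time span as (start_seconds, end_seconds).
Interval = Tuple[float, float]


def overlaps_dialogue(description: Interval, captions: List[Interval]) -> bool:
    """Return True if the description's span intersects any caption span."""
    d_start, d_end = description
    # Two spans intersect when each one starts before the other ends.
    return any(d_start < c_end and c_start < d_end for c_start, c_end in captions)


def bar_color(description: Interval, captions: List[Interval]) -> str:
    """Pick the bar color: red flags a description that talks over dialogue."""
    return "red" if overlaps_dialogue(description, captions) else "neutral"


# Illustrative data only: dialogue at 0-4 s and 9.5-14 s.
captions = [(0.0, 4.0), (9.5, 14.0)]
print(bar_color((4.5, 9.0), captions))   # fits in the gap between captions
print(bar_color((8.0, 10.0), captions))  # runs into the second caption
```

In a real tool the description's end time would depend on the length of the synthesized speech, which this sketch takes as given.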

Key findings

Novice authors using ViScene produced scene descriptions that were Descriptive, Objective, Referable, and Clear at a cost of US$2.81 to US$5.48 per video minute, 54% to 96% cheaper than professional services. Sighted reviewer feedback improved the most quality dimensions: descriptiveness, learning, referability, interest, clarity, and sufficiency all showed significant improvements. Blind reviewer feedback was particularly effective at improving objectiveness, a quality highly valued by blind users. This complementarity is a key finding: sighted and blind reviewers improve different quality dimensions, suggesting that mixed-ability collaboration produces the best results. However, both feedback conditions struggled with the Learning quality (how well descriptions convey the video's intended message), with low approval counts across all conditions. The blind evaluator was notably stricter than sighted evaluators on Interest (approving far fewer scene descriptions) and more generous on Descriptiveness and Clarity, revealing that sighted and blind people perceive audio description quality differently. A significant trade-off emerged between the Succinct and Descriptive qualities: when authors addressed feedback to be more descriptive, their descriptions often became longer and less succinct. Collaboratively authoring scene descriptions for a one-minute video took 50 to 56 minutes on average. While this makes ViScene unsuitable for long-form content, it is viable for the many short online videos that currently lack any audio descriptions.
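The 54% to 96% savings range can be reproduced from the per-minute costs quoted in this summary. The sketch below assumes the lower bound compares the most expensive ViScene case with the cheapest professional rate, and the upper bound compares the cheapest ViScene case with the most expensive professional rate. That pairing is an assumption that happens to match the reported figures, and the function name is illustrative.

```python
def savings_pct(novice_cost: float, professional_cost: float) -> float:
    """Percentage saved per video minute relative to the professional rate."""
    return (1 - novice_cost / professional_cost) * 100


# Costs in US$ per video minute, as quoted in the summary.
VISCENE_LOW, VISCENE_HIGH = 2.81, 5.48
PRO_LOW, PRO_HIGH = 12.0, 75.0

# Assumed pairing: worst case for ViScene vs. best case, see lead-in.
lower_bound = savings_pct(VISCENE_HIGH, PRO_LOW)
upper_bound = savings_pct(VISCENE_LOW, PRO_HIGH)

print(f"{lower_bound:.1f}% to {upper_bound:.1f}% cheaper")
```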

Relevance

This research addresses one of the most significant accessibility gaps on the web: the overwhelming majority of online videos lack audio descriptions, effectively excluding millions of people with visual impairments from a primary medium of information and entertainment. The practical implications are substantial. ViScene demonstrates that video accessibility does not have to be a choice between expensive professional services and no descriptions at all. For content creators, organizations, and platform designers, the finding that novices can produce adequate descriptions with collaborative feedback opens up scalable approaches such as community-driven audio description, much as crowdsourcing has expanded captioning. The quality codebook (Descriptive, Objective, Succinct, Learning, Sufficient, Accurate, Referable, Interest, Clarity) is itself a valuable contribution, providing a structured framework for training, evaluating, and improving audio descriptions in any context. The discovery that blind reviewers uniquely improve objectiveness highlights the importance of including disabled end-users in the creation process, not just as consumers but as co-creators of accessibility content. For WCAG compliance efforts, this research points toward practical, cost-effective paths for meeting video accessibility requirements at scale.

Tags: audio description · video accessibility · visual impairment · crowdsourcing · collaborative authoring · scene description · text-to-speech · mixed-ability collaboration

Standards referenced: WCAG 2.0 · CVAA · Section 504 · Section 508