ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos

Maryam S Cheema, Sina Elahimanesh, Pooyan Fazli, Hasti Seifi · 2026 · Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26) · doi:10.1145/3772363.3798744

Summary

Cheema and colleagues (Arizona State University and Saarland University) present ViDscribe, a web platform that layers AI-generated audio description (AD) and conversational visual question answering (VQA) on top of arbitrary YouTube videos for blind and low vision (BLV) viewers. The paper's core argument is that prior AI-driven AD research has defaulted to one-size-fits-all descriptions evaluated in single-session lab studies, and has largely ignored two things BLV users need: on-demand customization of what and how much gets described, and the ability to ask follow-up questions when the narration leaves gaps. ViDscribe exposes six customization controls - frequency (every 8, 15, or 30 seconds), description length (15-100 words), emphasis (general, character, instructional, environment), subjectivity (objective vs interpretive), color inclusion, and a free-form instruction textbox - along with a VQA button that sends the user's typed or spoken question plus the current timestamp and frames to Gemini 3 Pro for a context-aware answer. AD timing is decoupled from content generation: a separate module finds insertion points by intersecting silence, no-speech, and scene-change signals, then Gemini 3 Pro writes descriptions constrained by the user's settings and 42 AD guidelines drawn from prior work. The frontend is built in React with an AWS Lambda backend, and the interface is screen-reader compatible and keyboard-navigable. The evaluation is a longitudinal, in-the-wild study: eight BLV participants used ViDscribe for at least 10 minutes of video per day over five days, with end-of-day micro-surveys and an end-of-week System Usability Scale (SUS) and feedback survey.
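
The timing step described above (intersecting silence, no-speech, and scene-change signals) can be illustrated with a small interval-intersection sketch. This is not the authors' implementation: it assumes each detector emits sorted, non-overlapping (start, end) windows in seconds, that scene changes are represented as short windows around each cut, and that a gap must last at least `min_gap` seconds to be usable. The function names, the `min_gap` parameter, and the sample values are all hypothetical.

```python
from functools import reduce

Interval = tuple[float, float]  # (start_s, end_s); assumed representation


def intersect(a: list[Interval], b: list[Interval]) -> list[Interval]:
    """Intersect two sorted, non-overlapping interval lists."""
    out: list[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # the two current intervals overlap
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def insertion_points(signals: list[list[Interval]], min_gap: float = 1.0) -> list[Interval]:
    """Keep windows present in every signal that last at least min_gap seconds."""
    common = reduce(intersect, signals)
    return [(s, e) for s, e in common if e - s >= min_gap]


# Made-up detector outputs for illustration only.
silence = [(0.0, 4.0), (10.0, 15.0), (20.0, 21.0)]
no_speech = [(2.0, 12.0), (19.5, 30.0)]
scene_change = [(3.0, 5.0), (11.0, 14.0), (20.0, 20.5)]

slots = insertion_points([silence, no_speech, scene_change])
print(slots)  # [(3.0, 4.0), (11.0, 12.0)]
```

Because the slots are computed without reference to any description text, the content generator can be swapped or re-prompted with new user settings without re-running the timing analysis, which is the practical benefit of the decoupling.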

Key findings

Participants watched 81 videos (average 3:37), applied customizations to 63% of them, and asked 66 VQA questions in total. Across daily ratings, customized ADs outscored the default on all three experience dimensions: effectiveness (M=4.32 vs 4.00), enjoyment (3.97 vs 3.45, largest gap), and immersion (4.06 vs 3.72); no significance tests were run given n=8 and unequal observations. Emphasis was the most-used customization, with 'general' dominant overall but 'character' spiking for Sports (3/5) and Film & Animation (4/14) videos and 'instructional' dominant for How-To (6/11). The free-form guideline box was used in 23.5% of videos, typically to request character names or appearance detail. VQA questions clustered around describing characters and scenes and identifying colors, features, and presence; inferential audio-visual questions (e.g. who said a line) were the most common failure mode because the implementation only sampled nearby frames. The overall SUS score was 70.6 (above the 68 web-interface benchmark). The most striking longitudinal pattern: preferred description length dropped from 47.7 words on day one to 33.3 by the end of the week, and participants shifted toward longer intervals between descriptions - users learned they could extract enough from shorter, less-frequent AD and wanted less 'AI fluff.' Emphasis and subjectivity choices, by contrast, were stable personal preferences. Trust in AI descriptions was high (6 of 8 would recommend), which the authors flag as a risk given known MLLM hallucination issues.
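
As a quick arithmetic check on the figures above, the short sketch below recomputes the customized-vs-default gaps and the drop in preferred description length. It uses only the means reported in this summary; the variable names are illustrative, and the differences are descriptive only, since no significance tests were run.

```python
# Reported daily-rating means: (customized, default)
ratings = {
    "effectiveness": (4.32, 4.00),
    "enjoyment": (3.97, 3.45),
    "immersion": (4.06, 3.72),
}

# Gap per dimension, rounded to the precision of the reported means.
gaps = {dim: round(custom - default, 2) for dim, (custom, default) in ratings.items()}
largest = max(gaps, key=gaps.get)

# Preferred description length on day one vs end of week (words).
length_drop_words = round(47.7 - 33.3, 1)
length_drop_pct = round(100 * (47.7 - 33.3) / 47.7, 1)

print(gaps)               # {'effectiveness': 0.32, 'enjoyment': 0.52, 'immersion': 0.34}
print(largest)            # enjoyment
print(length_drop_words)  # 14.4
print(length_drop_pct)    # 30.2
```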

Relevance

For accessibility practitioners and tool builders, the paper is a concrete argument and working example that AI-generated audio description should ship with user controls, not as a fixed output. The six-axis customization taxonomy (frequency, length, emphasis, subjectivity, color, free-form) is directly reusable as a spec for other video-accessibility products, and the decoupling of AD timing from content generation is a clean architectural pattern. Two findings matter for practice: genre-conditioned emphasis (instructional for how-to, character for sports and film) suggests default customization profiles by genre would be a low-effort win, and the week-over-week shift toward shorter descriptions implies that 'safe' verbose AI ADs actively degrade the experience once users gain familiarity. Caveats: n=8, one week, all participants blind rather than low-vision, all prior AI-description users, and the high trust ratings raise the hallucination-risk concern the authors themselves flag - a longer study or objective accuracy audit would strengthen the trust claims considerably.
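
To make the "reusable as a spec" point concrete, here is a minimal sketch of the six controls as a validated settings object with genre-based default emphasis. The allowed values (8/15/30 seconds, 15-100 words, four emphasis options) come from the paper as summarized above; everything else is an assumption for illustration: the class and field names, the default values, and the genre mapping, which only loosely follows the reported usage pattern and is not a profile the authors propose.

```python
from dataclasses import dataclass, replace

FREQUENCIES_S = (8, 15, 30)
EMPHASES = ("general", "character", "instructional", "environment")


@dataclass(frozen=True)
class ADSettings:
    """Hypothetical container for the six customization axes."""
    frequency_s: int = 15        # assumed default
    max_words: int = 50          # assumed default
    emphasis: str = "general"
    interpretive: bool = False   # False = objective descriptions
    include_color: bool = True
    instruction: str = ""        # free-form user guidance

    def __post_init__(self) -> None:
        if self.frequency_s not in FREQUENCIES_S:
            raise ValueError(f"frequency_s must be one of {FREQUENCIES_S}")
        if not 15 <= self.max_words <= 100:
            raise ValueError("max_words must be between 15 and 100")
        if self.emphasis not in EMPHASES:
            raise ValueError(f"emphasis must be one of {EMPHASES}")


# Illustrative mapping only, based on which emphasis was chosen most
# often for these genres in the study.
GENRE_EMPHASIS = {"How-To": "instructional", "Sports": "character"}


def defaults_for(genre: str, base: ADSettings = ADSettings()) -> ADSettings:
    """Return base settings with emphasis adjusted for the video's genre."""
    return replace(base, emphasis=GENRE_EMPHASIS.get(genre, base.emphasis))


print(defaults_for("How-To").emphasis)  # instructional
print(defaults_for("Music").emphasis)   # general
```

Keeping the profile as a starting point that the user can still override matches the finding that emphasis and subjectivity are stable personal preferences rather than purely genre-driven choices.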

Tags: video accessibility · audio description · blind and low vision · multimodal large language models · visual question answering · longitudinal study · personalization