Sonic Stage: Automatically Generating an Interactive Spatial Soundscape to Facilitate Dialogue Video Comprehension for Blind and Low Vision Viewers

Shuchang Xu, Xiaofu Jin, Gaurav Jain, Wenshuo Zhang, Huamin Qu, Brian A. Smith, Yukang Yan · 2026 · Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26) · doi:10.1145/3772363.3798425

Summary

Xu and colleagues (HKUST, Columbia, Aalto, Rochester) tackle a well-known but largely unsolved problem in video accessibility: standard audio description (AD) must not overlap with dialogue, so dialogue-heavy scenes in films and TV, where characters' actions, movements, and facial expressions are often narratively crucial, are exactly the scenes where blind and low-vision (BLV) viewers get the least visual information. The paper's answer is to stop squeezing more speech into shrinking gaps and instead convey visual information through a non-verbal, spatialized soundscape that plays *during* dialogue. A formative study with eight BLV viewers (co-watching three dialogue clips pre-augmented with spatialized audio, followed by semi-structured interviews and reflexive thematic analysis) surfaced three information needs, each with its own presentation preference: spatial layout (wanted as scene-anchored spatial audio that stays consistent across camera cuts), character actions (wanted as diegetic sound effects that are recognizable as accessibility cues and cannot be confused with the original soundtrack), and visual details such as facial expressions (wanted on demand through concise descriptions).

Sonic Stage operationalizes all three. Its automated pipeline uses the Visual Geometry Grounded Transformer (VGGT) to reconstruct the 3D scene and character trajectories from sampled keyframes, then projects each speaker's dialogue onto an optimized 3D soundscape whose x-axis is aligned with the direction of greatest inter-speaker variance (maximizing left-right separability) and whose listener origin sits at the geometric center of the characters. A text-to-sound module generates diegetic effects for detected actions, and a multimodal LLM produces interactive descriptions that a user can trigger with a single tap. The whole experience is delivered over headphones and a phone.
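The soundscape-axis step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say how the axis of greatest inter-speaker variance is computed, so the sketch assumes a principal-axis (PCA-style) fit on character positions projected onto the ground plane. The function name `optimize_soundscape`, the `(x, z)` input format, and the normalized `pan` value are all hypothetical.

```python
import math


def optimize_soundscape(positions):
    """Re-express character positions in a listener-centred soundscape frame.

    positions: dict mapping character name -> (x, z) ground-plane coordinates.
    (Hypothetical input; a real pipeline would take these from 3D reconstruction.)

    Returns a dict mapping name -> {"lateral", "depth", "pan"}, where "lateral"
    lies along the direction of greatest variance between characters and "pan"
    is "lateral" scaled into [-1, 1].
    """
    if not positions:
        raise ValueError("need at least one character position")

    n = len(positions)
    # Listener origin: geometric centre of the characters.
    cx = sum(x for x, _ in positions.values()) / n
    cz = sum(z for _, z in positions.values()) / n
    centred = {k: (x - cx, z - cz) for k, (x, z) in positions.items()}

    # Entries of the 2x2 covariance matrix of the centred positions.
    sxx = sum(x * x for x, _ in centred.values()) / n
    szz = sum(z * z for _, z in centred.values()) / n
    sxz = sum(x * z for x, z in centred.values()) / n

    # Closed-form angle of the major principal axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxz, sxx - szz)
    c, s = math.cos(theta), math.sin(theta)

    # Rotate so the principal axis becomes the new x ("lateral") axis.
    # Assumption: the sign of the axis (which end is "left") is arbitrary here;
    # a real system would have to fix it, e.g. relative to the camera.
    rotated = {k: (c * x + s * z, -s * x + c * z) for k, (x, z) in centred.items()}

    max_abs = max(abs(x) for x, _ in rotated.values())
    result = {}
    for name, (lat, depth) in rotated.items():
        pan = lat / max_abs if max_abs > 0 else 0.0
        result[name] = {"lateral": lat, "depth": depth, "pan": pan}
    return result


# Two speakers standing on a diagonal end up at opposite ends of the pan range.
layout = optimize_soundscape({"A": (0.0, 0.0), "B": (2.0, 2.0)})
for name, info in layout.items():
    print(name, round(info["pan"], 2), round(info["depth"], 2))
```

Because the frame is computed from reconstructed scene positions rather than from where characters appear on screen, it stays fixed when the camera cuts, which is the property the formative study participants asked for.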

Key findings

This is primarily a system-and-formative-study paper; a full user evaluation is ongoing. The formative study's empirical contribution is the three-part taxonomy of dialogue-video information needs (spatial layout, character actions, visual details), each paired with a presentation preference, which is directly actionable for designers of accessible video tools. Technically, the paper demonstrates that fully automated, 3D-scene-anchored spatial audio is feasible for off-the-shelf dialogue video: VGGT-based reconstruction plus an optimized soundscape axis preserves spatial continuity across camera cuts, something prior screen-space spatial-audio approaches fail to do. Preliminary results from a within-subjects comparison against SPICA (a state-of-the-art accessible video-exploration baseline) suggest that Sonic Stage helps BLV viewers intuitively understand characters' actions, movements, and visual details, and that videos feel more emotionally engaging. The authors observed that participants tended to trigger interactive descriptions in response to moving audio cues or salient sound effects, an interaction pattern worth replicating. Participants suggested extending the approach to documentaries, musical theater, and dance. Full quantitative results on comprehension, spatial presence, and narrative engagement are forthcoming.

Relevance

For practitioners working on film, TV, or streaming accessibility, Sonic Stage offers a concrete alternative to the 'more AD, faster narration' treadmill: use non-verbal spatial audio and diegetic sound effects to carry the spatial and action information that narration cannot fit between lines of dialogue, and reserve interactive descriptions for on-demand detail. The three-part information-need taxonomy (layout / action / detail) with matched modalities is a clean design framework that could be applied beyond this prototype, for example to sports, live theater, or video chat. The 3D-scene-space spatialization approach is a notable technical step, because many earlier accessible spatial-audio efforts produced disorienting jumps on camera cuts. Caveats: the core evaluation is still pending as of this extended abstract, the formative study is small (n=8), and the approach depends on modern ML infrastructure (VGGT, text-to-sound, multimodal LLMs) that may not yet be deployable at streaming-platform scale. The open questions the authors raise (cognitive load from simultaneous audio layers, smooth transitions across scenes, and integration with haptics) point to sensible next steps.

Tags: video accessibility · audio description · blind and low vision · spatial audio · sound design · diegetic sound · multimodal large language models