CapTune: Adapting Non-Speech Captions With Anchored Generative Models

Jeremy Zhengqi Huang, Caluã De Lacerda Pataca, Saelyne Yang Wu, Dhruv Jain · 2025 · ASSETS 2025: 27th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663547.3746346

Summary

CapTune is a system that lets Deaf and Hard of Hearing (DHH) viewers customize non-speech captions: descriptions of environmental sounds, music, and other audio cues. Current captioning practice follows a one-size-fits-all model based on standardized guidelines such as the DCMP Captioning Key, and that model does not account for the diverse preferences and needs of DHH audiences. CapTune addresses this with an "anchored generative model" approach in which caption creators define safe transformation boundaries through concrete examples (anchors) across four dimensions: level of detail (concise to elaborate), expressiveness (neutral to evocative), sound representation method (descriptive text, onomatopoeia, or a focus on sensory qualities), and genre alignment (horror, comedy, drama). Viewers then use slider controls to interpolate between these creator-defined anchors, and GPT-4o-mini generates the customized captions within the bounded transformation space. The system includes a Creator Tool for authoring anchors, with edit-and-lock controls for fine-grained oversight, and a Viewer Client with a 10×10 grid interface and a chat feature for exploring caption variations. The research involved two evaluation studies: Study 1 with 7 video creators who authored anchors for horror and comedy clips, and Study 2 with 12 DHH participants who customized captions using the Viewer Client. The methodology combined think-aloud protocols, semi-structured interviews, and thematic analysis, with interrater reliability measured via Cohen's kappa (0.74 on average).
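To make the anchor-and-slider idea concrete, here is a minimal sketch of how a viewer's slider value could be mapped onto creator-defined anchors and turned into a bounded prompt. This is not the paper's implementation. The `Anchor` class, the `bracket` and `build_prompt` functions, the 0-to-1 slider scale, the prompt wording, and the example captions are all assumptions made for illustration, and the sketch stops at building the prompt text instead of calling a model.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """Hypothetical creator-authored example for one dimension."""
    position: float  # slider position in [0, 1] where this example applies
    caption: str     # creator-approved caption at that position


def bracket(anchors: list[Anchor], value: float) -> tuple[Anchor, Anchor, float]:
    """Return the two anchors surrounding `value` and the weight toward the upper one."""
    ordered = sorted(anchors, key=lambda a: a.position)
    if len(ordered) < 2:
        raise ValueError("need at least two anchors to define a range")
    # Clamp so a viewer can never leave the creator-defined range.
    value = min(max(value, ordered[0].position), ordered[-1].position)
    positions = [a.position for a in ordered]
    hi = min(bisect_right(positions, value), len(ordered) - 1)
    lo = hi - 1
    span = ordered[hi].position - ordered[lo].position
    weight = (value - ordered[lo].position) / span if span else 0.0
    return ordered[lo], ordered[hi], weight


def build_prompt(dimension: str, anchors: list[Anchor], value: float) -> str:
    """Assemble an illustrative prompt that keeps generation between two anchors."""
    lo, hi, weight = bracket(anchors, value)
    return (
        f"Rewrite a non-speech caption along the '{dimension}' dimension.\n"
        f"Lower anchor: {lo.caption}\n"
        f"Upper anchor: {hi.caption}\n"
        f"Target: {weight:.0%} of the way from the lower to the upper anchor.\n"
        "Stay within the range the two anchors define."
    )


# Invented example captions for the "level of detail" dimension.
detail = [
    Anchor(0.0, "[door creaks]"),
    Anchor(1.0, "[old wooden door creaks open slowly on rusted hinges]"),
]
print(build_prompt("level of detail", detail, 0.25))
```

The clamping step is the part that reflects the "anchored" idea: whatever value the viewer picks, the request passed to the model stays inside the range the creator authored.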

Key findings

Creators quickly grasped the anchor-based transformation model and valued retaining creative control while leveraging AI automation, describing the workflow as balancing "accessibility and creative control." They were particularly attentive to semantic accuracy, flagging AI outputs that might "flatten" emotional nuance or mislead viewers about sound characteristics. DHH viewers responded positively overall, with most (9 of 12) reporting that customization deepened their emotional connection to content. However, viewers identified several tensions: seven found parameter tuning to be trial-and-error, six found the 10×10 grid overly dense, and participants struggled to distinguish the effects of subtle parameter changes. A key finding was that caption preferences are highly context-dependent, varying by content type (documentaries vs. movies), scene pacing (action vs. slow scenes), and individual factors such as hearing history, linguistic background, and viewing purpose. Seven of the 12 DHH participants expressed concern that AI-generated captions could be overly interpretive, potentially undermining their ability to form an independent understanding of the content. As improvements, participants proposed user profiles (9 participants), preview and comparison features (7 participants), and context-aware controls. The paper identifies five design directions for future captioning systems: context-aware granular adaptations, preference retention through user modeling, explainable transformations, cultural and linguistic adaptation, and semantically aligned user-controlled representations.

Relevance

This paper makes a significant contribution to media accessibility by reframing non-speech captions as dynamic, co-authored experiences rather than static text imposed by creators alone. For accessibility practitioners, it highlights that DHH audiences are not monolithic: caption preferences vary dramatically with hearing history, cultural identity, language background, and viewing context. The anchored generative model approach offers a practical framework for balancing standardization with personalization, relevant to any organization producing captioned content. The findings about interpretive agency (viewers wanting captions that inform without imposing meaning) have broad implications for AI-mediated accessibility tools. The tension between information richness and cognitive load, particularly around caption density in fast-paced scenes, provides evidence-based guidance for captioning best practices. Limitations include the focus on short-form content (clips of 2 to 8 minutes) and the reliance on GPT-4o-mini, which can produce inconsistent outputs.

Tags: closed captioning · non-speech information · caption customization · deaf and hard of hearing · generative AI · personalization · media accessibility · creator tools · large language models

Standards referenced: DCMP Captioning Key · WAI Captions/Subtitles Guidelines