μCap: Instrumental Music Captions for Deaf and Hard-of-Hearing Individuals

SooYeon Ahn, In-Chang Baek, KyungJoong Kim, Khai N. Truong, Jin-Hyuk Hong · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3790729

Summary

Ahn and colleagues introduce μCap (Music Captions), an automatic captioning system that makes instrumental music accessible to Deaf and Hard-of-Hearing (DHH) audiences. It produces time-aligned, non-lexical text: syllable-like strings such as 'Tta-da-ding' or 'dun-dun tung', augmented with simple visual effects (font-size shifts for loudness, baseline shifts for pitch). The authors argue that while captioning for vocal songs has matured (lyrics are displayed), instrumental passages are typically dismissed with tags like '[instrumental music]' or '[Music]', leaving DHH viewers excluded from the affective content of classical, jazz, and orchestral works. The paper reports a multi-stage mixed-methods study. First, a preliminary online survey with 21 DHH and 21 hearing participants probed caption preferences and visual-mapping strategies for volume and pitch. Second, five structured expert group discussions (two audio engineers, three linguists, two musicians; 7 weeks, 100–135 min per session) derived a phonetic-like schema that decomposes instrument sounds into transient and resonance components and maps them to Korean consonant–vowel combinations. Third, the authors implemented a retrieval-augmented generation (RAG) pipeline: librosa and Essentia extract audio features (pitch, tempo, RMS loudness, onset strength, envelope), a TensorFlow EffNet classifier predicts instruments, GPT-4o-audio generates a draft description, and Chroma DB retrieves similar human-annotated captions from a 3,060-clip Korean dataset to ground the final caption. Two user evaluations followed: User Evaluation 1 (n = 20 DHH + 5 hearing) ranked five caption variants, and User Evaluation 2 (n = 15 DHH) ran an ablation comparing μCap against spectrogram, descriptive-text, and no-text/no-visualization variants.
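The retrieval step of the pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses Chroma DB, whereas this example swaps in a plain cosine-similarity search over hand-written feature vectors so that it runs with the standard library alone. The feature values, the three-dimensional layout (tempo, loudness, onset strength, each normalized to 0..1), the 'woo-oong' caption, and the function names are all hypothetical.

```python
import math

# Hypothetical annotated store: each entry pairs a human-written caption with
# an assumed, normalized feature vector [tempo, loudness, onset_strength].
ANNOTATED_STORE = [
    {"caption": "Tta-da-ding", "features": [0.9, 0.8, 0.9]},
    {"caption": "dun-dun tung", "features": [0.4, 0.9, 0.7]},
    {"caption": "woo-oong", "features": [0.2, 0.3, 0.1]},  # made-up example
]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def retrieve_similar(query_features, store, k=2):
    """Return the k store entries whose features are closest to the query."""
    ranked = sorted(
        store,
        key=lambda entry: cosine_similarity(query_features, entry["features"]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(draft_description, examples):
    """Combine a draft description with retrieved captions into one prompt."""
    example_lines = "\n".join(f"- {e['caption']}" for e in examples)
    return (
        f"Draft description: {draft_description}\n"
        f"Similar human-annotated captions:\n{example_lines}\n"
        "Write a sound-mimetic caption in the same style."
    )

if __name__ == "__main__":
    query = [0.85, 0.75, 0.95]  # assumed features of a new clip
    neighbours = retrieve_similar(query, ANNOTATED_STORE, k=2)
    print(build_prompt("bright, fast percussive hit", neighbours))
```

The point of the grounding step is that the language model sees captions people actually wrote for acoustically similar clips, rather than generating syllables unconstrained.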

Key findings

In the preliminary survey, 16 of 21 DHH participants expressed interest in instrumental music, and captions were the most-preferred support modality (9/21) versus sign-language interpretation (2/21). DHH and hearing participants converged on font-size change as the best mapping for loudness and baseline shift as the best mapping for pitch. In User Evaluation 1, manual human-annotated captions ranked best (mean rank 1.85), followed by μCap-zero (2.70), μCap-mini (3.22), μCap (3.30), and a rule-based heuristic (4.18); μCap significantly outperformed the heuristic and the μCap-zero and μCap-mini variants (Wilcoxon signed-rank tests, Holm-corrected, with p-values ranging from < .05 to < .01), indicating that both expert-informed guidelines and RAG contribute. Eighteen of 20 DHH participants reported an enhanced listening experience with μCap. In User Evaluation 2, μCap received significantly higher scores than descriptive text and spectrogram for conveying rhythm (Friedman p = .0039), and captions drove significantly stronger immersion than waveform visualization (M = 5.43 vs 4.19, paired t-test, p < .001). The ablation showed that removing text hurt instrument-sound awareness while removing visualization hurt loudness/rhythm perception, so text and visualization capture complementary aspects. DHH participants also reported caveats: unfamiliar syllable sequences could feel distracting, and some wanted color, vibration, or instrument-specific captions. Drums were the most tractable instrument (transient→plosive, resonance→vowel); strings were the hardest (no clear convention).
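For readers unfamiliar with the Holm correction mentioned above, the following sketch shows the step-down adjustment applied to a set of raw pairwise p-values. The input p-values are hypothetical and are not taken from the paper; in practice the raw values would come from a test such as the Wilcoxon signed-rank test, which this sketch does not compute.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original input order.

    The i-th smallest raw p-value (0-indexed) is multiplied by (m - i),
    capped at 1.0, and then forced to be at least as large as every
    adjusted value before it so the adjusted sequence is monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

if __name__ == "__main__":
    # Hypothetical raw p-values for three pairwise comparisons.
    raw = [0.01, 0.04, 0.03]
    for p_raw, p_adj in zip(raw, holm_adjust(raw)):
        print(f"raw={p_raw:.3f}  holm={p_adj:.3f}")
```

A comparison is then declared significant when its adjusted p-value falls below the chosen alpha level, which controls the family-wise error rate across all pairwise comparisons.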

Relevance

For accessibility practitioners working on media captioning, the paper offers an empirical case that instrumental music, historically treated as uncaptionable, can be rendered in text in ways DHH audiences actually value, provided the captions abandon narrative description in favor of sound-mimetic (scat-like) transcription grounded in phonetic principles. The design implications are concrete: map loudness to font size and pitch to baseline shift; decompose percussive sounds into transient/resonance consonant–vowel pairs; offer synchronized text plus visual modulation rather than either alone; and treat captioning as an expressive, expertise-heavy task that benefits from RAG grounding rather than raw LLM generation. Practitioners in streaming platforms, live event captioning, and accessible media standards should consider whether existing caption specifications (CEA-608/708, WebVTT, IMSC) can carry pitch/loudness styling metadata. Key limitations: the study used Korean (whose script, Hangul, is phonemic and written in syllable blocks, making it well suited to phonetic mimicry), evaluated only classical and jazz-adjacent orchestral music, relied on professional orchestral clips rather than live performance, used GPT-4o-audio as a closed dependency, and recruited DHH participants who were literate caption readers. These choices leave open questions about multilingual generalization (especially to English and other languages written in the Latin alphabet), pop/electronic genres, and Deaf signers who do not rely on reading.
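The first design implication (loudness to font size, pitch to baseline shift) can be sketched as a small rendering helper. This is an illustration under stated assumptions, not the paper's renderer: the HTML/CSS output format, the pixel and em ranges, the 0..1 normalization of loudness and pitch, and the function names are all hypothetical choices made for this example.

```python
import html

# Assumed styling ranges; the paper does not specify these values.
MIN_FONT_PX, MAX_FONT_PX = 16, 48
MAX_SHIFT_EM = 0.6  # largest vertical offset from the neutral baseline

def _clamp01(value):
    """Clamp a number into the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))

def style_syllable(text, loudness, pitch):
    """Render one caption syllable as an HTML span.

    loudness and pitch are assumed to be pre-normalized to 0..1.
    Louder syllables get a larger font; higher pitch moves the text up
    (negative CSS `top`), lower pitch moves it down, and 0.5 is neutral.
    """
    loudness = _clamp01(loudness)
    pitch = _clamp01(pitch)
    font_px = round(MIN_FONT_PX + loudness * (MAX_FONT_PX - MIN_FONT_PX))
    shift_em = round((0.5 - pitch) * 2 * MAX_SHIFT_EM, 2)
    return (
        f'<span style="font-size:{font_px}px; position:relative; '
        f'top:{shift_em}em">{html.escape(text)}</span>'
    )

def render_caption(syllables):
    """Join (text, loudness, pitch) triples into one styled caption line."""
    return "".join(style_syllable(t, l, p) for t, l, p in syllables)

if __name__ == "__main__":
    # Hypothetical per-syllable values for the caption 'Tta-da-ding'.
    line = render_caption([("Tta-", 0.9, 0.4), ("da-", 0.5, 0.5), ("ding", 0.7, 0.9)])
    print(line)
```

Whether an equivalent per-syllable styling can be expressed in a given caption format depends on that format's styling model, which is the open question the paragraph above raises for CEA-608/708, WebVTT, and IMSC.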

Tags: deaf and hard of hearing · DHH · captions · closed captions · music accessibility · instrumental music · sound visualization · retrieval augmented generation · RAG · large language models · generative AI · multimodal access · onomatopoeia · mimetic language · media accessibility · Korean