FAME: Exploring Expressive Facial Avatars for Lyrical and Non-Lyrical Music Visualization for d/Deaf Individuals

Suhyeon Yoo, Yifang Pan, Ashish Ajin Thomas, Karan Singh, Khai N. Truong · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3790402

Summary

This CHI 2026 paper investigates whether expressive facial avatars can carry the emotional and structural richness of music (rhythm, pitch, melody, lyrics, and emotion) to d/Deaf and Hard of Hearing (DHH) audiences in ways that captions and abstract visualizers cannot. The authors adopt an iterative, probe-based research-through-design approach grounded in the cultural model of disability and aural diversity perspectives. In Study 1 (formative, n=9 DHH participants recruited online), they presented photorealistic singing-head probes rendered with the VOCAL system and JALI across eight songs balanced by lyrical presence and emotional valence, then extracted three design requirements (DRs): combine captions and instrument cues with avatar performance; avoid photorealism in favor of stylized, emotionally legible features; and convey melody in non-lyrical music through scat singing. Guided by these DRs, the team built FAME (Facial Avatar for Musical Expression), a stylized cartoon avatar that lip-syncs to lyrics, scat-sings non-lyrical melodies with synthesized 'da' syllables, maps emotion to Ekman-based facial expressions, encodes pitch in jaw and mouth geometry, drives rhythm through head and upper-body motion, highlights active instruments with pulsing icons, and supports toggleable captions. Study 2 was a two-phase mixed-methods evaluation with 12 DHH participants: a comparison of FAME against ViTune (a baseline visualizer), followed by an exploration of applications across four musical contexts. Sessions were conducted over Zoom with live captioning, chat, or ASL interpreters, according to each participant's preference.

Key findings

In the comparison phase, participants matched audio to FAME visualizations correctly 95.8% of the time versus 66.7% for ViTune (t(11)=2.57, p=.026). ViTune errors were concentrated on lyrical songs, where abstract visuals obscured lyric tracking. More participants preferred FAME for comprehension (6 vs. 3), while enjoyment preferences split evenly (5 vs. 5): FAME won on emotion and lyric clarity, ViTune on structural clarity and efficiency. Across FAME features, lyrics (M=4, IQR=1.25), emotion (M=4, IQR=2.25), and rhythm (M=4, IQR=0.25) were rated as most effectively conveyed; pitch (M=3) and especially melody (M=2, IQR=1.0) were weakest, largely because the repeated 'da' scat syllable felt monotonous and unnatural to several participants. Captions were consistently essential for lyric comprehension and reduced the cognitive cost of lip-reading. As P6 put it: 'lip reading is hard… It's not straightforward, so it's a lot of work looking at every part of the face.' Participants envisioned avatars in three social roles: performers (embodying musical energy), interpreters (paralleling ASL interpreters on stage), and companions (co-experiencing music at parties and karaoke). Key unmet needs included hand and body movements and sign language integration (central to Deaf cultural performance), varied scat syllables anchored in jazz-scat phonotactics, avatar customization to reflect identity or performer likeness, and layered backgrounds to add musical context. Photorealistic avatars in Study 1 triggered uncanny-valley responses; stylization improved legibility but required careful anchoring to the voice source to avoid a puppet-like feel.
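As a sanity check on the reported comparison statistic, the two-tailed p-value for t(11)=2.57 can be recomputed from the Student t density. The sketch below is not from the paper: it assumes the test was two-tailed, and the helper names `t_pdf` and `two_tailed_p` are hypothetical. It uses only the Python standard library and integrates the density numerically with Simpson's rule.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t_stat: float, df: int, steps: int = 10_000) -> float:
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    t_abs = abs(t_stat)
    h = t_abs / steps
    total = t_pdf(0.0, df) + t_pdf(t_abs, df)
    for i in range(1, steps):
        # Simpson weights: 4 for odd interior points, 2 for even ones
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # probability mass between 0 and |t|
    # The density is symmetric, so each tail holds 0.5 - central
    return 2 * (0.5 - central)


if __name__ == "__main__":
    # Values taken from the summary above: t(11) = 2.57
    print(f"two-tailed p = {two_tailed_p(2.57, 11):.3f}")
```

Under the two-tailed assumption, the result should come out close to the reported p=.026. A one-tailed test would give half that value.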

Relevance

For accessibility practitioners working on media, broadcasting, or streaming, this paper offers a concrete counter-argument to the assumption that captions plus a spectrum visualizer are sufficient for music access. It operationalizes Deaf-cultural and aural-diversity framings into a working system and provides a refined set of design requirements that can transfer to other avatar or sign-language-avatar projects: layer captions and visualizations flexibly, stylize for emotional legibility while anchoring animation in real vocal production, extend expression to the upper body and hands, and treat scat as a rhythmic-melodic vocabulary that requires musical grounding rather than a single repeated syllable. The limitations are stated explicitly: all 12 main-study participants were already active captioning/visualizer users recruited online, songs were high-valence/high-arousal (biasing toward expressive benefit), evaluation was Zoom-based rather than in-the-wild, and the pipeline relies on multi-step preprocessing. The paper is also notable for its positionality statement, in which an all-hearing research team explicitly engages critical disability perspectives; it is useful reading for practitioners designing with, not merely for, Deaf communities.

Tags: Deaf music · DHH · music accessibility · facial avatar · music visualization · scat singing · captions · lip-sync · sign language · aural diversity · design probe · multimodal interaction · Deaf culture