Caption Royale: Exploring the Design Space of Affective Captions from the Perspective of Deaf and Hard-of-Hearing Individuals
Caluã de Lacerda Pataca, Saad Hassan, Nathan Tinker, Roshan Lalintha Peiris, Matt Huenerfauth · 2024 · CHI '24: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems · doi:10.1145/3613904.3642258
Summary
This CHI 2024 paper from Rochester Institute of Technology and Tulane tackles a concrete design question: if we want captions to convey a speaker's emotion — not just their words — which typographic modulations should we use? Prior work had established that affective captions can help d/DHH viewers who otherwise miss the paralinguistic cues (tone, rhythm, loudness) hearing audiences absorb automatically, but the literature offered no systematic comparison of which visual styles actually work. The authors run three sequential studies with 39 total d/DHH participants, framed explicitly around the circumplex model of emotion (valence × arousal). Study 1 evaluates nine individual typographic styles — font-color, background-color, shadow-color, font-weight, baseline-shift, letter-spacing, font-size, opacity, and an emotional typeface — each conveying either valence or arousal, using Best-Worst Scaling with TrueSkill Bayesian ranking across 560 pairwise comparisons. Study 2 combines the top performers from Study 1 into six two-parameter styles that depict both dimensions simultaneously (e.g. font-color for valence paired with font-weight for arousal). Study 3 compares the four Study-2 finalists against an unstyled baseline using an emotion-recognition task (participants place a speaker's expressed emotion on an EmojiGrid — a valence/arousal 2D plane of emoji) and three NASA-TLX-style Likert items measuring ease of reading, distraction, and mental demand. Stimuli were short clips from the Stanford Emotional Narratives Dataset processed with the Gentle forced aligner and a transformer-based valence/arousal network to generate per-word ground-truth affective values.
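As a rough illustration of the Study 1 data, the sketch below scores Best-Worst Scaling trials. The paper ranks styles with TrueSkill; this block swaps in the simpler best-minus-worst counting score as a stand-in. It also shows one way a single trial could expand into pairwise comparisons, which is an assumption about the pipeline rather than something the summary confirms. The trial keys ("shown", "best", "worst") and the sample data are hypothetical.

```python
from collections import defaultdict


def bws_scores(trials):
    """Best-minus-worst counting score per style, divided by times shown.

    Counting scores are a simple stand-in here, NOT the TrueSkill ranking
    the paper reports. Each trial is a dict with hypothetical keys:
    "shown" (styles presented together), "best", and "worst".
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for trial in trials:
        for style in trial["shown"]:
            shown[style] += 1
        best[trial["best"]] += 1
        worst[trial["worst"]] += 1
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}


def implied_pairs(trial):
    """Expand one trial into the (winner, loser) pairs its picks imply.

    Assumption: the best pick beats every other shown style, and every
    other shown style beats the worst pick.
    """
    pairs = [(trial["best"], s) for s in trial["shown"] if s != trial["best"]]
    pairs += [
        (s, trial["worst"])
        for s in trial["shown"]
        if s not in (trial["best"], trial["worst"])
    ]
    return pairs


# Made-up trials for illustration only; not data from the paper.
example_trials = [
    {"shown": ["font-color", "opacity", "font-size", "letter-spacing"],
     "best": "font-color", "worst": "letter-spacing"},
    {"shown": ["font-color", "opacity", "font-size", "letter-spacing"],
     "best": "font-color", "worst": "opacity"},
]
print(bws_scores(example_trials))
print(implied_pairs(example_trials[0]))
```

Under this expansion a trial showing four styles yields five pairs, since the best-versus-worst pair is counted once.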
Key findings
Study 1 narrowed the nine candidate styles: for valence, font-color (TrueSkill μ=30.6) and shadow-color (μ=27.9) emerged as top choices; for arousal, shadow-color (μ=29.0), font-size (μ=27.7), font-color (μ=26.8), and font-weight (μ=26.6) tied. Split-half reliability was strong (ρ=0.92 valence, ρ=0.90 arousal). Study 2 found a four-way tie among the combined styles (written valence style first, arousal style second): font-color with font-weight, font-color with font-size, font-color with shadow-color, and shadow-color with font-color. Overall, participants preferred font-color to shadow-color as the carrier of valence. Participants articulated four decision criteria: ease of reading, low distraction, intuitiveness, and clarity of emotional representation. Study 3, the objective emotion-recognition test, revealed that only font-color-with-font-weight (distance correlation 0.21, Holm-Šidák-adjusted p<0.001) and font-color-with-font-size (distance correlation 0.14, p<0.05) significantly outperformed the baseline at conveying emotion; the two shadow-color styles did not. Subjectively, font-color-with-font-size rated highest for 'I understood the speaker's emotions' (median 6/7, p<0.01 vs baseline). However, font-color-with-font-size was also judged the most mentally demanding, suggesting a trade-off between emotional clarity and mental demand. All affective styles scored lower than baseline on legibility, so any deployment must accept a small readability cost. The authors recommend font-color for valence, paired with either font-weight (lower cognitive load) or font-size (higher emotional clarity) for arousal, and suggest users should be able to choose between the two.
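For readers unfamiliar with the Holm-Šidák adjustment mentioned above, here is a minimal sketch of the standard step-down procedure for adjusting a family of p-values. The input values are made up for illustration and are not the paper's raw p-values; the function name is hypothetical.

```python
def holm_sidak(pvalues):
    """Return Holm-Sidak step-down adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Sidak correction over the (m - rank) hypotheses not yet stepped past.
        adj = 1.0 - (1.0 - pvalues[idx]) ** (m - rank)
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, adj)
        adjusted[idx] = running_max
    return adjusted


# Illustrative input only: four hypothetical style-vs-baseline comparisons.
raw = [0.01, 0.04, 0.03, 0.2]
print(holm_sidak(raw))
```

A comparison is then declared significant when its adjusted p-value falls below the chosen alpha.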
Relevance
This paper is a direct design playbook for anyone shipping affective captions in streaming, conferencing, or classroom-captioning products. The two recommended styles (font-color for valence combined with either font-weight or font-size for arousal) give engineering teams a narrow, validated starting palette rather than requiring them to re-run empirical work. The paper also documents several design traps: shadow-color tested well in the preference studies but failed at the objective emotion-recognition task, showing that preference alone is not sufficient evidence for deployment; the emotional typeface and baseline-shift styles scored poorly on both axes; and every affective variant hurt legibility versus plain captions, so products should offer an off switch. Method-minded readers will find value in the three-stage funnel (preference ranking of single styles via Best-Worst Scaling and TrueSkill, then preference ranking of combined styles, then an EmojiGrid emotion-recognition test), which is reusable for evaluating any dense visual-accessibility design space. Limitations include the ASL/English North American d/DHH sample (color-emotion associations may differ in other cultures, e.g. red reads positively in China), short pre-processed clips (real-time and longer-form use unknown), and the lack of testing for user-configurable ranges, which participants repeatedly asked for. A follow-up line from the same group has extended this work into haptic modalities (see de Lacerda Pataca et al. 2026).
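A minimal sketch of how a product might apply the font-color plus font-weight style per word, with the off switch and a user-configurable range. It assumes valence and arousal are normalised to [0, 1]. The colors, weight limits, and function names are hypothetical defaults, not the palette or ranges used in the paper.

```python
def lerp(start, end, t):
    """Linear interpolation between start and end for t in [0, 1]."""
    return start + (end - start) * t


def word_style(valence, arousal, enabled=True,
               min_weight=300, max_weight=800,
               negative_rgb=(70, 110, 220), positive_rgb=(230, 120, 40)):
    """Return an inline CSS declaration string for one caption word.

    Assumes valence and arousal lie in [0, 1]; out-of-range values are
    clamped. All default colors and weights are placeholder choices.
    """
    if not enabled:
        # Off switch: fall back to the unstyled baseline caption.
        return ""
    v = min(max(valence, 0.0), 1.0)
    a = min(max(arousal, 0.0), 1.0)
    # Valence drives font-color, blending between the two endpoint colors.
    r, g, b = (round(lerp(n, p, v)) for n, p in zip(negative_rgb, positive_rgb))
    # Arousal drives font-weight within the user-configurable range.
    weight = round(lerp(min_weight, max_weight, a))
    return f"color: rgb({r}, {g}, {b}); font-weight: {weight}"


print(word_style(0.8, 0.3))
print(word_style(0.8, 0.3, enabled=False))
```

Exposing min_weight and max_weight as parameters is one way to offer the adjustable ranges that participants asked for; swapping font-weight for font-size would follow the same pattern.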
Tags: captioning · affective captions · expressive captions · typography · valence · arousal · deaf and hard of hearing · DHH · circumplex model · variable fonts · best-worst scaling · accessibility design
Standards referenced: WCAG 2.1