Tactile Emotions: Multimodal Affective Captioning with Haptics Improves Narrative Engagement for d/Deaf and Hard-of-Hearing Viewers

Caluã de Lacerda Pataca, Saad Hassan, Lloyd May, Michelle M Olson, Toni D'aurio, Roshan L Peiris, Matt Huenerfauth · 2025 · Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25) · doi:10.1145/3706598.3713304

Summary

This paper explores a multimodal approach to affective captioning (captions that convey not just the words a speaker says but also the emotional qualities of their voice) for d/Deaf and Hard-of-Hearing (dhh) viewers. Prior work has shown that typographic modulations such as font color, font weight, and font size can effectively communicate valence (positive vs. negative tone) but consistently struggle to represent arousal (emotional intensity). The authors propose encoding arousal as haptic feedback on a wrist-worn voice-coil device, paired with typographic cues for valence. The work is grounded in Russell's circumplex model of emotion, which maps emotional states onto valence and arousal axes.

The methodology follows a two-study design. Study 1 is a formative investigation with 16 dhh participants that compared six haptic patterns (three rhythmic configurations crossed with two frequencies, 75 Hz and 250 Hz) using best-worst scaling and TrueSkill ratings. Study 2, with 27 dhh participants, took the winning pattern and compared five captioning conditions: a neutral baseline, visuals-only valence, visuals-only valence plus arousal, visuals for valence plus haptics for arousal, and haptics for arousal alone. Narrative engagement in each condition was measured with the 12-item Busselle and Bilandzic Narrative Engagement scale.

The technical pipeline combines OpenAI's Whisper for word-level timestamped transcription, Wagner et al.'s toolkit for per-word valence and arousal prediction, WebVTT output, the variable Recursive typeface for weight modulation, and a ChucK-generated audio signal driving an Acouve Vp2 vibro-transducer housed in a 3D-printed wristband.
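To make the caption-file step of the pipeline concrete, the following is a minimal sketch of how per-word timestamps and valence scores could be written out as WebVTT. It is not the authors' code. The `Word` record, the 0 to 1 valence range, the 0.4 and 0.6 cut-offs, the class names, and the one-cue-per-word layout are all assumptions made for illustration; the transcription and emotion-prediction steps are replaced by hand-written input, and arousal is left out because this sketch only covers the visual valence cue.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical per-word record; field names are assumptions, not the paper's."""
    text: str
    start: float    # seconds from the start of the video
    end: float      # seconds from the start of the video
    valence: float  # assumed to be normalised to the range 0..1


def vtt_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def valence_class(valence: float) -> str:
    """Map a valence score to a class name (illustrative cut-offs, not from the paper)."""
    if valence < 0.4:
        return "negative"
    if valence > 0.6:
        return "positive"
    return "neutral"


def escape_cue_text(text: str) -> str:
    """Escape the characters that WebVTT cue text treats as markup."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def to_webvtt(words: list[Word]) -> str:
    """Emit one cue per word, wrapping the word in a WebVTT class span."""
    lines = ["WEBVTT", ""]
    for w in words:
        lines.append(f"{vtt_timestamp(w.start)} --> {vtt_timestamp(w.end)}")
        lines.append(f"<c.{valence_class(w.valence)}>{escape_cue_text(w.text)}</c>")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [Word("I", 0.0, 0.2, 0.5), Word("love", 0.2, 0.6, 0.9), Word("this", 0.6, 0.9, 0.7)]
    print(to_webvtt(demo))
```

A player stylesheet could then style the `positive`, `neutral`, and `negative` classes through the `::cue` selector. A production caption file would more likely group words into phrase-length cues and style individual words inside each cue.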

Key findings

Study 1 established a clear preference: participants favored a single short pulse per word at 75 Hz to encode arousal, with amplitude modulated by intensity. Both longer continuous pulses and repeated multiple pulses scored lower, and the 250 Hz frequency was consistently described as physically uncomfortable; one participant compared it to 'scratching on a blackboard.'

Study 2 found that the combined haptics-plus-visuals condition produced significantly higher narrative engagement than both the conventional captioning baseline (p = 0.02) and the visuals-only affective captioning style (p = 0.01). The combined condition also outperformed the haptics-alone condition, indicating that intermodal integration matters: haptic signals by themselves did not carry enough perceptual salience to convey arousal.

Qualitative themes were mixed. Many participants reported that haptics deepened empathy, spatial presence, and emotional connection to the speaker, with some saying they 'truly felt in the scene.' Others, however, found constant vibrations distracting or overwhelming, particularly during fast-paced dialogue or when the felt intensity did not match visible facial expressions. Three tensions emerged from the thematic analysis: emotional connection versus distraction, skepticism about treating affect as objective information (given that emotion is interpreted subjectively), and contextual fit. On the last point, several participants suggested that haptics might be best applied selectively, for example at emotional climaxes or in horror and sci-fi content, rather than continuously throughout a video.
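The preferred pattern from Study 1 (one short 75 Hz pulse per word, with amplitude following arousal) can be sketched as a small signal generator. This is an illustrative Python stand-in, not the authors' ChucK program. The sample rate, the 0.1 second pulse length, the 0 to 1 arousal range, and the Hann-shaped envelope are assumptions; the summary only states that the pulse is short and that its amplitude is modulated by intensity.

```python
import math

SAMPLE_RATE = 44_100   # assumed audio sample rate in Hz
PULSE_HZ = 75.0        # frequency participants preferred in Study 1
PULSE_SECONDS = 0.1    # hypothetical duration; the source only says "short"


def word_pulse(arousal: float) -> list[float]:
    """Return one 75 Hz sine burst whose peak amplitude follows arousal.

    Arousal is assumed to lie in 0..1 and is clamped to that range.
    """
    amplitude = min(max(arousal, 0.0), 1.0)
    n = round(SAMPLE_RATE * PULSE_SECONDS)
    samples = []
    for i in range(n):
        t = i / SAMPLE_RATE
        # Hann envelope: fades the burst in and out so it starts and ends at zero.
        envelope = 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
        samples.append(amplitude * envelope * math.sin(2.0 * math.pi * PULSE_HZ * t))
    return samples


if __name__ == "__main__":
    calm = word_pulse(0.2)
    intense = word_pulse(0.9)
    print(len(calm), max(abs(s) for s in calm), max(abs(s) for s in intense))
```

In a playback setup, each burst would be placed at its word's start time in an audio track that drives the transducer, so that a higher-arousal word is felt as a stronger tap.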

Relevance

For captioning practitioners and accessible-media designers, this research directly challenges the assumption that captions need only transcribe words and identify non-speech sounds. It offers concrete, measurable evidence that the non-verbal emotional information dhh viewers lose (vocal tone, prosody, intensity) can be systematically added through typographic modulation paired with wearable haptics, and that doing so significantly increases narrative engagement. The work has practical implications for streaming platforms, broadcast captioning workflows, and the design of wearable accessibility technology. It points toward user personalization (letting viewers toggle or tune haptic intensity), selective application (vibrating only at meaningful emotional thresholds), and careful consideration of genre and viewing context.

Limitations include short-duration stimuli, a controlled lab environment, pre-generated rather than live signals, and the choice of font weight as the only visual encoding of arousal (font size was not tested). Future practitioner-facing work should explore live captioning pipelines, multi-speaker scenes, variable screen sizes, and whether prolonged exposure shifts the balance between engagement and annoyance.
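The personalization and selective-application ideas above can be illustrated with a small gating function. This is a hypothetical sketch of a design direction the paper suggests, not something the authors built or evaluated. The function name, the 0.6 default threshold, and the 0 to 1 arousal range are assumptions.

```python
def haptic_gain(
    arousal: float,
    threshold: float = 0.6,   # hypothetical cut-off below which no vibration is sent
    user_gain: float = 1.0,   # viewer-controlled intensity multiplier
    enabled: bool = True,     # viewer-controlled on/off toggle
) -> float:
    """Return the vibration strength (0..1) to use for one word.

    Words below the arousal threshold, or any word while haptics are
    switched off, produce no vibration at all.
    """
    if not enabled or arousal < threshold:
        return 0.0
    return min(max(arousal * user_gain, 0.0), 1.0)


if __name__ == "__main__":
    for a in (0.3, 0.7, 0.95):
        print(a, haptic_gain(a), haptic_gain(a, user_gain=0.5))
```

Raising the threshold restricts vibrations to emotional peaks, while `user_gain` and `enabled` give the viewer direct control over how strongly, and whether, those peaks are felt.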

Tags: affective captioning · haptics · deaf and hard of hearing · multimodal interaction · captions · narrative engagement · accessibility research · wearable technology · typography · emotion

Standards referenced: WebVTT