Fuzzy Feelings: Arousal's Interpretive Noise and the Case for Acoustic-Based Haptics
Caluã de Lacerda Pataca, Stephanie Patterson, Roshan L Peiris, Matt Huenerfauth · 2026 · CHI '26: Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems · doi:10.1145/3772318.3793421
Summary
This CHI 2026 paper from a team at Rochester Institute of Technology and Birmingham City University tackles a persistent gap in captioning: traditional captions carry the words but strip out the emotional tone, rhythm, and vocal affect that hearing viewers absorb automatically. Prior work, including the authors' own earlier studies, has tried to fill this gap by modulating caption typography and adding haptic (vibration) feedback driven by the 'arousal' score of a speech-emotion-recognition (SER) model. The authors argue that this strategy rests on a shaky construct: arousal is poorly defined in psychology, conflated with loudness in perception, and computed by opaque transformer models prone to demographic bias.
The paper reports a two-part mixed-methods study with 14 Deaf and Hard-of-Hearing (d/DHH) participants (ages 19-72; primary communication modes spanning ASL, spoken English, and lipreading) wearing a wrist-mounted Acouve Vp2 vibrotactile driver. Part 1 replicates an arousal-driven captioning and haptic system: across 48 short video clips, per-word arousal values from a transformer SER model modulate caption font weight (Recursive typeface, 300-900 weight axis) and drive the intensity of a 75 Hz vibration. Part 2 abandons emotion inference entirely and instead tests five direct acoustic-to-haptic mappings (UNFILTERED, PITCH-NORMALIZED, PULSE, SAWTOOTH, and PITCH-EXAGGERATED) that translate pitch, rhythm, and waveform features of the raw speech signal into vibration patterns. Participants rated how well each pattern conveyed four discrete emotions (anger, happiness, sadness, calmness) drawn from the RAVDESS emotional-speech corpus; Friedman tests with post-hoc Wilcoxon comparisons quantified the differences.
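The Part 1 pipeline described above can be pictured with a minimal sketch. It is not the authors' implementation: it assumes arousal arrives normalized to [0, 1], that both mappings are linear, and that the haptic driver accepts a sampled waveform. The function names, the 8 kHz sample rate, and the example words are hypothetical; only the 300-900 weight range and the 75 Hz frequency come from the summary.

```python
import math

WEIGHT_MIN, WEIGHT_MAX = 300, 900  # Recursive weight axis range given in the summary
VIBRATION_HZ = 75.0                # vibration frequency given in the summary
SAMPLE_RATE = 8000                 # assumption: output sample rate in Hz


def clamp01(x: float) -> float:
    """Clamp a value to the [0, 1] interval."""
    return max(0.0, min(1.0, x))


def arousal_to_weight(arousal: float) -> int:
    """Linearly map an arousal score (assumed in [0, 1]) to a font weight."""
    a = clamp01(arousal)
    return round(WEIGHT_MIN + a * (WEIGHT_MAX - WEIGHT_MIN))


def arousal_to_vibration(arousal: float, duration_s: float) -> list[float]:
    """Return a 75 Hz sine wave whose peak amplitude equals the clamped arousal."""
    amplitude = clamp01(arousal)
    n = int(duration_s * SAMPLE_RATE)
    return [
        amplitude * math.sin(2 * math.pi * VIBRATION_HZ * i / SAMPLE_RATE)
        for i in range(n)
    ]


# Hypothetical per-word arousal scores, as an SER model might emit them.
words = [("I", 0.2), ("said", 0.5), ("stop", 0.95)]
styled = [(word, arousal_to_weight(a)) for word, a in words]
```

The sketch also shows the design the paper questions: a single scalar turns one monotonic dial for both weight and intensity, so any ambiguity in what 'arousal' means is passed straight through to the caption and the wrist.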
Key findings
Part 1 exposed the 'fuzziness' of arousal as a construct. Participants held divergent mental models: some mapped stronger vibration to anger or excitement, others to neutral steadiness, and many conflated 'more vibration' with loudness rather than emotional energy. Three cross-cutting themes emerged: (1) arousal is ambiguous and inconsistently understood; (2) haptic meaning-making is subjective and phenomenological, with rhythm, contrast, and 'feel' mattering more than monotonic intensity; and (3) cross-modal consistency between haptic, visual, and semantic cues is essential for trust, since conflicting cues caused confusion and reduced confidence in the system.
In Part 2, Friedman tests found significant differences across the five acoustic mappings for all four emotions (anger χ²=23.20, p<.001; calmness χ²=13.80, p=.008; happiness χ²=18.39, p=.001; sadness χ²=9.77, p=.045). PULSE rated well for both high-arousal emotions (median 5.0 for anger, 4.0 for happiness); SAWTOOTH was associated with anger (median 5.5); PITCH-NORMALIZED worked best for calm, low-arousal speech (median 4.5 for calmness versus 3.0 for anger); and the raw UNFILTERED signal read as calm (median 4.5 for calmness). No single pattern dominated, which suggests that multi-parameter mappings with contrastive texture are more promising than monotonic 'more-is-more' intensity dials. Participants consistently asked for user control over when haptics are active, citing social acceptability concerns in work meetings, medical appointments, and public spaces.
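As a sanity check on the Part 2 statistics, the reported p-values can be recomputed from the χ² values. The sketch below assumes the paper used the usual asymptotic chi-square approximation for the Friedman test, with degrees of freedom equal to the number of mappings minus one (5 − 1 = 4). For even degrees of freedom the upper-tail probability has a closed form, so no statistics library is needed. The function name is hypothetical.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable.

    Uses the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!,
    which is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms


K_MAPPINGS = 5       # five acoustic-to-haptic mappings in Part 2
DF = K_MAPPINGS - 1  # assumption: standard Friedman degrees of freedom

# Chi-square statistics as reported in the summary.
reported = {"anger": 23.20, "calmness": 13.80, "happiness": 18.39, "sadness": 9.77}
p_values = {emotion: chi2_sf_even_df(chi2, DF) for emotion, chi2 in reported.items()}

for emotion, p in p_values.items():
    print(f"{emotion}: chi2={reported[emotion]:.2f}, p={p:.4f}")
```

Under these assumptions the recomputed values agree with the reported ones to within about .001, with the small gap for sadness explained by the χ² statistic being rounded to two decimals. Sadness is also the weakest effect of the four, sitting just under the .05 threshold.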
Relevance
This paper is immediately relevant to anyone designing captioning systems, media-accessibility tools, or affective wearables for d/DHH users. Its central theoretical move, dropping 'arousal' as a design target in favor of directly perceivable acoustic features (pitch, rhythm, waveform texture), offers a cleaner, computationally lighter path than SER-driven pipelines and sidesteps the well-documented demographic biases in emotion-recognition models. The four concrete recommendations (PITCH-NORMALIZED for low arousal, PULSE for high arousal, SAWTOOTH for anger-type low-valence emotions, and avoiding haptics that conflict with facial or caption cues) give practitioners a starting palette for prototyping emotion-conveying captions. The insistence on user control, with haptics that are opt-in and context-aware, reinforces a long-standing accessibility principle: 'helpful' multimodal cues can become unwanted signals in the wrong setting. Limitations include a modest sample (n=14, skewed older with a mean age of 43), short stimulus clips (2-12 seconds), lab-only evaluation, and the absence of comprehension measures. Longitudinal and in-the-wild studies are needed to confirm whether these haptic patterns survive real-world distraction and habituation.
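For prototyping, the recommendations above could be encoded as a small lookup with an opt-in gate. This is an illustrative sketch only: the cue category names, the context labels, the `HapticSettings` class, and the choice to mute the three contexts by default are all assumptions, not part of the paper. Only the pattern names and the cue-to-pattern pairings come from the summary.

```python
from dataclasses import dataclass
from typing import Optional

# Starting palette from the summary's recommendations.
# The dictionary keys are hypothetical category labels.
PALETTE = {
    "low_arousal": "PITCH-NORMALIZED",
    "high_arousal": "PULSE",
    "anger": "SAWTOOTH",
}


@dataclass
class HapticSettings:
    # Opt-in: haptics stay off until the user enables them.
    enabled: bool = False
    # Assumption: contexts participants flagged as socially sensitive are muted by default.
    muted_contexts: frozenset = frozenset(
        {"work_meeting", "medical_appointment", "public_space"}
    )


def choose_pattern(
    cue: str,
    context: str,
    settings: HapticSettings,
    conflicts_with_visual: bool = False,
) -> Optional[str]:
    """Return a pattern name, or None when haptics should stay silent."""
    if not settings.enabled or context in settings.muted_contexts:
        return None
    if conflicts_with_visual:
        # Fourth recommendation: avoid haptics that contradict facial or caption cues.
        return None
    return PALETTE.get(cue)
```

For example, `choose_pattern("high_arousal", "home", HapticSettings(enabled=True))` returns `"PULSE"`, while the same call with default settings returns `None` because haptics have not been switched on.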
Tags: captioning · expressive captions · haptic feedback · vibrotactile · affective computing · speech emotion recognition · deaf and hard of hearing · DHH · arousal · valence · multimodal accessibility · prosody
Standards referenced: WCAG 2.1 · WebVTT