Visualization of Speech Prosody and Emotion in Captions: Accessibility for Deaf and Hard-of-Hearing Users
Caluã de Lacerda Pataca, Matthew Watkins, Roshan Peiris, Sooyeon Lee, Matt Huenerfauth · 2023 · Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23) · doi:10.1145/3544548.3581511
Summary
This CHI 2023 paper tackles a dimension of captioning that has gone largely unaddressed for four decades: captions convey words but strip out the prosody and emotion carried by a speaker's voice. The authors argue that while automatic speech recognition (ASR) has reduced word error rates enough for many conferencing use cases, the resulting captions leave Deaf and Hard-of-Hearing (DHH) users with a flat, affectively dull stream of text that obscures mood, emphasis, sarcasm, and humour. The work is structured as two linked studies. Study 1 is a semi-structured interview study with eight DHH participants, recruited from DHH mailing lists and Reddit, about their experiences of automatic captions in meetings with hearing peers; some interviews were conducted in ASL with an interpreter, and the transcripts were analysed with iterative thematic analysis. Study 2 is an online comparative study with 16 DHH participants who watched videos in four caption styles (conventional captions plus three prototypes) and rated them on 7-point Likert scales. The three novel prototype styles map speech features to typographic parameters: a prosody style (loudness to font-weight, pitch to baseline shift, duration to letter-spacing), an emotion style (valence to color on a red-white-green scale, arousal to font-size, using the circumplex model of emotion), and a combined prosody + emotion style. Features were extracted from audio using Praat for prosody and a transformer-based neural network for valence/arousal, with forced alignment via Gentle to synchronise per-word timestamps.
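The feature-to-typography mappings described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the `WordFeatures` class, the function names, the input normalisation (0 to 1 for loudness, pitch, duration, and arousal; -1 to 1 for valence), the numeric output ranges, and the choice of red for negative and green for positive valence are all assumptions made for this example.

```python
from dataclasses import dataclass


def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


@dataclass
class WordFeatures:
    """Per-word features; the normalisation ranges are assumptions."""
    text: str
    loudness: float  # 0..1
    pitch: float     # 0..1
    duration: float  # 0..1
    valence: float   # -1..1
    arousal: float   # 0..1


def valence_to_rgb(valence: float) -> tuple:
    """Red-white-green scale: white at 0, saturating towards either end.

    Assumption: negative valence maps to red, positive to green.
    """
    v = clamp(valence, -1.0, 1.0)
    target = (255, 0, 0) if v < 0 else (0, 255, 0)
    return tuple(round(lerp(255, c, abs(v))) for c in target)


def prosody_style(w: WordFeatures) -> dict:
    """Loudness -> font weight, pitch -> baseline shift, duration -> letter spacing."""
    return {
        "font_weight": round(lerp(300, 900, clamp(w.loudness, 0, 1))),
        "baseline_shift_em": lerp(-0.2, 0.2, clamp(w.pitch, 0, 1)),
        "letter_spacing_em": lerp(0.0, 0.3, clamp(w.duration, 0, 1)),
    }


def emotion_style(w: WordFeatures) -> dict:
    """Valence -> colour, arousal -> font size."""
    return {
        "color_rgb": valence_to_rgb(w.valence),
        "font_size_em": lerp(0.8, 1.4, clamp(w.arousal, 0, 1)),
    }


def combined_style(w: WordFeatures) -> dict:
    """Prosody + emotion: the two mappings touch disjoint properties."""
    return {**prosody_style(w), **emotion_style(w)}
```

Because the two styles drive disjoint typographic properties, the combined style is simply the union of both dictionaries; in a real renderer the output ranges would need tuning against legibility.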
Key findings
Interviews surfaced four themes: captions' dull ambiguity, communication as an uphill battle, reliance on multimodal signals (facial expressions, body language), and a sense that different contexts call for different solutions. Participants regularly felt excluded, and some had normalised this to the point of accepting that they could not fully participate in certain conversations. In Study 2, the emotion-only style (E) outperformed conventional captions on both clarity of emotions/moods (median 6 vs 4) and clarity of emphasis (median 5 vs 3), with both differences statistically significant under Mann-Whitney U tests. The combined prosody + emotion style (P+E) also outperformed conventional captions on emotion clarity. Surprisingly, the prosody-only style (P) did not outperform conventional captions at representing emphasis, despite being built on a model previously validated with hearing users. Legibility was lower for all three new styles (conventional captions scored a median of 7 vs 6 for E), with baseline shift and letter-spacing specifically flagged as harder to read. Willingness to use E captions in work or personal meetings was comparable to that for conventional captions.
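For readers unfamiliar with the test named above, the Mann-Whitney U statistic for two groups of ordinal ratings can be computed as follows. This is a minimal pure-Python sketch, not the authors' analysis code: the function names are hypothetical, it computes only the U statistics (no p-value), and it handles ties by average ranks, which matters for Likert data where ties are common.

```python
def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U_a, U_b) for two samples of ordinal ratings."""
    a, b = list(a), list(b)
    ranks = average_ranks(a + b)
    rank_sum_a = sum(ranks[: len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


# Toy input, not data from the paper: every value in the second
# sample exceeds every value in the first, so U_a is 0.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # prints (0.0, 9.0)
```

U_a counts the pairs in which a rating from the first sample exceeds one from the second, with ties counted as half; a significance test would then compare the smaller U against its null distribution.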
Relevance
For accessibility practitioners, this paper offers concrete evidence that captions' inexpressiveness is a genuine barrier, not a cosmetic concern, and that color- and size-based affect cues can meaningfully close the gap at only a modest cost to legibility (median 6 vs 7 for the emotion style). The color palette was deliberately tuned for deuteranomaly and protanomaly, illustrating the need to consider intersecting sensory differences when designing for DHH users. The limitations are real: red-green palettes still fail for severe color vision deficiency, sample sizes are small (8 and 16), stimuli were single-speaker monologues rather than multi-party meetings, and the proactive surfacing of a speaker's emotional state raises speaker-autonomy concerns that the authors flag for future work. The work also challenges the CEA-608/708 status quo, where authoring tools remain stuck on the analog-era 608 standard.
Tags: captioning · deaf and hard of hearing · prosody · affective computing · videoconferencing accessibility · typography · automatic speech recognition · color accessibility · qualitative research
Standards referenced: CEA-608 · CEA-708