Enhancing Caption Accessibility through Simultaneous Multimodal Information: Visual-Tactile Captions

Raja S. Kushalnagar, Gary W. Behm, Joseph S. Stanislow, Vasu Gupta · 2014 · Proceedings of the 16th International ACM SIGACCESS Conference on Computers & Accessibility (ASSETS) · doi:10.1145/2661334.2661381

Summary

This paper addresses a fundamental limitation of captions (subtitles) for deaf and hard of hearing (DHH) viewers: captions force viewers to split attention between reading text at the bottom of the screen and watching the visual action, so they inevitably miss information. Research shows DHH viewers spend 84% of their time reading captions and less than 16% watching scenes, whereas hearing viewers spend close to 100% of their time watching scenes while simultaneously processing audio. The problem is compounded for non-speech information (NSI), that is, sounds such as phone rings, footsteps, or objects falling. These sounds are difficult to represent in text, lack standardised caption formats, and require the viewer both to read the caption description and to locate the source in the scene.

The researchers, from the National Technical Institute for the Deaf (NTID) at RIT, developed two enhancements. Tactile captions transform auditory NSI into vibrotactile patterns delivered through a wrist-worn vibration motor (tactor), preserving the temporal envelope of the original sound: for example, a doorbell's two-peak pattern is replicated as two pulses of vibration. This allows viewers to feel non-speech sounds while continuing to read captions, leveraging the fact that tactile perception has excellent temporal resolution and minimal overlap with visual processing. Visual-tactile captions add a synchronous visual overlay on the NSI source in the scene (coloured wavy lines with width proportional to amplitude and length proportional to duration) and pause the video briefly to direct attention to the sound's location. The system uses a Windows laptop, an mbed NXP LPC1768 microcontroller, and a ROB-08449 vibration motor connected via Bluetooth, with a C# program that scans captions for bracketed NSI descriptions and triggers the corresponding vibration patterns from a pre-built library.
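
The caption-scanning step described above can be illustrated with a minimal sketch. It is written in Python rather than the authors' C#, and it is not their implementation: the `Cue` type, the function names, the pulse timings, and the contents of `PATTERN_LIBRARY` are all assumptions made for illustration. The sketch finds bracketed NSI descriptions in caption text and expands each known one into timed vibration pulses, skipping descriptions the library does not contain.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern library (not from the paper): each entry is a list of
# (on_ms, off_ms) pulses that roughly follows the sound's temporal envelope.
PATTERN_LIBRARY = {
    "doorbell ringing": [(300, 150), (300, 0)],   # two pulses for the two peaks
    "phone ringing": [(400, 200), (400, 200), (400, 0)],
    "footsteps": [(80, 220)] * 4,
}

# Matches text inside square brackets, e.g. "[Doorbell ringing]".
NSI_RE = re.compile(r"\[([^\[\]]+)\]")


@dataclass
class Cue:
    """One caption cue: start time in milliseconds and its text (assumed shape)."""
    start_ms: int
    text: str


def find_nsi(cue):
    """Return the normalised bracketed NSI descriptions found in a cue."""
    return [label.strip().lower() for label in NSI_RE.findall(cue.text)]


def schedule_vibrations(cues, library=PATTERN_LIBRARY):
    """Expand each known NSI label into (start_ms, duration_ms, label) pulses."""
    events = []
    for cue in cues:
        for label in find_nsi(cue):
            pattern = library.get(label)
            if pattern is None:
                continue  # no pre-built pattern for this description
            t = cue.start_ms
            for on_ms, off_ms in pattern:
                events.append((t, on_ms, label))
                t += on_ms + off_ms
    return events


if __name__ == "__main__":
    demo = [
        Cue(1000, "[Doorbell ringing] Who's there?"),
        Cue(5000, "Nobody home."),
    ]
    for event in schedule_vibrations(demo):
        print(event)
```

In a real system each scheduled pulse would be sent to the microcontroller driving the tactor; that transport layer is omitted here because the summary does not describe its protocol.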

Key findings

Two studies were conducted with DHH participants from NTID at RIT. A preliminary study with 27 participants (mean age 19.8) established the baseline problem: participants recalled on average only 4.73 of 7 NSI events and could describe only 3.87 of the items causing the sounds, confirming that standard captions inadequately convey non-speech information. The main experiment with 21 new DHH participants (mean age 19.5, all with hearing loss from birth) compared three conditions: regular captions, tactile captions, and visual-tactile captions. Visual-tactile captions were rated significantly higher than regular captions on ease of use (4.7 vs 4.2, p<.01), while tactile captions showed no significant difference from regular captions. For recall accuracy, tactile captions showed a 21.3% increase over regular captions in describing NSI events (p<.01), and visual-tactile captions showed a 30.9% increase (p<.01). The most dramatic improvement was in locating NSI sources within the scene: tactile captions yielded a 9.7% improvement over regular captions (p<.01), while visual-tactile captions yielded an 84.64% improvement (p<.01). Notably, none of the visual-tactile caption users and only three of the tactile caption users scored lower than at baseline, indicating reliable benefits. Participants appreciated being able to feel sounds directly rather than reading abstract descriptions, with one noting: "Tactile captions let me feel the doorbell rather than just looking at the description: doorbell ringing." Some found tactile-only feedback distracting without the visual anchor, as they could not always connect the vibration to the scene content.

Relevance

This research challenges the implicit assumption that captions are a solved accessibility problem for DHH viewers. While mandated captioning (required in the US since 1991 under the ADA) has dramatically improved media access, the visual-only nature of captions creates an inherent attention-splitting problem that hearing viewers never experience. The paper argues that the original audio-visual simultaneity of film should be preserved in the accessible version; since DHH viewers cannot use audio, tactile feedback offers a viable second channel. For accessibility practitioners working on video and multimedia, the findings highlight specific deficiencies in how non-speech information is currently captioned: lack of standardised descriptions, inability to convey duration or intensity, and no indication of sound source location. The visual-tactile approach, which briefly pauses the video to highlight the NSI source visually while providing simultaneous tactile feedback, produced the strongest results and represents a model for future enhanced captioning systems. As wearable haptic devices (smartwatches, haptic wristbands) become more common, the technical barrier to deploying tactile captions has likely fallen since this 2014 study. The paper also offers a historical perspective, tracing accessible cinema from silent films (which were inherently accessible to deaf viewers) through intertitles, talkies, and modern captions.

Tags: captioning · deaf and hard of hearing · haptic feedback · multimodal interaction · non-speech information · sensory substitution · multimedia accessibility

Standards referenced: ADA