Do You See What I See? Designing a Sensory Substitution Device to Access Non-Verbal Modes of Communication
M. Iftekhar Tanveer, A. S. M. Iftekhar Anam, Mohammed Yeasin, Majid Khan · 2013 · Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '13) · doi:10.1145/2513383.2513438
Summary
This paper presents iFEPS (improved Facial Expression Perception through Sound), a visual-to-auditory sensory substitution device that lets blind users perceive their conversation partner's facial expressions through audio feedback. Research suggests that 80% of information in social communication is conveyed non-verbally through facial expressions, hand gestures, body pose, and proximity, all of which are inaccessible to blind individuals. iFEPS captures the interlocutor's face with a smartphone camera and transmits the frames to a server, where a Constrained Local Model face tracker (tracking 64 landmark points) extracts four facial features: eyebrow height (BrowH), eye opening (EyeO), distance between lips (LipH), and distance between lip corners (LipCD). These features map to Facial Action Coding System (FACS) Action Units covering brow raising, squinting, blinking, mouth opening, smiling, and lip pursing. The system detects "up" and "down" events in these features and reports them as audio feedback. Development proceeded in two phases: an initial "in-lab" design using short tonal feedback (300 ms tones at different pitches), followed by participatory design with three blind users from the Clovernook Center for the Blind in Memphis, TN, who formed a design team that met weekly. The team reported that tonal sounds were too cognitively demanding (reaching at most 85% accuracy), so the system was redesigned to use speech feedback (e.g., "mouth open", "eyebrow up"), which approached 100% recognition accuracy.
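The "up" and "down" event detection described above can be illustrated with a minimal sketch. The summary does not give the paper's actual detection rule, so everything here is an assumption made for illustration: the EventDetector class, the relative threshold, and the smoothed neutral baseline are all hypothetical. The sketch reports an event only when a feature value first leaves a band around its resting baseline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EventDetector:
    """Flags "up"/"down" events when a feature leaves a band around its baseline.

    Hypothetical helper: the threshold rule is an assumption, not the paper's method.
    Feature values are assumed to be positive distances (e.g. LipH in pixels).
    """
    threshold: float = 0.2   # assumed: relative change that counts as an event
    alpha: float = 0.1       # assumed: smoothing factor for the neutral baseline
    baseline: Optional[float] = None
    state: str = "neutral"

    def update(self, value: float) -> Optional[str]:
        # The first sample only initialises the baseline.
        if self.baseline is None:
            self.baseline = value
            return None
        change = (value - self.baseline) / self.baseline
        if change > self.threshold:
            new_state = "up"
        elif change < -self.threshold:
            new_state = "down"
        else:
            new_state = "neutral"
            # Adapt the baseline only while the feature is at rest.
            self.baseline = (1 - self.alpha) * self.baseline + self.alpha * value
        # Report an event only on entering "up" or "down", not while staying there.
        event = new_state if new_state not in ("neutral", self.state) else None
        self.state = new_state
        return event


# Toy LipH trace (made-up values): mouth opens, closes, then lips press together.
lip_h = EventDetector()
samples = [10, 10, 10, 14, 14, 10, 7, 10]
events = [lip_h.update(v) for v in samples]
```

In a full pipeline, one such detector per feature (BrowH, EyeO, LipH, LipCD) would feed a lookup from (feature, direction) to a spoken phrase such as "mouth open" or "eyebrow up".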
Key findings
Evaluation with 14 participants (7 blind, 7 blindfolded sighted, ages 24-63) across three half-hour sessions per phase showed a marked improvement from the tone-based to the speech-based prototype. Phase 1 (tones) averaged 84.17% accuracy in identifying facial events; Phase 2 (speech) averaged 97.72%, a statistically significant improvement (p < 0.01 by both Welch's t-test and a paired t-test). Participants found the speech feedback universally easier to learn and distinguish, though it introduced a tradeoff: a spoken phrase takes longer to play than a tone, so rapid consecutive facial movements may go unreported. Subjective evaluation by the 7 blind participants scored 4.02/5.0 overall, with high marks for learnability and ease of distinguishing feedback. However, participants raised significant social concerns: pointing a smartphone camera at someone's face during conversation felt awkward, was tiring (from holding the phone), and could attract unwanted attention. Users suggested glasses-embedded cameras (like Google Glass) as a more discreet form factor. Other design lessons: the system should work locally without depending on Wi-Fi; eye blinks were distracting, non-informative events that should be filtered out; and users wanted a wider range of non-verbal cues, including head nods and shakes, gender, ethnicity, and identity information. An important design decision was to convey facial expressions rather than inferred emotions, since the mapping from expression to emotion varies with culture and context.
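The tone-versus-speech tradeoff noted above can be made concrete with a small simulation. This is a sketch under stated assumptions, not the paper's dispatch logic: the dispatch function is hypothetical, the 800 ms speech duration is an assumed figure (only the 300 ms tone length comes from the summary), and the policy of dropping events that arrive during playback is one plausible reading of "suppress".

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp in ms, feedback label)


def dispatch(events: List[Event], duration_ms: int) -> List[Event]:
    """Return the events that actually get played.

    Assumed policy: an event that arrives while earlier feedback is still
    playing is dropped rather than queued.
    """
    played = []
    busy_until = 0
    for timestamp, label in sorted(events):
        if timestamp >= busy_until:
            played.append((timestamp, label))
            busy_until = timestamp + duration_ms
    return played


# Toy event stream (made-up timestamps) with three facial events in under a second.
events = [(0, "eyebrow up"), (400, "mouth open"), (900, "eyebrow up")]
with_tones = dispatch(events, 300)    # 300 ms tones, as in the first prototype
with_speech = dispatch(events, 800)   # assumed 800 ms per spoken phrase
```

Under these assumptions the short tones report all three events, while the longer spoken phrases drop the event at 400 ms because the first phrase is still playing.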
Relevance
This paper tackles a fundamental social accessibility challenge: blind people's inability to perceive the non-verbal communication that dominates face-to-face interaction. That gap affects everything from casual conversation to job interviews to romantic relationships. The participatory design process yielded critical insights: the shift from tonal to speech feedback, prompted by blind users' feedback, dramatically improved accuracy and illustrates why involving the target population in design is essential, not optional. The decision to focus on facial expressions rather than emotions is a thoughtful design choice that avoids the significant problems of automated emotion recognition (cultural bias, low accuracy, ethical concerns). For practitioners, the social acceptability findings are as important as the technical results: even a perfectly functioning system will fail if users are embarrassed to use it. This finding supports the growing focus on socially acceptable assistive technology design. The work anticipates modern AI-powered visual description tools like Be My Eyes and Seeing AI, while highlighting the specific challenge of real-time, continuous access to non-verbal communication, which static image descriptions cannot address.
Tags: sensory substitution · blind users · facial expression · non-verbal communication · computer vision · sonification · participatory design · smartphone · social interaction · face tracking · FACS