Variable frame rate for low power mobile sign language communication

Neva Cherniavsky, Anna C. Cavender, Richard E. Ladner, Eve A. Riskin · 2007 · Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '07) · doi:10.1145/1296843.1296872

Summary

This paper from the University of Washington MobileASL team (Neva Cherniavsky, Anna Cavender, Richard Ladner, and Eve Riskin) addresses a then-emerging problem: enabling Deaf people in the United States to hold real-time American Sign Language conversations over the cellular network using video phones. At the time, 3G coverage was patchy, and the dominant constraint was no longer bandwidth but the battery drain caused by simultaneously encoding, transmitting, receiving, and decoding video on a mobile device. The authors' core idea is a variable frame rate driven by the conversational structure of sign language: lower the frame rate when the user is "just listening" (turn-taking means one party is usually idle) and potentially raise it during fingerspelling (where each frame matters more). They first run a Wizard-of-Oz user study with six native signers from the Deaf Community using a Sprint PPC 6700 PDA-style video phone, evaluating ten H.264-encoded videos at frame-rate combinations such as 10/0, 10/1, 10/5, 10/10 fps (signing/not-signing) and 5/5, 5/10, 5/15, 10/10, 10/15, 15/15 fps (signing/fingerspelling). They then describe three real-time techniques for automatic activity detection that decide on the fly when to drop the frame rate: pixel differencing, a Support Vector Machine over H.264 motion vectors and skin-detected face/hand macroblocks, and a confidence-weighted combination of the two.
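The simplest of the three detectors, pixel differencing, can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold value, the use of a mean absolute difference, the function names, and the 10/1 fps policy are all assumptions chosen for illustration, and frames are modelled as plain nested lists of grayscale values.

```python
from typing import Sequence

Frame = Sequence[Sequence[int]]  # grayscale pixels, 0-255

# Hypothetical values for illustration only; the paper's tuned
# threshold is not stated in this summary.
DIFF_THRESHOLD = 8.0   # mean absolute difference that counts as "signing"
SIGNING_FPS = 10       # frame rate while the user is signing
NOT_SIGNING_FPS = 1    # reduced rate while the user is "just listening"


def mean_abs_diff(prev: Frame, curr: Frame) -> float:
    """Average absolute per-pixel difference between two same-sized frames."""
    total = 0
    count = 0
    for prev_row, curr_row in zip(prev, curr):
        for p, c in zip(prev_row, curr_row):
            total += abs(p - c)
            count += 1
    return total / count if count else 0.0


def is_signing(prev: Frame, curr: Frame,
               threshold: float = DIFF_THRESHOLD) -> bool:
    """Classify the current frame as signing if enough pixels changed."""
    return mean_abs_diff(prev, curr) > threshold


def target_fps(prev: Frame, curr: Frame) -> int:
    """Pick the encoder frame rate from the activity decision."""
    return SIGNING_FPS if is_signing(prev, curr) else NOT_SIGNING_FPS


# Tiny 2x2 example frames.
still = [[100, 100], [100, 100]]
moved = [[100, 140], [60, 100]]
print(target_fps(still, still))  # 1  (no motion, low frame rate)
print(target_fps(still, moved))  # 10 (mean difference 20 exceeds threshold)
```

A single global threshold like this depends on a stationary camera and a uniform background, which is one reason the paper also considers an SVM over motion vectors and skin-detected macroblocks.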

Key findings

Reducing the frame rate during "not signing" segments (to as low as 1 fps, and even to a 0 fps freeze-frame) produced no statistically significant drop in any intelligibility, ease, annoyance, or willingness-to-use rating, while saving roughly 13–27% of the bit rate and a comparable share of processor cycles. Participants did note anecdotally that the 0 fps freeze-frame felt unnatural, because it left them unsure whether the connection had been lost, so the authors recommend 1 or 5 fps as a practical floor that preserves some backchannel feedback. Conversely, raising the frame rate during fingerspelling was *not* the win the authors expected. Participants strongly disliked any encoding with 5 fps for the signing portion (Q3 difficulty p<0.01, Q5 willingness p<0.01) and indicated they would rather have a uniformly higher rate than a boost only during fingerspelling. When the base rate was sufficient (≥10 fps), bumping fingerspelling to 15 fps had only marginal benefit. For the activity-detection problem, all three methods classified frames at 86–91% accuracy on a held-out video; the confidence-weighted combination of differencing and SVM was best on every test video, achieving 87.2–91.1% correctness with false-negative rates of 1.3–8.0%. False negatives (signing frames misclassified as not-signing) are flagged as the more harmful error type because the frame rate is then lowered while the user is actually signing, so frames that carry content are dropped.
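To make the reported metrics concrete, the following sketch scores a per-frame classifier against ground-truth labels. It assumes that accuracy and error rates are expressed as a share of all frames; the summary does not state the denominator the paper uses, so that choice, along with the function name and the toy labels, is an assumption.

```python
from typing import Dict, Sequence


def evaluate(predicted: Sequence[bool], actual: Sequence[bool]) -> Dict[str, float]:
    """Score per-frame predictions, where True means "signing".

    Rates are computed over all frames (an assumption; the paper's
    exact denominator is not given in this summary).
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(actual)
    pairs = list(zip(predicted, actual))
    correct = sum(1 for p, a in pairs if p == a)
    # False negative: a signing frame labelled not-signing (the harmful case,
    # because the frame rate would be lowered while content is being signed).
    false_neg = sum(1 for p, a in pairs if a and not p)
    # False positive: an idle frame labelled signing (wastes power only).
    false_pos = sum(1 for p, a in pairs if p and not a)
    return {
        "accuracy": correct / n,
        "false_negative_rate": false_neg / n,
        "false_positive_rate": false_pos / n,
    }


# Toy labels for ten frames (made up for illustration, not from the paper).
actual = [True, True, True, False, False, False, True, True, False, False]
predicted = [True, False, True, False, False, True, True, True, False, False]
print(evaluate(predicted, actual))
# {'accuracy': 0.8, 'false_negative_rate': 0.1, 'false_positive_rate': 0.1}
```

Reporting the two error types separately matters here because they have asymmetric costs: a false positive only spends extra bits and battery, while a false negative removes frames the viewer needs.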

Relevance

For accessibility practitioners working on real-time video communication for Deaf users, and more broadly for any team designing low-bandwidth or low-power video systems, this paper is a foundational demonstration that the linguistic and conversational structure of sign language can be exploited to lower resource consumption without losing intelligibility, provided the right activity is being preserved at the right frame rate. The general design principle (treat activity-aware variable encoding as a first-class accessibility consideration, not as a video-engineering afterthought) remains directly relevant to today's video relay services (VRS) and mobile sign-language video apps. The paper is also a useful methodological example: a small but well-controlled Wizard-of-Oz user study with native signers, paired with three computer-vision activity classifiers built from real H.264 motion vectors. Limitations the authors flag are honest and worth taking forward: only six female participants, only one signer in the activity-detection videos, a stationary camera and uniform background that flatter the differencing baseline, no temporal modelling (which the authors propose addressing with HMMs), and a fingerspelling corpus that was heavily context-supported and may not generalise to unfamiliar names or technical terms.

Tags: American Sign Language · sign language · fingerspelling · Deaf community · video phone · mobile accessibility · MobileASL · video compression · frame rate · turn-taking · support vector machine · computer vision · low power · multimedia accessibility · deaf and hard of hearing