Design and Evaluation of Classifier for Identifying Sign Language Videos in Video Sharing Sites

Caio D.D. Monteiro, Ricardo Gutierrez-Osuna, Frank M. Shipman · 2012 · Proceedings of the 14th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2012) · doi:10.1145/2384916.2384950

Summary

This paper presents the design and evaluation of a video classifier that automatically distinguishes sign language (SL) videos from non-SL videos on video sharing sites such as YouTube. Deaf and hard-of-hearing users currently have to rely on tags, titles, or other metadata to find SL content, but these are often inaccurate or absent: a YouTube search for "sign language" returns many irrelevant results, including songs with the phrase in their title, videos about sign language research, and videos referencing "language in signs." The classifier uses dynamic background modeling (a running average with a learning rate of 0.04) to separate foreground movement from the background, combined with Haar-feature-based face detection. From the foreground motion relative to the detected face, five video features are extracted: VF1 (quantity of motion: average number of foreground pixels per frame), VF2 (spatial distribution of motion: percentage of pixels that are in the foreground for at least one frame), VF3 (continuity of motion: frame-to-frame foreground difference), VF4 (symmetry of motion relative to the face center), and VF5 (percentage of frames with non-facial movement). These features were designed to distinguish signing from other human motion, such as that of gesturing politicians, weather forecasters, dancers, or mimes. A Support Vector Machine (SVM) classifier was trained on these features.
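The background modeling and feature extraction described above can be sketched roughly as follows. This is an illustrative Python/NumPy sketch, not the authors' implementation. Only the 0.04 learning rate comes from the paper. The foreground threshold (`THRESHOLD`), the fixed `face_box` input that stands in for Haar-based face detection, and the concrete formulas used for the five features are assumptions, since the summary gives only informal definitions.

```python
import numpy as np

ALPHA = 0.04       # learning rate of the running average, as stated in the summary
THRESHOLD = 30.0   # assumed intensity difference that counts as foreground (not from the paper)


def foreground_masks(frames, alpha=ALPHA, threshold=THRESHOLD):
    """Return one boolean foreground mask per grayscale frame."""
    frames = np.asarray(frames, dtype=np.float64)
    background = frames[0].copy()
    masks = []
    for frame in frames:
        masks.append(np.abs(frame - background) > threshold)
        # Running average: the background drifts slowly toward each new frame.
        background = (1.0 - alpha) * background + alpha * frame
    return np.stack(masks)


def motion_features(masks, face_box):
    """Rough stand-ins for VF1..VF5.

    face_box is a hypothetical (top, bottom, left, right) tuple, assumed
    fixed for the whole clip; a real system would detect it per frame.
    """
    top, bottom, left, right = face_box
    n_frames, height, width = masks.shape
    center = (left + right) // 2

    vf1 = masks.mean()              # average fraction of foreground pixels per frame
    vf2 = masks.any(axis=0).mean()  # fraction of pixels that are ever foreground
    # Fraction of pixels whose foreground state changes between consecutive frames.
    vf3 = np.logical_xor(masks[1:], masks[:-1]).mean() if n_frames > 1 else 0.0

    # Symmetry: overlap between the region left of the face center and the
    # mirrored region to its right (intersection over union).
    half = min(center, width - center)
    left_part = masks[:, :, center - half:center]
    right_part = masks[:, :, center:center + half][:, :, ::-1]
    union = np.logical_or(left_part, right_part).sum()
    vf4 = np.logical_and(left_part, right_part).sum() / union if union else 0.0

    # Fraction of frames with any foreground pixel outside the face box.
    non_face = masks.copy()
    non_face[:, top:bottom, left:right] = False
    vf5 = non_face.any(axis=(1, 2)).mean()

    return np.array([vf1, vf2, vf3, vf4, vf5], dtype=np.float64)
```

Given a clip as an array of grayscale frames, `motion_features(foreground_masks(frames), face_box)` yields a five-element vector that could then be passed to an SVM. Measuring symmetry as intersection over union is one plausible reading of "symmetry of motion relative to the face center"; the paper may define it differently.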

Key findings

Evaluated on a challenging test collection of 192 videos (98 SL videos in ASL and British Sign Language, and 94 non-SL videos deliberately chosen as likely false positives), the SVM classifier achieved 82% precision and 90% recall with all five features. The most discriminating single feature was VF4 (symmetry of motion relative to the face), which alone achieved 75.95% precision and 83.69% recall (F1 = 0.80), outperforming the other four features combined. This makes intuitive sense: many signs are made with both hands in symmetric positions relative to the body, which creates a distinctive symmetric motion pattern. Removing VF4 dropped precision by 9% and recall by 12%. The other four features carried overlapping information, and removing any one of them had no strong effect. The classifier performed well even with small training sets (15 videos per class yielded >81% precision and >86% recall). Performance was expected to be considerably higher on real video sharing sites, because the test collection was intentionally constructed from likely false positives. Failure cases included presenters gesturing while facing the camera, signers too far from the camera for face detection, signing in front of busy backgrounds, and backgrounds matching the signer's skin tone.
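As a quick check on the reported figures, F1 is the harmonic mean of precision and recall. The short Python sketch below recomputes it from the numbers quoted above; the function name `f1_score` is illustrative and is not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# VF4 alone: 75.95% precision, 83.69% recall.
vf4_only = f1_score(0.7595, 0.8369)

# All five features: 82% precision, 90% recall.
all_features = f1_score(0.82, 0.90)

print(round(vf4_only, 2), round(all_features, 2))  # prints 0.8 0.86
```

The VF4-only value agrees with the F1 of 0.80 quoted above. The 0.86 for all five features is derived here from the rounded precision and recall, so it is an approximation and not a figure reported in the summary.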

Relevance

This paper addresses a fundamental information access problem for the deaf community: finding sign language content among the vast collections on video sharing platforms. Among deaf individuals, particularly those who grew up with ASL as their first language, the median reading comprehension is at a 4th-grade level, which makes written English content on the internet difficult to access. Sign language video is therefore a critical information resource, but it is buried among millions of videos with no reliable way to filter for it. For accessibility practitioners, this work highlights that content discoverability is itself an accessibility challenge: making content accessible is pointless if users cannot find it. The symmetry-of-motion finding provides an elegant, language-independent feature that could work across different sign languages (ASL, BSL, etc.). While modern deep learning approaches would likely outperform this SVM-based classifier, the underlying problem of SL video identification and the feature engineering insights remain relevant as video platforms continue to grow.

Tags: sign language · deaf community · video classification · computer vision · machine learning · SVM · video sharing · YouTube · ASL · British Sign Language · content discovery