Identifying Sign Language Videos in Video Sharing Sites

Frank M. Shipman, Ricardo Gutierrez-Osuna, Caio D. D. Monteiro · 2014 · ACM Transactions on Accessible Computing · doi:10.1145/2579698

Summary

This paper addresses the challenge of finding sign language videos within general video sharing platforms like YouTube. While these platforms contain growing libraries of sign language content created by deaf community members, locating this content is difficult because text-based search depends on accurate tagging and metadata. The researchers first quantified the problem by searching YouTube for sign language videos on the top 10 news topics of 2011, finding that only 42% of results for queries including "ASL" or "sign language" were actually videos in sign language. More than half were false positives, including music videos with "sign language" in their titles, videos about sign language recognition, and content tagged with ambiguous acronyms (ASL also means "age/sex/location"). To address this, the researchers developed a video-based classifier using computer vision techniques. The system uses background modeling to isolate moving objects and face detection to locate the signer, then extracts five visual features from the video frames: total activity, spread of activity, continuity of motion, symmetry of motion relative to face position, and amount of non-facial movement.
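As a rough illustration of how five such features might be computed, the sketch below assumes that background modeling has already produced one binary foreground mask per frame and that face detection has produced a single face box. The function name `motion_features`, the dictionary keys, and every formula are hypothetical simplifications chosen for illustration; they are not the paper's definitions.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]  # one binary foreground mask per frame: 1 = moving pixel


def motion_features(masks: List[Mask],
                    face_box: Tuple[int, int, int, int]) -> Dict[str, float]:
    """Illustrative stand-ins for the five features (not the paper's formulas).

    face_box is (left, top, right, bottom) in pixels, right/bottom exclusive.
    """
    left, top, right, bottom = face_box
    height, width = len(masks[0]), len(masks[0][0])
    face_center_x = (left + right) / 2.0

    per_frame_counts = []                       # foreground pixels in each frame
    ever_active = [[0] * width for _ in range(height)]
    left_count = right_count = outside_face = 0

    for mask in masks:
        count = 0
        for y, row in enumerate(mask):
            for x, value in enumerate(row):
                if not value:
                    continue
                count += 1
                ever_active[y][x] = 1
                # Compare the pixel center with the vertical line through the face.
                if x + 0.5 < face_center_x:
                    left_count += 1
                else:
                    right_count += 1
                if not (left <= x < right and top <= y < bottom):
                    outside_face += 1
        per_frame_counts.append(count)

    total = sum(per_frame_counts)
    pixels = height * width
    return {
        # Mean fraction of the frame that is moving.
        "total_activity": total / (pixels * len(masks)),
        # Fraction of the frame that moved at least once.
        "spread": sum(map(sum, ever_active)) / pixels,
        # Fraction of frames that contain any motion.
        "continuity": sum(1 for c in per_frame_counts if c > 0) / len(masks),
        # 1.0 when motion is balanced on both sides of the face, 0.0 when one-sided.
        "symmetry": 1.0 - abs(left_count - right_count) / total if total else 0.0,
        # Share of moving pixels that fall outside the face box.
        "non_facial": outside_face / total if total else 0.0,
    }
```

In practice the masks and face box would come from a computer vision library, and the resulting five-element vector per video would be passed to a classifier.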

Key findings

An SVM classifier using all five visual features achieved 82% precision and 90% recall when tested against a challenging collection of 192 videos that included deliberately selected likely false positives (news presenters, weather forecasters, and others who gesture frequently). The most discriminative single feature was VF4, the symmetry of motion relative to the face: used alone, it reached 76% precision and 84% recall, outperforming the other four features combined. This finding reflects the characteristic bilateral hand movements in signing that differ from typical one-handed gesturing. The classifier required only 15 training examples per class to achieve strong performance. When applied as a filter to text-based search results, the video classifier could improve precision from 42% to an estimated 75%. Classification failures occurred with poor illumination, signers positioned far from the camera, busy backgrounds, and backgrounds matching the signer's skin tone.
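The arithmetic behind these figures can be sketched as follows. Precision and recall follow their standard definitions; the `filtered_precision` helper is a hypothetical model of how filtering text-search results with a classifier changes precision. The summary does not say how the 75% estimate was derived, and the 20% false positive rate and the confusion counts used below are assumed values for illustration only, not figures from the paper.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)


def filtered_precision(base_precision: float,
                       recall: float,
                       false_positive_rate: float) -> float:
    """Precision of search results after keeping only classifier-accepted videos.

    base_precision: share of raw results that really are sign language videos.
    recall: share of true sign language videos the classifier accepts.
    false_positive_rate: share of other videos the classifier wrongly accepts.
    """
    kept_true = base_precision * recall
    kept_false = (1.0 - base_precision) * false_positive_rate
    return kept_true / (kept_true + kept_false)


# Hypothetical confusion counts, chosen only to illustrate the definitions.
precision, recall = precision_recall(tp=9, fp=2, fn=1)

# 42% base precision and 90% recall come from the summary;
# the 20% false positive rate is an assumed value.
estimate = filtered_precision(0.42, 0.90, 0.20)
```

Under these assumptions the filtered precision lands in the mid-70s percent range, the same neighborhood as the reported estimate; a different false positive rate would shift it accordingly.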

Relevance

This research demonstrates how computer vision can improve information access for the deaf and hard of hearing community by making sign language content more discoverable. The approach is notable for focusing on detection rather than translation, a simpler problem that provides immediate practical value. For video platforms, implementing such a classifier could enable sign language content filters, improving search experiences for deaf users without requiring perfect metadata from content creators. The findings also inform how sign language video should be recorded for optimal automated detection: good lighting, a plain background, and a signer positioned close to and clearly facing the camera. As video sharing continues to grow as a communication medium, automated techniques for identifying sign language content become increasingly important for ensuring deaf community members can find relevant content in their preferred language.

Tags: sign language · ASL · video classification · machine learning · computer vision · deaf and hard of hearing · video sharing · information retrieval · SVM