Real-Time Depth-Camera Based Hand Tracking for ASL Recognition

Brandon Taylor, Anind Dey, Daniel Siewiorek, Asim Smailagic · 2017 · Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS) · doi:10.1145/3132525.3134777

Summary

This demonstration paper validates the use of a publicly available real-time hand tracking algorithm (Sphere-Mesh) for recognizing American Sign Language (ASL) handshapes with a depth camera. Sign Language Recognition (SLR) has long been a motivating goal for high-precision hand tracking, but technological limitations have historically forced a trade-off between the resolution needed to track individual finger positions and the field of view required to capture the signer's full body. Only recent improvements in depth cameras and computing power have made it feasible to track fingers accurately over a large enough volume of space. The researchers at Carnegie Mellon University used the open-source Sphere-Mesh algorithm with an Intel SR300 depth camera to collect hand pose data from 10 participants performing the 24 static ASL alphabet handshapes (J and Z are excluded because they require motion). The algorithm estimates hand pose in real time using a 28-degree-of-freedom model that captures global position, orientation, and individual joint angles. Participants were seated at a desk, wore a yellow wristband for tracking, underwent manual calibration, and then performed each handshape as it was prompted on screen. The final hand pose of each recording sequence served as the feature vector, and simple naive Bayes classifiers were trained on these vectors to recognize the handshapes.
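To make the classification step concrete, here is a minimal sketch of a naive Bayes classifier over pose vectors. It is not the authors' implementation: the summary does not say which naive Bayes variant was used, so the Gaussian likelihood, the variance floor `var_floor`, the function names, and the toy 3-dimensional data are all assumptions made for illustration. Real feature vectors would presumably have one entry per pose parameter of the 28-degree-of-freedom model.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels, var_floor=1e-3):
    """Estimate a log prior plus per-feature mean and variance for each class.

    var_floor (an assumed smoothing constant) keeps variances away from zero
    when a class has very few training samples.
    """
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    total = len(samples)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        dims = len(rows[0])
        means = [sum(r[d] for r in rows) / n for d in range(dims)]
        variances = [
            sum((r[d] - means[d]) ** 2 for r in rows) / n + var_floor
            for d in range(dims)
        ]
        model[y] = (math.log(n / total), means, variances)
    return model


def predict(model, x):
    """Return the label with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for y, (log_prior, means, variances) in model.items():
        score = log_prior
        # Naive assumption: features are independent given the class,
        # so per-feature Gaussian log likelihoods simply add up.
        for xi, m, v in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = y, score
    return best_label


# Toy data: two handshapes, two samples each, 3-dimensional "poses".
train_x = [[0.1, 0.2, 0.0], [0.2, 0.1, 0.1], [1.0, 1.1, 0.9], [0.9, 1.0, 1.2]]
train_y = ["A", "A", "B", "B"]
model = fit_gaussian_nb(train_x, train_y)

print(predict(model, [0.15, 0.15, 0.05]))
print(predict(model, [1.0, 1.0, 1.0]))
```

The independence assumption is a poor fit for joint angles, which are physically coupled, and that is part of why such a classifier counts as a simple baseline.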

Key findings

Using a leave-one-subject-out cross-validation procedure, in which each participant in turn was treated as a previously unseen signer, the system achieved an average classification accuracy of 69.9% across the 24 static ASL alphabet signs. Per-user accuracy ranged from 54.2% to 83.3%, indicating substantial variation in how well the algorithm captured different signers' hand poses. This result is comparable to other state-of-the-art handshape classifiers, which have achieved 47% to 84% accuracy on similar ASL alphabet datasets, with the critical advantage that the Sphere-Mesh approach runs in real time. Previous comprehensive handshape classification efforts covering 77 to 82 handshapes achieved up to 84% accuracy, but either required 2.43 seconds of processing per frame or did not report latency, making them impractical for real-time applications. The study focused only on the 24 static alphabet handshapes, leaving expansion to the full set of 40-50 ASL handshapes, as well as integration of movement, facial expression, and body position parameters, to future work.
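The evaluation protocol can be sketched as follows. This is an illustrative outline of leave-one-subject-out cross-validation, not the authors' code: the record layout, the function names, and the nearest-mean classifier (a simple stand-in for whatever classifier is being evaluated) are assumptions, and the data are invented toy values.

```python
from collections import defaultdict


def nearest_mean_classifier(train_x, train_y):
    """Stand-in classifier: returns a predict function based on class centroids."""
    groups = defaultdict(list)
    for x, y in zip(train_x, train_y):
        groups[y].append(x)
    centroids = {
        y: [sum(col) / len(rows) for col in zip(*rows)]
        for y, rows in groups.items()
    }

    def predict(x):
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def leave_one_subject_out(records, make_classifier):
    """records: list of (subject_id, feature_vector, label) tuples.

    Returns {subject_id: accuracy obtained when that subject is held out},
    so every score reflects a signer the classifier never saw in training.
    """
    subjects = sorted({s for s, _, _ in records})
    per_subject = {}
    for held_out in subjects:
        train = [(x, y) for s, x, y in records if s != held_out]
        test = [(x, y) for s, x, y in records if s == held_out]
        predict = make_classifier([x for x, _ in train], [y for _, y in train])
        correct = sum(predict(x) == y for x, y in test)
        per_subject[held_out] = correct / len(test)
    return per_subject


# Toy data: 3 subjects, 2 handshapes, 2-dimensional features.
records = [
    ("p1", [0.0, 0.1], "A"), ("p1", [1.0, 0.9], "B"),
    ("p2", [0.1, 0.0], "A"), ("p2", [0.9, 1.1], "B"),
    ("p3", [0.2, 0.2], "A"), ("p3", [0.4, 0.4], "B"),  # p3 forms "B" atypically
]
scores = leave_one_subject_out(records, nearest_mean_classifier)
mean_accuracy = sum(scores.values()) / len(scores)
print(scores, round(mean_accuracy, 3))
```

In the toy data the third subject forms one handshape differently from the others, so that subject's held-out accuracy drops while the others stay perfect. Reporting the per-subject range alongside the mean, as the paper does, exposes exactly this kind of signer-to-signer variation.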

Relevance

This work represents a step toward practical, real-time sign language recognition systems, a technology that could significantly improve communication accessibility for deaf and hard-of-hearing individuals. The use of a publicly available, open-source algorithm and a consumer-grade depth camera makes the approach reproducible and potentially scalable. For accessibility practitioners, the key insight is that real-time hand tracking has matured to the point where even simple classifiers achieve meaningful recognition rates, which suggests that more sophisticated machine learning approaches could yield substantially better results. However, roughly 70% accuracy on just 24 static handshapes (out of the 40-50 used in ASL), with substantial per-user variation, shows how far the field still has to go before practical sign language translation is achievable. Real ASL involves dynamic movement, two-handed signs, facial grammar, and body positioning, all of which remain open challenges. The paper nonetheless validates depth-camera-based hand tracking as a viable foundation for future SLR research.

Tags: sign language recognition · hand tracking · computer vision · depth camera · machine learning · American Sign Language · fingerspelling · deaf and hard of hearing