Deep Learning Methods for Sign Language Translation

Tejaswini Ananthanarayana, Priyanshu Srivastava, Akash Chintha, Akhil Santha, Brian Landy, Joseph Panaro, Andre Webster, Nikunj Kotecha, Shagan Sah, Thomastine Sarchet, Raymond Ptucha, Ifeoma Nwogu · 2021 · ACM Transactions on Accessible Computing · doi:10.1145/3477498

Summary

This study evaluates deep learning methods for translating sign language video directly into spoken or written text, without the intermediate step of gloss-based recognition (manual sign-for-sign transcription). The researchers systematically compare input feature extraction methods (OpenPose body, hand, and face keypoints; CNN embeddings from AlexNet and ResNet50; k-means pose clustering) combined with several neural machine translation approaches (a basic sequence-to-sequence LSTM, attention-based seq2seq, a transformer model, and training with reinforcement learning). The evaluation spans three sign language datasets: German Sign Language (GSL, weather forecasts), American Sign Language (ASL, storytelling), and Chinese Sign Language (CSL, standardized vocabulary). Sign language poses particular challenges for machine translation because it is a visual-spatial language with five grammatical parameters (handshape, location, palm orientation, body movement, and facial grammar) that have no direct equivalents in spoken language. The study emphasizes that this technology is intended to support communication, not to replace human interpreters.
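
To make the "k-means pose clustering" input concrete, here is a minimal sketch of one plausible reading: each frame's flattened keypoint vector is assigned to its nearest cluster centroid, so a video becomes a sequence of discrete cluster IDs. This summary does not describe the paper's actual clustering setup, so the function names, the toy 4-dimensional frames, the seeding strategy, and the choice of k are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (first index wins ties)."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm.

    Assumption: centroids are seeded with the first k points so the
    result is deterministic; real pipelines usually seed randomly.
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return centroids


def quantize(frames, centroids):
    """Map each frame's pose vector to a discrete cluster ID."""
    return [nearest_centroid(f, centroids) for f in frames]


# Hypothetical toy "video": each frame is two (x, y) keypoints, flattened.
frames = [
    [0.0, 0.0, 1.0, 1.0],
    [5.0, 5.0, 6.0, 6.0],
    [0.1, 0.0, 1.0, 1.1],
    [5.1, 5.0, 6.0, 6.1],
    [0.0, 0.1, 0.9, 1.0],
]
centroids = kmeans(frames, k=2)
tokens = quantize(frames, centroids)
```

The resulting `tokens` list alternates between two pose clusters and could be fed to a translation model like any other token sequence. A real system would cluster far higher-dimensional pose vectors with many more centroids.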

Key findings

On the controlled GSL dataset, the transformer model combined with ResNet50 or OpenPose input features performed best, scoring higher on BLEU-2 through BLEU-4 than every sequence-to-sequence variant. Ablation studies showed that hand keypoints are the most informative body features for translation, and that adding body and face landmarks to the hands significantly improves accuracy further. OpenPose features proved more robust than CNN features on the less controlled datasets (ASL, CSL) because they normalize for signer position relative to the camera. Among recurrent units, LSTM outperformed vanilla RNN and GRU at handling long-term dependencies. Reinforcement learning helped mitigate the "exposure bias" problem inherent in teacher-forcing training, in which the model sees ground-truth previous words during training but must rely on its own predictions at inference time; it improved BLEU scores by 1.5-5 points. A human oracle experiment showed that even expert ASL signers produced inconsistent captions for the same videos, highlighting how hard it is to collect ground truth for sign language and how much sign interpretation depends on context.
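
Because the findings are reported as BLEU-n scores, a small worked example may help in reading them. The sketch below computes sentence-level BLEU against a single reference with no smoothing: the geometric mean of clipped n-gram precisions up to order n, multiplied by a brevity penalty. It is an illustration of the metric's standard definition, not the paper's evaluation script. Published results are normally corpus-level, and the example sentences here are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with each
    n-gram's count clipped to its count in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU-max_n with one reference (0 to 1)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Invented example sentences in the style of a weather forecast.
reference = "tomorrow it will rain in the north".split()
candidate = "tomorrow it will rain in the south".split()

bleu2 = bleu(candidate, reference, max_n=2)
bleu4 = bleu(candidate, reference, max_n=4)
```

A single wrong word lowers BLEU-4 more than BLEU-2, because it breaks every higher-order n-gram that contains it. Scores are commonly reported multiplied by 100, which is the scale implied by "1.5-5 points" above.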

Relevance

This research moves automatic sign language translation toward practical deployment by demonstrating that direct sign-to-text translation is feasible without gloss annotation, which is a significant barrier to scaling to new sign languages. For practitioners, the key architectural insight is that transformer models with pose-based features (OpenPose) offer the best balance of accuracy and robustness across varied recording conditions. The finding that dataset quality heavily influences results underscores the need for controlled, consistently annotated sign language corpora. The work also highlights important limitations: current models still struggle with the linguistic diversity and contextual nuance of natural signing, and the low agreement among human signers interpreting the same videos suggests that automated translation should complement rather than replace human interpreters. Future work should explore larger datasets and more expressive input features that capture non-manual markers such as facial expressions.
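
The robustness of pose-based features to recording conditions can be illustrated with a simple normalization: express every keypoint relative to the signer's neck and divide by shoulder width, so the same pose yields the same features wherever the signer stands in the frame. This is a minimal sketch of one common approach, not the paper's documented preprocessing. The keypoint names and the dictionary layout are hypothetical.

```python
import math


def normalize_keypoints(keypoints):
    """Center keypoints on the neck and scale by shoulder width.

    `keypoints` maps a joint name to an (x, y) pixel coordinate.
    The names "neck", "right_shoulder", and "left_shoulder" are
    assumptions made for this sketch.
    """
    nx, ny = keypoints["neck"]
    rx, ry = keypoints["right_shoulder"]
    lx, ly = keypoints["left_shoulder"]
    shoulder_width = math.hypot(rx - lx, ry - ly)
    if shoulder_width == 0:
        raise ValueError("shoulders coincide; cannot scale the pose")
    return {name: ((x - nx) / shoulder_width, (y - ny) / shoulder_width)
            for name, (x, y) in keypoints.items()}


# The same pose recorded close to the camera...
near = {
    "neck": (100.0, 100.0),
    "right_shoulder": (80.0, 100.0),
    "left_shoulder": (120.0, 100.0),
    "right_wrist": (70.0, 160.0),
}
# ...and farther away, elsewhere in the frame (half size, shifted).
far = {
    "neck": (250.0, 100.0),
    "right_shoulder": (240.0, 100.0),
    "left_shoulder": (260.0, 100.0),
    "right_wrist": (235.0, 130.0),
}

near_norm = normalize_keypoints(near)
far_norm = normalize_keypoints(far)
```

After normalization the two recordings produce identical coordinates. Raw pixel inputs to a CNN offer no such invariance unless the network learns it from data.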

Tags: sign language · machine translation · deep learning · transformer · neural network · computer vision · Deaf and Hard of Hearing · sequence modeling · pose estimation