SignStreamNet: Streaming Sign Language Video-to-Text Translation for Accessibility

Warfa Ahmed · 2025 · Proceedings of the 27th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2025) · doi:10.1145/3663547.3759422

Summary

This paper introduces SignStreamNet, a hybrid neural network architecture designed to translate sign language video into written text in near real time. The system addresses a fundamental accessibility barrier: over 70 million Deaf and Hard-of-Hearing (DHH) people worldwide rely on sign languages whose grammar and spatial structure differ from those of spoken languages, yet most communication and media remain in spoken or written form. Automatic sign language translation (SLT) can help bridge this gap, but previous systems had to process an entire video before producing any output, which made them unsuitable for live communication. SignStreamNet combines two visual processing paths: a slow path using a 3D convolutional neural network (S3D) that captures motion and temporal dynamics by processing every fourth frame, and a fast path using a Swin Vision Transformer that extracts detailed spatial features from every frame. The two streams are merged through a learned gating mechanism that dynamically weighs the contribution of each path. The fused features then pass through a streaming Transformer encoder equipped with Monotonic Chunkwise Attention (MoChA), which lets the model begin producing text before the entire video has been received, a critical requirement for real-time use. An autoregressive Transformer decoder generates the final translated text. The model was evaluated on two benchmark datasets: the German Sign Language PHOENIX-2014T weather corpus (8,257 videos from 9 signers) and the Greek Sign Language (GSL) public service dialogue dataset (10,295 instances from 7 signers). Training was conducted separately for each language on NVIDIA L4 GPUs.
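The two-path sampling and gated fusion described in the summary can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the sigmoid gate parameterisation, the choice to start slow-path sampling at frame 0, and the assumption that slow and fast features are already aligned to the same length are all hypothetical, chosen only to show how a learned gate can weigh two feature streams.

```python
import math
from typing import List, Sequence

SLOW_STRIDE = 4  # the summary states the slow path processes every fourth frame


def slow_path_indices(num_frames: int, stride: int = SLOW_STRIDE) -> List[int]:
    """Frame indices seen by the slow path (assumption: sampling starts at frame 0)."""
    return list(range(0, num_frames, stride))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    slow: Sequence[float],
    fast: Sequence[float],
    gate_weights: Sequence[Sequence[float]],
    gate_bias: Sequence[float],
) -> List[float]:
    """Blend two equal-length feature vectors with a per-dimension gate.

    Hypothetical parameterisation (not taken from the paper):
        gate_i  = sigmoid(gate_weights[i] . [slow; fast] + gate_bias[i])
        fused_i = gate_i * slow_i + (1 - gate_i) * fast_i
    """
    concat = list(slow) + list(fast)
    fused = []
    for i, (s, f) in enumerate(zip(slow, fast)):
        logit = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(logit)  # g near 1 favours the slow path, near 0 the fast path
        fused.append(g * s + (1.0 - g) * f)
    return fused


if __name__ == "__main__":
    # A 64-frame clip: the fast path sees all 64 frames, the slow path a subset.
    print(len(slow_path_indices(64)), "slow-path frames out of 64")

    # With all-zero gate parameters the gate is 0.5, so the result is a plain average.
    zero_weights = [[0.0] * 4 for _ in range(2)]
    print(gated_fusion([1.0, 2.0], [3.0, 4.0], zero_weights, [0.0, 0.0]))
```

In a trained model the gate parameters would be learned jointly with the rest of the network, so the balance between motion and spatial features could shift from one input to the next; the zero-weight case above only shows the neutral starting point.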

Key findings

On the GSL dataset, SignStreamNet achieved a BLEU-1 score of 76.6, establishing a new state of the art and outperforming the previous best result of 75.46. On the more challenging PHOENIX-2014T dataset, the model scored 36.56 BLEU-1, which is lower than offline models such as TwoStream-SLT (54.32) and SignBT (51.11) but notable given the streaming constraint. Ablation studies showed that removing the 3D-CNN slow path dropped BLEU-1 from 36.56 to 28.0 (a loss of 8.56 points), confirming that motion features are essential. Disabling the streaming mechanism (allowing full-context attention) improved BLEU-1 by about 6.5 points, which quantifies the accuracy cost of real-time operation. Latency benchmarks showed that the model processes 64-frame clips in 53.98 ms on an NVIDIA L4 GPU (18.5 FPS) and 33.84 ms on an L40S GPU (29.6 FPS), well within real-time thresholds. An exploratory evaluation with DHH participants fluent in German Sign Language found the system responsive, though translation accuracy was inconsistent, particularly for complex sentences and fingerspelling. Common errors included misrecognition of fast-paced signs and of fingerspelled proper nouns.
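The latency and ablation figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in this summary; the function and variable names are illustrative. The summary does not say how the paper defines "FPS", but the reported values match the reciprocal of the per-clip latency, that is, 64-frame clips processed per second, so the per-frame rate computed here is a derived figure rather than one reported by the authors.

```python
CLIP_FRAMES = 64  # clip length used in the latency benchmark, per the summary
LATENCY_MS = {"L4": 53.98, "L40S": 33.84}  # per-clip latency in milliseconds


def clips_per_second(latency_ms: float) -> float:
    """How many clips can be processed back to back in one second."""
    return 1000.0 / latency_ms


def frames_per_second(latency_ms: float, clip_frames: int = CLIP_FRAMES) -> float:
    """Derived per-frame throughput, assuming every clip holds `clip_frames` frames."""
    return clip_frames * clips_per_second(latency_ms)


def relative_drop(baseline: float, ablated: float) -> float:
    """Fractional loss of the ablated score relative to the baseline."""
    return (baseline - ablated) / baseline


if __name__ == "__main__":
    for gpu, ms in LATENCY_MS.items():
        print(
            f"{gpu}: {clips_per_second(ms):.1f} clips/s, "
            f"{frames_per_second(ms):.0f} frames/s"
        )

    # Slow-path ablation on PHOENIX-2014T: BLEU-1 36.56 -> 28.0
    print(f"absolute drop: {36.56 - 28.0:.2f} BLEU-1 points")
    print(f"relative drop: {relative_drop(36.56, 28.0):.1%}")
```

Reading the rates as clips per second matters for interpreting the real-time claim: a system that finishes a 64-frame clip in about 54 ms is processing far faster than typical video frame rates, which is consistent with the "well within real-time thresholds" assessment.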

Relevance

This work represents an important step toward making live communication accessible for DHH individuals through automatic sign language translation. The streaming capability is the key differentiator: previous models achieved higher accuracy but required the full video before producing output, which made them impractical for real-time scenarios such as video calls or live captioning. The architecture is language-agnostic and was demonstrated on two different sign languages, suggesting potential for broader deployment. However, significant limitations remain: the model was trained on controlled studio recordings covering limited domains, accuracy drops notably on complex or open-domain content, and fingerspelling remains problematic. Real-world deployment would also require front-end signer detection and tracking. For practitioners, this paper illustrates both the promise and the current limitations of AI-driven sign language translation as an accessibility tool: it is useful for constrained domains but not yet reliable enough for critical communication.

Tags: sign language translation · deaf and hard of hearing · real-time translation · deep learning · computer vision · streaming neural networks · assistive technology