The FATE Landscape of Sign Language AI Datasets: An Interdisciplinary Perspective

Danielle Bragg, Oscar Koller, Mary Bellard, Larwan Berke, Patrick Boudreault, Annelies Braffort, Naomi Caselli, Matt Huenerfauth, Hernisa Kacorri, Tessa Verhoef, Christian Vogler, Meredith Ringel Morris · 2021 · ACM Transactions on Accessible Computing · doi:10.1145/3436996

Summary

This interdisciplinary paper examines the ethical landscape of AI datasets used for sign language recognition, generation, and translation technologies. Drawing on expertise from deaf community members, sign language linguists, and AI researchers, the authors apply the FATE framework (Fairness, Accountability, Transparency, Ethics) to analyze how sign language datasets are created, shared, and used. The paper addresses a critical gap: while AI technologies for sign language are rapidly advancing, there has been insufficient attention to the unique ethical considerations these datasets present. Sign language data is inherently biographic—video captures identity, location, and personal expression in ways text cannot—making privacy and consent particularly complex. The authors contextualize their analysis within the history of audism (discrimination based on hearing ability) and the deaf community's justified wariness of technologies developed without their input. They examine three categories of sign language AI: recognition (converting signing to text/speech), generation (producing signing via avatars), and translation (converting between signed and spoken languages). Each application type raises distinct FATE concerns, from perpetuating linguistic imperialism to enabling surveillance. The paper systematically analyzes dataset characteristics including content (signing format, labels, metadata), model performance implications, intended use cases, and five dimensions of ownership (physical, legal, monetary, cultural/linguistic, and perceived). Collection mechanisms—in-lab studies, remote participation, crowdsourcing, and social media scraping—each carry different FATE tradeoffs affecting who is represented and how consent is obtained.

Key findings

The research reveals that sign language AI datasets face FATE challenges distinct from those in other AI domains. Key findings include:
(1) Current datasets underrepresent the diversity of deaf signers: most are collected from hearing students or interpreters rather than native deaf signers, which skews AI models toward non-native signing patterns.
(2) The concept of "ground truth" labels is problematic for sign language, because there is no single correct transcription and labeling schemes often impose hearing-centric frameworks.
(3) Ownership of sign language data is complex and multidimensional; legal ownership may rest with collectors while cultural ownership belongs to deaf communities, creating tension when data is shared or commercialized.
(4) Collection mechanisms significantly affect representativeness: in-lab studies capture high-quality but narrow samples; crowdsourcing platforms often lack deaf signers; social media scraping raises consent issues.
(5) Sign language generation (avatars) carries a higher risk of community harm than recognition, as poor-quality generated signing could spread misinformation or mock deaf culture.
(6) Many deaf community members prefer human interpreters over current AI alternatives, and forcing AI solutions on them may constitute a form of linguistic imperialism.
(7) Transparency requires communicating in sign language, not just through written consent forms, yet most dataset collection uses English-only documentation.

Relevance

This paper is essential reading for anyone developing AI systems involving sign language or deaf users. It establishes that ethical sign language AI development requires meaningful deaf community involvement from the outset, not as research subjects but as partners with decision-making power. Practitioners should recognize that sign language data is not simply "video of gestures" but carries cultural significance and identity markers requiring heightened privacy protections. The five-dimension ownership framework (physical, legal, monetary, cultural/linguistic, and perceived) provides a practical tool for evaluating data governance. Organizations collecting sign language data should ensure consent processes are accessible in sign language, consider cultural and linguistic ownership alongside legal rights, and avoid collection mechanisms that underrepresent native deaf signers. The paper warns against "disability dongles": well-intentioned but unhelpful technologies created without community input. For accessibility professionals, the key takeaway is that technological capability does not equal community benefit; the deaf community's priorities and preferences must guide development decisions.
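The governance recommendations above could be turned into a simple documentation check. The following is a minimal sketch, not something the paper provides: the names DatasetRecord, Collection, and governance_gaps are hypothetical, and the 0.5 threshold for native deaf signer share is an arbitrary placeholder that a deaf-led governance process would need to set. It only illustrates how the five ownership dimensions, the consent format, and the collection mechanism might be recorded and reviewed together.

```python
from dataclasses import dataclass, field
from enum import Enum


class Collection(Enum):
    """Collection mechanisms named in the summary."""
    IN_LAB = "in-lab study"
    REMOTE = "remote participation"
    CROWDSOURCED = "crowdsourcing"
    SCRAPED = "social media scraping"


# The five ownership dimensions named in the summary.
OWNERSHIP_DIMENSIONS = (
    "physical", "legal", "monetary", "cultural_linguistic", "perceived",
)


@dataclass
class DatasetRecord:
    """Hypothetical documentation record for one sign language dataset."""
    name: str
    collection: Collection
    consent_in_sign_language: bool
    native_deaf_signer_share: float  # fraction of contributors, 0.0 to 1.0
    ownership: dict = field(default_factory=dict)  # dimension -> holder


def governance_gaps(record, min_native_share=0.5):
    """Return a list of open questions; an empty list means none were flagged.

    min_native_share is an assumed, illustrative threshold only.
    """
    gaps = []
    for dim in OWNERSHIP_DIMENSIONS:
        if not record.ownership.get(dim):
            gaps.append(f"ownership not documented: {dim}")
    if not record.consent_in_sign_language:
        gaps.append("consent materials not available in sign language")
    if record.collection is Collection.SCRAPED:
        gaps.append("scraped data: individual consent is unclear")
    if record.native_deaf_signer_share < min_native_share:
        gaps.append("native deaf signers may be underrepresented")
    return gaps


if __name__ == "__main__":
    record = DatasetRecord(
        name="example-corpus",  # made-up dataset name
        collection=Collection.CROWDSOURCED,
        consent_in_sign_language=False,
        native_deaf_signer_share=0.3,
        ownership={"legal": "collecting institution"},
    )
    for gap in governance_gaps(record):
        print(gap)
```

A checklist like this can only surface missing documentation; it cannot substitute for the community partnership the paper calls for, and an empty result does not by itself indicate that a dataset is ethically sound.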

Tags: sign language · AI datasets · deaf community · FATE framework · machine learning · sign language recognition · sign language generation · sign language translation · data ethics · accessibility research

Standards referenced: GDPR