Data Representativeness in Accessibility Datasets: A Meta-Analysis
Rie Kamikubo, Lining Wang, Crystal Marte, Amnah Mahmood, Hernisa Kacorri · 2022 · Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '22) · doi:10.1145/3517428.3544826
Summary
This paper presents a systematic meta-analysis of demographic representativeness in 190 accessibility datasets (datasets sourced from people with disabilities and older adults) spanning 1984 to 2021. The authors examine how age, gender, and race and ethnicity are represented across ten communities of focus: Autism, Cognitive, Developmental, Health, Hearing, Language, Learning, Mobility, Speech, and Vision. The datasets cover diverse data types, including audio, video, text, motion, image, logs, and sensing data. The researchers drew on IncluSet, a dataset-surfacing repository, and analyzed publicly available documentation, including academic publications, sharing sites, and supplementary files. Three annotators with varying backgrounds in accessibility and AI coded demographic metadata using both deductive and inductive approaches. The study is motivated by growing concerns that AI systems trained on unrepresentative data can produce unfair and discriminatory outcomes, and that this risk is amplified for people with disabilities, who are already marginalized. The authors position accessibility datasets as having a dual role: they can help mitigate AI bias by increasing disability representation, but they can also cause harm if they perpetuate existing demographic imbalances or enable surveillance and re-identification of disabled people.
Key findings
The analysis revealed mixed results across demographic dimensions. For age, accessibility datasets showed diverse representation, with older adults well represented overall: 48.3% of datasets with age information included at least one person aged 65 or older. However, datasets for the Autism, Developmental, and Learning communities notably lacked older adult representation, reflecting broader research gaps and discrimination at the intersection of disability and age. For gender, representation skewed toward men and boys (60.1%) across datasets that included gender information, with women and girls at 39.9%. This gap was especially pronounced in Autism and Developmental datasets, where the male-to-female ratio mirrored known diagnostic biases: current autism diagnostic criteria may fail to account for how autistic females "mask" traits, leading to underdiagnosis. Only 9 of the 190 datasets (about 5%) reported race or ethnicity information, making racial representation the weakest dimension of the analysis. Those that did report race used inconsistent, often binary categories such as "white" versus "non-white." The study also uncovered significant meta-level problems. First, 37.4% of datasets contained no demographic metadata at all. Second, gender was almost universally reported as a binary, with only one dataset including an "other" category. Third, the source of demographic labels (self-report vs. researcher inference) was rarely disclosed. Finally, the field had no standardized documentation practice.
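The reporting rates above are simple proportions over dataset-level metadata. The following is a minimal sketch of that arithmetic, not the authors' coding pipeline: the record fields (`age`, `gender`, `race_ethnicity`), the `reporting_rates` helper, and the toy records are all hypothetical and chosen only for illustration. The one figure taken from the summary is the 9 of 190 datasets that reported race or ethnicity.

```python
from typing import Dict, Iterable, Mapping, Optional

# Hypothetical field names for the three demographic dimensions discussed above.
DIMENSIONS = ("age", "gender", "race_ethnicity")


def reporting_rates(datasets: Iterable[Mapping[str, Optional[object]]]) -> Dict[str, float]:
    """Return the share of datasets that document each dimension,
    plus the share that document none of them (key "none")."""
    records = list(datasets)
    total = len(records)
    if total == 0:
        raise ValueError("need at least one dataset record")

    rates: Dict[str, float] = {}
    for dim in DIMENSIONS:
        # A dimension counts as reported when the field is present and not None.
        reported = sum(1 for r in records if r.get(dim) is not None)
        rates[dim] = reported / total

    # Datasets with no demographic metadata at all.
    undocumented = sum(
        1 for r in records if all(r.get(dim) is None for dim in DIMENSIONS)
    )
    rates["none"] = undocumented / total
    return rates


# Toy records (invented for illustration; not drawn from the paper's data).
toy = [
    {"age": "18-70", "gender": {"women": 4, "men": 6}, "race_ethnicity": None},
    {"age": None, "gender": None, "race_ethnicity": None},
    {"age": "65+", "gender": None, "race_ethnicity": None},
    {"age": "20-40", "gender": {"women": 5, "men": 5},
     "race_ethnicity": {"white": 7, "non-white": 3}},
]
rates = reporting_rates(toy)

# The one figure from the summary: 9 of 190 datasets reported race or ethnicity.
race_share_percent = round(9 / 190 * 100, 1)
```

Note that the denominators differ between findings: the race and ethnicity figure is a share of all 190 datasets, while the age and gender percentages are computed only over datasets that reported that dimension.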
Relevance
This research has direct implications for anyone building AI-infused accessibility tools or collecting data from disabled populations. The findings demonstrate that accessibility datasets, which are essential for training AI systems that work well for disabled users, carry the same demographic biases found in broader AI datasets, compounded by disability-specific risks around privacy, re-identification, and surveillance. For practitioners, the key takeaway is that increasing disability representation in AI training data is necessary but insufficient; attention must also be paid to intersectional representation across age, gender, race, and other dimensions within disability communities. The near-absence of race and ethnicity data is particularly alarming, because it means the field cannot even assess whether accessibility AI systems perform equitably across racial groups. The authors' recommendations for participatory data stewardship, which involves disabled data contributors in decisions about how their data is collected, maintained, and shared, offer a practical framework for more ethical dataset practices. The study also highlights that binary gender categories in datasets can harm nonbinary disabled people through technology-enabled misgendering, and it urges researchers to adopt self-identification and nonbinary options.
Tags: AI fairness · datasets · representation · diversity · inclusion · machine learning · intersectionality · data ethics · aging