Sharing Practices for Datasets Related to Accessibility and Aging

Rie Kamikubo, Utkarsh Dwivedi, Hernisa Kacorri · 2021 · The 23rd International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2021) · doi:10.1145/3441852.3471208

Summary

This paper presents a systematic review of 137 accessibility datasets collected from people with disabilities and older adults over a 35-year period (1984-2020). The authors undertook an extensive two-year search process using a multilayer strategy: open searches on search engines, focused searches in repositories and publication venues (including Kaggle, UCI, VisualData, and major digital libraries like ACM and IEEE), and targeted searches on authors known to share datasets. The datasets were manually coded across multiple dimensions including communities of focus, data types, sample sizes, collection purposes, terminology used to describe contributors, and sharing practices. The authors organized datasets into ten community groups: Autism, Cognitive, Developmental, Health, Hearing, Language, Learning, Mobility, Speech, and Vision. The Hearing group (predominantly Deaf and hard-of-hearing communities contributing sign language data) had the most datasets, followed by Cognitive and Mobility groups. Surprisingly, the Vision community, which has received disproportionate attention in accessibility research overall, was not well represented in available datasets. The review reveals that accessibility datasets are difficult to locate, scattered across personal websites, footnotes, and supplementary materials rather than centralized repositories. To address this discoverability problem, the authors launched IncluSet, a data surfacing repository that uses Google Schema markup to make accessibility datasets findable through broader search engines without requiring researchers to upload their data.
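To make the dataset-surfacing idea concrete, here is a minimal sketch of what a machine-readable dataset record could look like. It assumes the markup follows the public schema.org `Dataset` vocabulary serialized as JSON-LD; the summary does not specify IncluSet's actual fields, so the helper names, the example entry, and the `example.org` URL are all hypothetical.

```python
import json


def dataset_jsonld(name, description, url, keywords):
    """Build a schema.org Dataset record as a JSON-LD dictionary.

    The choice of fields is an assumption for illustration; it is not
    taken from IncluSet itself.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
    }


def as_script_tag(record):
    """Wrap the record in the script tag that crawlers read from a web page."""
    payload = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Hypothetical entry: the record points to where the data lives,
# so nothing has to be uploaded to the repository itself.
example = dataset_jsonld(
    name="Example sign language video dataset",
    description="Hypothetical entry used only to illustrate the markup.",
    url="https://example.org/datasets/sign-language-videos",
    keywords=["accessibility", "sign language", "Hearing"],
)

print(as_script_tag(example))
```

The record only describes and links to the data, which matches the summary's point that researchers do not need to upload their datasets for them to become findable.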

Key findings

Of the 137 datasets analyzed, only 52 (38%) can be directly downloaded, 27 (20%) are available upon request, and 58 (42%) show no clear sharing intent or information. The most common sample size was just 3 participants, with a median of 20 contributors across datasets from communities of focus. Seven datasets had more than 1,000 participants, typically involving remote data collection through apps or assistive technologies deployed in the real world. Data types varied significantly: text annotations appeared in 94 datasets, video in 48, audio in 39, motion data in 35, logs in 26, images in 23, and sensing data in 23. These counts sum to more than 137 because a single dataset can include several data types. Only 49 of 137 datasets (36%) reported ethics board clearance, and publicly available, downloadable datasets showed higher rates of IRB reporting than non-shared ones. The paper highlights a critical tension: the same datasets collected to mitigate bias or support assistive technology could be weaponized to detect disabilities, creating discrimination risks in healthcare and employment. Communities with "invisible disabilities" (Developmental, Learning) had particularly low rates of data sharing, likely reflecting heightened sensitivity around disclosure. Terminology used to describe data contributors was inconsistent and sometimes problematic, with terms like "high functioning" appearing in autism research and ambiguity around whether sign language "signers" were actually Deaf.
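As a quick consistency check on the sharing figures, the sketch below recomputes the rounded percentages from the counts reported above. Only the counts come from the summary; the variable and function names are illustrative.

```python
TOTAL_DATASETS = 137

# Counts as reported in the key findings.
sharing = {
    "direct download": 52,
    "available upon request": 27,
    "no clear sharing intent or information": 58,
}
IRB_REPORTED = 49


def share_percent(count, total=TOTAL_DATASETS):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


# The three sharing categories partition the full set of datasets.
assert sum(sharing.values()) == TOTAL_DATASETS

percentages = {label: share_percent(count) for label, count in sharing.items()}
irb_percent = share_percent(IRB_REPORTED)

for label, pct in percentages.items():
    print(f"{label}: {pct}%")  # 38%, 20%, 42%
print(f"ethics board clearance reported: {irb_percent}%")  # 36%
```

The three sharing categories add up to all 137 datasets, and the rounded shares match the 38%, 20%, and 42% quoted in the text.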

Relevance

This paper is essential reading for anyone working with AI and machine learning in accessibility contexts. It exposes a fundamental infrastructure problem: the datasets needed to build fair, inclusive AI systems are scarce, hard to find, and often poorly documented. For accessibility practitioners, the key takeaway is that data representation matters enormously for AI fairness. If datasets sourced from people with disabilities are small, inconsistent, or unavailable, the resulting AI models will perpetuate or amplify existing biases. The paper's discussion of privacy risks specific to disability communities is particularly important, as people with distinct data patterns from smaller populations face elevated re-identification risks. The ten-community framework provides a useful lens for auditing dataset coverage across disability groups. Organizations developing AI-powered assistive technologies should consider both the ethical obligations and practical challenges of collecting, documenting, and sharing accessibility data responsibly.
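One way to use the ten-community framework as an audit lens is sketched below. It assumes an organization keeps an inventory in which each dataset is tagged with one or more community groups; the inventory entries, the `audit_coverage` helper, and the `min_datasets` threshold are hypothetical and not part of the paper.

```python
from collections import Counter

# The ten community groups named in the summary.
COMMUNITY_GROUPS = [
    "Autism", "Cognitive", "Developmental", "Health", "Hearing",
    "Language", "Learning", "Mobility", "Speech", "Vision",
]


def audit_coverage(inventory, min_datasets=1):
    """Count datasets per community group and list under-covered groups.

    inventory: iterable of dicts, each with a "groups" list of group names.
    Returns (per_group_counts, groups_below_threshold).
    """
    counts = Counter()
    for entry in inventory:
        for group in entry["groups"]:
            if group not in COMMUNITY_GROUPS:
                raise ValueError(f"unknown community group: {group}")
            counts[group] += 1
    per_group = {group: counts[group] for group in COMMUNITY_GROUPS}
    gaps = [group for group, n in per_group.items() if n < min_datasets]
    return per_group, gaps


# Hypothetical inventory, for illustration only.
inventory = [
    {"name": "dataset-a", "groups": ["Hearing"]},
    {"name": "dataset-b", "groups": ["Hearing", "Speech"]},
    {"name": "dataset-c", "groups": ["Mobility"]},
]

per_group, gaps = audit_coverage(inventory)
print(per_group)
print("groups with no datasets:", gaps)
```

A raw count per group is only a starting point: as the key findings note, sample sizes, documentation, and sharing status vary widely, so a fuller audit would track those alongside the counts.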

Tags: datasets · machine learning · data sharing · privacy · ethics · bias mitigation · AI · disability data · aging · systematic review

Standards referenced: Individuals with Disabilities Education Act (IDEA) · Americans with Disabilities Act (ADA)