A11y-CUA Dataset: Characterizing the Accessibility Gap in Computer Use Agents

Ananya Gubbi Mohanbabu, Rosiana Natalie, Brandon Kim, Anhong Guo, Amy Pavel · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791896

Summary

A11y-CUA is an open dataset and benchmark designed to expose the accessibility gap in Computer Use Agents (CUAs): AI systems such as OpenAI Operator, Anthropic Computer Use, and Microsoft Copilot that operate computers by taking screenshots and performing mouse and keyboard actions. Because CUAs are designed to mirror sighted users, they do not reflect the workflows of blind and low-vision users (BLVUs), who rely on screen readers, magnifiers, and keyboard navigation. The result is a fundamental collaboration gap.

The dataset comprises 40.4 hours of interaction traces and 158,325 events collected from 16 participants (8 sighted users, 8 BLVUs) completing 60 everyday Windows tasks across five categories: browsing and web, system operations, document editing, workflow, and media. Tasks span single-application and multi-application workflows with verifiable end states, drawn from how-to sources and adapted from OSWorld.

To capture this data, the authors built an open-source computer use recorder that synchronizes multiple streams: screen video, system audio, OS-level input events (keystrokes, mouse, scroll), window and element context, UI Automation (UIA) snapshots, per-tab DOM and accessibility tree data from Chrome, and accessibility settings. The recorder generates replayable traces suitable for accessibility analysis, simulation, and agent benchmarking.

Beyond the dataset, the paper evaluates state-of-the-art CUAs (Claude Sonnet 4.5 and Qwen3-VL-32B) under three conditions: Default-CUA (full mouse and keyboard access), Screen-Reader-CUA (keyboard-only), and Magnifier-CUA (150% zoomed viewport). The two assistive technology (AT) conditions simulate the constraints real BLVUs operate under.
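As a rough illustration of what synchronizing several recorded streams into one replayable timeline can look like, the sketch below merges per-stream event lists by timestamp. It is not the authors' recorder: the `Event` fields, the stream names, and the sample payloads are all hypothetical, and the only assumption is that each stream is already ordered by time.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One recorded event. All fields are hypothetical, not the dataset's schema."""
    t: float       # seconds since recording start
    stream: str    # e.g. "keyboard", "mouse", "uia"
    payload: str   # free-form description of what happened


def merge_streams(*streams):
    """Merge event lists (each already sorted by time) into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.t))


# Made-up sample events for two streams.
keyboard = [Event(0.10, "keyboard", "Tab"), Event(0.90, "keyboard", "Enter")]
uia = [Event(0.15, "uia", "focus: Save button"), Event(0.95, "uia", "dialog closed")]

timeline = merge_streams(keyboard, uia)
for e in timeline:
    print(f"{e.t:.2f} {e.stream:9s} {e.payload}")
```

Interleaving input events with UI state changes in this way is what lets a trace show which accessibility feedback followed which keystroke.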

Key findings

Sighted users completed tasks with 99.16% success in an average of 92.35 seconds, overwhelmingly using mouse actions (51.93 mouse vs. 21.06 keyboard actions per task). BLVUs completed tasks with 84.6% success in an average of 211.18 seconds, a statistically significant difference on both measures, and used keyboard actions exclusively (179.45 per task, zero mouse). BLVUs showed three recurring navigation strategies: sequential walking via Tab/Arrow keys, chunk-jumping via Ctrl/Shift shortcuts, and ribbon routes via Alt/Win menus. BLVUs also exhibited a consistent verify-before-commit routine: waiting for screen reader feedback, then re-performing target actions to confirm outcomes.

CUA performance dropped dramatically under AT conditions. Claude Sonnet 4.5 achieved 78.3% success under Default-CUA but only 41.67% under Screen-Reader-CUA (keyboard-only) and 28.3% under Magnifier-CUA (zoomed viewport). Qwen3-VL-32B achieved 20% under Default-CUA and 0% under both AT conditions. Workflow tasks suffered worst (Default-CUA: 33.33% vs. BLVUs: 65.62%).

The authors characterize three gaps: (1) perceptual: CUAs lack access to screen reader announcements, ARIA state changes, or off-viewport content under magnification; (2) cognitive: weak tracking of task state and application context across multi-step workflows; (3) action: under-use of keyboard shortcuts and over-reliance on fragile drag-and-drop. CUAs frequently complete intermediate steps but omit final confirmation actions (Save, Submit), revealing weak end-state validation.
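To make the size of the drop concrete, the short sketch below computes the absolute and relative loss in success rate for Claude Sonnet 4.5, using only the three figures reported in this review. The helper name `gap_vs_default` is illustrative, and this calculation is not taken from the paper's own analysis.

```python
# Success rates (%) for Claude Sonnet 4.5 as reported in this review.
rates = {
    "Default-CUA": 78.3,
    "Screen-Reader-CUA": 41.67,
    "Magnifier-CUA": 28.3,
}


def gap_vs_default(rates, baseline="Default-CUA"):
    """Return {condition: (absolute drop in points, relative drop in %)}."""
    base = rates[baseline]
    return {
        name: (base - value, 100.0 * (base - value) / base)
        for name, value in rates.items()
        if name != baseline
    }


gaps = gap_vs_default(rates)
for name, (points, relative) in gaps.items():
    print(f"{name}: -{points:.2f} points ({relative:.1f}% relative drop)")
```

Under these figures, the keyboard-only condition loses a little under half of the default success rate and the magnifier condition loses close to two thirds.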

Relevance

This paper is critically relevant as AI agents move toward operating computers autonomously on behalf of users. If Computer Use Agents are deployed without accessibility awareness, they risk encoding sighted-user interaction styles as the default, leaving BLV users unable to monitor, collaborate with, or share control of these agents in the way sighted users can.

For accessibility practitioners, the paper provides three actionable takeaways. First, one interaction style does not serve everyone: agent design must accommodate keyboard-dominant, screen-reader-mediated workflows as first-class paths, not as degraded fallbacks. Second, the verify-before-commit routine that BLVUs consistently employ should be modelled into agent behaviour; agents that complete intermediate steps without final confirmation are unreliable for all users, not just BLV users. Third, the open dataset itself is a resource: evaluation teams can benchmark agent performance under AT conditions against real human traces rather than relying on simulated or synthetic baselines.

For organizations like CNIB, the research also has advocacy implications: the measurable performance gaps documented here (for Claude Sonnet 4.5, 78.3% success under default conditions vs. 28.3% under magnifier conditions) provide concrete evidence to demand accessibility evaluation as part of AI agent development and procurement.

Tags: blind and low vision · computer use agents · large language models · assistive technology · screen readers · magnifiers · dataset · benchmark · keyboard navigation · agentic AI