RAVEN: Realtime Accessibility in Virtual ENvironments for Blind and Low-Vision People

Xinyun Cao, Kexin Phyllis Ju, Chenglin Li, Venkatesh Potluri, Dhruv Jain · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791616

Summary

RAVEN is a generative-AI (GenAI) system that lets blind and low-vision (BLV) users query and modify 3D virtual environments in real time through natural language. Rather than relying on static, developer-defined accessibility features (such as fixed audio descriptions or color overlays), RAVEN shifts control to the user, who can issue arbitrary modification prompts such as "Make the sign text bigger," "Move the bench closer to me," or "Brighten the streetlamp." The system integrates four components: a text-to-speech self-voicing interface; a Dynamic Information Retriever that, on every prompt, builds an accessibility-augmented semantic scene graph encoding object positions, colors, sizes, audio properties, and egocentric spatial relations; a Prompt Constructor that fuses user input with accessibility-support and error-prevention instructions; and GROMIT, an open-source runtime behavior generation system that generates and executes Unity C# code to apply modifications. GPT-4o powers language and code generation. The design evolved iteratively from a pilot study with three BLV participants. Key refinements included removing keyboard shortcuts in favor of conversation-only interaction, replacing raw 3D coordinates with egocentric spatial language (in front of, to the left, behind), and adding prompt-engineering guardrails to handle hallucinations, vague requests, and out-of-scope queries gracefully. The system was evaluated in a primary study with eight BLV participants across three scenes of increasing complexity and in a preliminary usability study with six Unity developers.
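The move from raw 3D coordinates to egocentric spatial language can be illustrated with a minimal Python sketch. This is not RAVEN's implementation (which the summary describes as Unity C# generated at runtime); the names `SceneObject`, `egocentric_relation`, and `describe`, the 90-degree sectors, the Unity-style ground plane (x to the right, z forward, heading measured clockwise from +z), and the use of meters are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical scene-graph node: a name plus a ground-plane position."""
    name: str
    x: float  # assumed axis: world right
    z: float  # assumed axis: world forward


def egocentric_relation(user_x, user_z, heading_deg, obj):
    """Return (direction phrase, distance) of obj relative to the user.

    heading_deg is the user's facing direction: 0 means facing +z,
    and the angle grows clockwise toward +x (an assumed convention).
    """
    dx = obj.x - user_x
    dz = obj.z - user_z
    distance = math.hypot(dx, dz)

    # World bearing of the object, clockwise from +z.
    bearing = math.degrees(math.atan2(dx, dz))
    # Angle relative to where the user is facing, wrapped to [-180, 180).
    rel = (bearing - heading_deg + 180.0) % 360.0 - 180.0

    # Four 90-degree sectors; the thresholds are an illustrative choice.
    if -45.0 <= rel <= 45.0:
        phrase = "in front of you"
    elif 45.0 < rel < 135.0:
        phrase = "to your right"
    elif -135.0 < rel < -45.0:
        phrase = "to your left"
    else:
        phrase = "behind you"
    return phrase, distance


def describe(user_x, user_z, heading_deg, obj):
    """Format one object as a short spoken-style description."""
    phrase, distance = egocentric_relation(user_x, user_z, heading_deg, obj)
    return f"{obj.name}: {phrase}, {distance:.1f} m away"


if __name__ == "__main__":
    for item in (SceneObject("bench", 0.0, 5.0),
                 SceneObject("sign", -3.0, 0.0),
                 SceneObject("streetlamp", 0.0, -2.0)):
        print(describe(0.0, 0.0, 0.0, item))
```

A description such as "bench: in front of you, 5.0 m away" is the kind of phrasing a self-voicing interface could read aloud in place of a coordinate triple; a real system would also need to handle height and overlapping objects.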

Key findings

Of 336 valid prompts analyzed, 75.3% produced correct results, 22.0% failed (14 intent errors, where the LLM misunderstood user goals, and 60 technical errors involving hallucination or code failure), and 2.7% were correctly flagged as out of scope. Average response time was 3.1 seconds. Participants rated the system highly on confidence (M=4.1), intuitiveness (M=4.3), and usability (mean SUS=79.7, an A- grade). Among the prompt categories, Audio Volume and Object Location were the most used and valued; color-based modifications benefited low-vision users substantially but were of limited relevance to blind participants. Four emergent prompt categories appeared beyond the six pre-designed ones: Scene Description, Semantic Description, Functionality, and Creation/Deletion. Semantic descriptions (e.g., "Which cat seems happiest?") were frequently used and valued for probing higher-level scene qualities. Functionality and Creation/Deletion had low success rates, reflecting current system limits rather than a lack of user interest. Three groups of user goals emerged: Exploration, Execution, and Verification. Exploration was the entry point for nearly all tasks, underscoring how central building spatial awareness is for BLV users in unfamiliar environments. Unity developers found RAVEN learnable (M=4.7) and highly promising for accessibility (M=4.7) but raised concerns about LLM errors and the tagging burden at scale. A critical design insight: accessibility needs proved highly individualized across the visual-ability spectrum, reinforcing that static, developer-defined presets cannot substitute for dynamic, user-directed adaptation.
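The prompt-outcome percentages can be cross-checked against the stated error counts with a short Python sketch. Only the total (336), the two error counts (14 and 60), and the three percentages come from the summary; the count of 253 correct prompts is inferred here as the value consistent with 75.3% and is an assumption, not a reported figure.

```python
TOTAL_PROMPTS = 336        # stated in the summary
INTENT_ERRORS = 14         # stated in the summary
TECHNICAL_ERRORS = 60      # stated in the summary


def pct(count, total=TOTAL_PROMPTS):
    """Share of the total as a percentage, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


failed = INTENT_ERRORS + TECHNICAL_ERRORS

# Inferred, not reported: the correct-prompt count that matches 75.3%.
correct_inferred = 253
out_of_scope_inferred = TOTAL_PROMPTS - failed - correct_inferred

if __name__ == "__main__":
    print("failed:", failed, pct(failed))
    print("correct (inferred):", correct_inferred, pct(correct_inferred))
    print("out of scope (inferred):", out_of_scope_inferred, pct(out_of_scope_inferred))
```

Under that inference the three groups sum to 336 and reproduce the reported 75.3%, 22.0%, and 2.7%, so the figures in the paragraph above are internally consistent.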

Relevance

RAVEN represents a significant step toward personalized, user-directed accessibility in interactive 3D environments, which are increasingly important for gaming, education, training, and social participation. For accessibility practitioners, the system demonstrates both the promise and the current limits of LLM-driven approaches: natural language interaction genuinely empowers BLV users to adapt environments to their individual needs, but hallucination, unreliable verification, and trust remain significant barriers that require guardrails and human oversight. The finding that accessibility needs are highly individualized and cannot be met by static, developer-defined presets challenges a core assumption in assistive technology (AT) design. It argues for treating accessibility as an ongoing, conversational process rather than a one-time configuration. The conversational programming paradigm has potential well beyond 3D games: screen readers, web content, educational software, and productivity tools could benefit from similar approaches that let users dynamically modify accessibility behavior without developer intermediation, extending user agency in ways that fixed guidelines cannot anticipate.

Tags: blind and low vision · virtual environments · generative AI · 3D accessibility · natural language interaction · conversational programming · game accessibility · large language models · runtime accessibility