AI4XR: AI in Extended Reality for 3D Scene Editing and Accessibility Design

Junlong Chen · 2026 · Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26) — Doctoral Consortium · doi:10.1145/3772363.3799187

Summary

This CHI '26 Doctoral Consortium paper summarises Junlong Chen's PhD research at the University of Cambridge on integrating AI, specifically large language models (LLMs) and vision-language models (VLMs), into extended reality (XR) workflows. The research covers two complementary directions: AI-assisted 3D scene editing and object selection for sighted XR users, and AI-driven visual accessibility features for blind and low-vision (BLV) users. Chen formulates three research questions: (RQ1) what are the advantages and disadvantages of AI-assisted speech-based interfaces compared with traditional XR interaction; (RQ2) what usability issues arise when current AI workflows encode 3D scenes as structured text, and how can they be mitigated; and (RQ3) what qualities emerge when multimodal AI is deployed in immersive environments. The methodology is user-centered: each sub-project compares a research prototype with a baseline condition in a within-subjects study, measuring performance, usability, user experience, and workload (NASA-TLX). Three prototypes are described. AssistVR combines raycast pointing with speech-driven object selection and bulk editing, using Azure Conversational Language Understanding plus a downstream LLM. A follow-up study compares speech-and-pointing against a disocclusion minimap (DiscPIM) for selecting occluded objects across varying scene perplexity. EnVisionVR is a scene-interpretation tool for BLV users that uses a VLM to generate natural-language scene descriptions at preset 'anchor points'; it also provides an importance-ranked main-objects function and a distance-proportional beeping object-localization function.
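As an illustration of the distance-proportional beeping idea, the sketch below maps the distance to a target object onto the pause between beeps. It is not taken from the paper: the function name, the linear mapping, the direction of the mapping (shorter pauses when closer), and all constants are assumptions made for this example.

```python
def beep_interval(distance_m: float,
                  seconds_per_metre: float = 0.25,
                  min_interval_s: float = 0.1,
                  max_interval_s: float = 2.0) -> float:
    """Return the pause between beeps in seconds.

    Assumption: the pause grows linearly with distance, so beeps come
    faster as the user approaches the target. The result is clamped so
    the cue never becomes a continuous tone or goes silent for too long.
    """
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    interval = seconds_per_metre * distance_m
    return max(min_interval_s, min(max_interval_s, interval))


# Example: pauses for a user walking towards an object.
intervals = [beep_interval(d) for d in (8.0, 4.0, 1.0, 0.0)]
```

A real system would feed this value to an audio scheduler each time the tracked distance changes; the clamping bounds are the kind of parameter that user customisation could expose.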

Key findings

The scene-editing study with AssistVR (N=12) surfaced two dominant interaction strategies: incremental exploration, where users inspected and edited one object at a time, and bulk modification, where speech was used to select and edit groups of objects sharing properties (for example 'make all blue objects purple'). Bulk modification is difficult to perform with direct-manipulation controllers, which makes it a clear advantage of LLM-assisted XR. The occluded-object-selection study (N=24) found speech-and-pointing significantly faster than the disocclusion minimap when 2 or 4 targets shared referenceable colors/shapes (p<.05), with significantly lower task load overall (p<.05); however, users rated the minimap as more fun, intuitive, and engaging, and reported a better sense of direct-manipulation control with it. The EnVisionVR study with 12 BLV participants yielded four design guidelines. AI accessibility systems in XR should provide a hierarchy of description granularity (from high-level scene summaries down to object-level detail), use anchor-based VLM calls to keep latency low, maintain a consistent spatial reference frame across modalities, and allow user customisation of verbosity. Participants also highlighted the need for continuous audio cues and binaural/3D audio, in addition to speech descriptions, to convey spatial depth.
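The bulk-modification strategy can be sketched over a scene encoded as structured data. This is a minimal illustration, not AssistVR's implementation: the `SceneObject` fields, the `bulk_edit` helper, and the `match`/`changes` form of the parsed command are all hypothetical, and the speech recognition and LLM steps that would produce that parsed command are omitted.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    shape: str
    color: str


def bulk_edit(scene, match, changes):
    """Apply `changes` to every object whose attributes equal `match`.

    Returns the names of the edited objects so the interface can
    confirm to the user what was changed.
    """
    edited = []
    for obj in scene:
        if all(getattr(obj, key, None) == value for key, value in match.items()):
            for key, value in changes.items():
                setattr(obj, key, value)
            edited.append(obj.name)
    return edited


scene = [
    SceneObject("cube_1", "cube", "blue"),
    SceneObject("sphere_1", "sphere", "blue"),
    SceneObject("cube_2", "cube", "red"),
]

# Assumed structured form of the spoken command "make all blue objects purple".
edited = bulk_edit(scene, match={"color": "blue"}, changes={"color": "purple"})
```

One command edits every matching object, whereas a controller-based workflow would require selecting and recolouring each object in turn.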

Relevance

For practitioners working on immersive accessibility, this is a useful map of where AI-in-XR research is heading. The EnVisionVR strand demonstrates a viable architecture for VLM-based scene interpretation in VR that avoids the high latency of calling a vision model on every frame: pre-compute descriptions at a set of preset spatial anchors, then pick the nearest anchor at runtime. That pattern is portable to AR wayfinding, museum interpretation, and real-world scene readers on phones and smart glasses. The emphasis on a description hierarchy (scene summary → main objects → object localization) mirrors how sighted users visually scan a space, and is directly applicable to designing non-visual interfaces for 3D content. For standards and practice, the paper reinforces the case that accessible XR should target multi-granular, user-customisable scene descriptions rather than a single verbose caption. Limitations: the work is early-stage and was evaluated in controlled studies; real-world deployment of VLM descriptions in dynamic AR scenes remains an open problem, as do hallucination, privacy, and latency concerns.
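The anchor pattern described above (descriptions computed ahead of time, nearest anchor chosen at runtime) can be sketched as follows. This is an assumed minimal form, not EnVisionVR's code: the anchor coordinates, the placeholder descriptions, and the `nearest_description` helper are invented for illustration, and a real system would populate the table from offline VLM calls.

```python
import math

# Hypothetical anchor table: (x, y, z) position in metres -> description
# generated offline. The texts here are placeholders, not model output.
ANCHORS = {
    (0.0, 0.0, 0.0): "Placeholder description for the entrance.",
    (5.0, 0.0, 0.0): "Placeholder description for the centre of the room.",
    (5.0, 0.0, 5.0): "Placeholder description for the far corner.",
}


def nearest_description(user_pos, anchors=ANCHORS):
    """Return the precomputed description of the anchor closest to the user.

    The runtime cost is one distance comparison per anchor, with no
    model call, which is what keeps latency low.
    """
    closest = min(anchors, key=lambda anchor: math.dist(anchor, user_pos))
    return anchors[closest]


description = nearest_description((4.2, 0.0, 0.5))
```

A linear scan is adequate for a handful of anchors; a spatial index would only become worthwhile for much larger anchor sets. Because the descriptions are fixed in advance, this sketch does not address the dynamic-scene limitation noted above.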

Tags: extended reality · virtual reality · artificial intelligence · large language models · vision-language models · visual accessibility · blind and low vision · scene description · multimodal interaction · doctoral consortium