From Struggle to Success: Context-Aware Guidance for Screen Reader Users in Computer Use

Nan Chen, Jing Lu, Zilong Wang, Luna K. Qiu, Siming Chen, Yuqing Yang · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3790661

Summary

Chen, Lu, Wang, Qiu, Chen and Yang present AskEase, an NVDA add-on that delivers on-demand, step-by-step, screen-reader-friendly guidance for blind and low-vision computer users tackling unfamiliar desktop software. The work responds to a persistent problem: mainstream tutorials assume sight and mouse input, human help is rarely available in real time, and generic LLM assistants return visually oriented instructions ('click the button') that are unusable non-visually. Drawing on prior research and a participatory formative study with three experienced screen reader (SR) users, the authors derived five design goals (situation awareness, preference-aligned responses, actionable and correct guidance, seamless in-flow assistance, and accessible fine-grained interaction) and built AskEase around three core features: Contextual Q&A (free-form queries), Adaptive Support (targeted guidance when users get stuck on a step), and Screen Description (a rapid, structured summary of the active window and focus). The system is powered by a multimodal LLM (GPT-5) and uses 'context engineering' to combine three context types: environment context (a desktop screenshot with the focus highlighted by a red border, structured screen state from NVDA's API, and live SR gesture and speech traces), knowledge context (curated response-preference principles plus a RAG index over application documentation built with text-embedding-3-large, FAISS, and optional HyDE query expansion), and conversational context (chat history and the 'stalled step' the user is viewing when Adaptive Support is invoked). Interaction is non-modal and keyboard-driven, and the add-on restores the user's prior focus.
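The three context types described above can be pictured as one request payload. The following is a minimal sketch of that idea, not the authors' implementation: every class, field, and function name here (`EnvironmentContext`, `build_request`, and so on) is hypothetical, and the retrieval step is assumed to have already produced the documentation passages.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EnvironmentContext:
    screenshot_path: str   # screenshot with the focused element highlighted
    screen_state: dict     # structured state, e.g. window title, focused control
    sr_trace: list         # recent screen reader gestures and spoken output


@dataclass
class KnowledgeContext:
    principles: list       # response-preference principles
    doc_passages: list     # passages returned by a documentation retriever


@dataclass
class ConversationContext:
    history: list = field(default_factory=list)  # (role, text) pairs
    stalled_step: Optional[str] = None           # set only for Adaptive Support


def _section(title, lines):
    """Render one titled section as a bullet list (hypothetical layout)."""
    body = "\n".join(f"- {line}" for line in lines) if lines else "- (none)"
    return f"## {title}\n{body}"


def build_request(question, env, knowledge, conversation):
    """Merge the three context types into a text prompt plus an image reference."""
    state_lines = [f"{key}: {value}" for key, value in env.screen_state.items()]
    history_lines = [f"{role}: {text}" for role, text in conversation.history]
    parts = [
        _section("Response principles", knowledge.principles),
        _section("Screen state", state_lines),
        _section("Recent screen reader trace", env.sr_trace),
        _section("Documentation passages", knowledge.doc_passages),
        _section("Chat history", history_lines),
    ]
    if conversation.stalled_step is not None:
        parts.append(_section("Stalled step", [conversation.stalled_step]))
    parts.append(_section("User question", [question]))
    return {"text": "\n\n".join(parts), "image_path": env.screenshot_path}


# Illustrative values only; none of these come from the paper.
example = build_request(
    "How do I make this heading bold?",
    EnvironmentContext("screen.png", {"window": "Document1 - Word"}, ["NVDA+Tab"]),
    KnowledgeContext(["Keep steps concise."], ["Ctrl+B toggles bold."]),
    ConversationContext(stalled_step="Step 2: open the Home tab"),
)
```

Keeping the stalled step as an optional field means the same builder can serve both a free-form question and an Adaptive Support request; how AskEase actually structures its prompts is not specified in this summary.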

Key findings

In robustness testing on 45 tasks sampled from the Windows Agent Arena benchmark across 12 Windows applications, AskEase achieved a 96.6% overall success rate (40% of tasks solved with a single Contextual Q&A, rising to 96.6% within three Adaptive Support rounds); the two failures involved inherently non-keyboard operations (Paint drag, red-circle drawing). Average Contextual Q&A latency was 10.06 s at $0.0050 per query; Adaptive Support averaged 8.18 s at $0.0046. A within-subjects user study with 12 screen reader users (6F/6M, aged 21–44, with occupations ranging from developer to massage therapist to piano tuner) compared AskEase with participants' usual tools (search engines, Gemini, ChatGPT) on Microsoft Word and Excel tasks. With AskEase, participants completed an average of 1.5 of 2 tasks versus 0.5 with baselines (β = +1.00, p < 0.001); six completed both tasks with AskEase versus two with baselines. NASA-TLX scores showed significant reductions in physical demand (β = −1.25), effort (β = −1.42), and frustration (β = −1.58), and a significant improvement in perceived performance (β = +1.50). Participants' 67 logged questions included task-level (24), step-level (27), status-confirmation (7), and troubleshooting (5) questions. Participants specifically praised the non-visual descriptions, the step-by-step pacing, and the shortcut-driven access that preserved focus and eliminated window switching. Trust built gradually through repeated reliable responses. Intent understanding faltered on ambiguous prompts and domain terminology, and occasional hallucinations (nonexistent menu paths, wrong cell references) remain a limitation.
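As a rough illustration of what the latency and cost averages imply for a whole help session, the sketch below adds one Contextual Q&A to a given number of Adaptive Support rounds. It assumes the reported averages simply add up, which the paper does not claim; the function name `session_estimate` is hypothetical.

```python
# Averages reported in the robustness test.
QA_LATENCY_S = 10.06
QA_COST_USD = 0.0050
SUPPORT_LATENCY_S = 8.18
SUPPORT_COST_USD = 0.0046


def session_estimate(support_rounds):
    """Estimate total model latency (s) and cost (USD) for one Contextual Q&A
    followed by `support_rounds` Adaptive Support rounds.

    Assumption: per-call averages are additive and independent of the task.
    """
    if support_rounds < 0:
        raise ValueError("support_rounds must be non-negative")
    latency = QA_LATENCY_S + support_rounds * SUPPORT_LATENCY_S
    cost = QA_COST_USD + support_rounds * SUPPORT_COST_USD
    return round(latency, 2), round(cost, 4)


worst_case = session_estimate(3)  # one Q&A plus the maximum of three rounds
```

Under that assumption, a task that needs all three Adaptive Support rounds would involve about 34.60 s of model latency and about $0.0188 in total, excluding the time the user spends carrying out the steps.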

Relevance

This is a strong, concrete demonstration that LLM-powered assistive agents can meaningfully lower the adoption barrier for blind and low-vision users learning new desktop software, a problem that accessibility practice has long recognised but rarely addressed in real time. For designers of AI-powered assistive tools, the paper offers several transferable lessons: context engineering that combines live screen state, SR gesture/speech traces, and RAG-indexed documentation outperforms generic prompting; keyboard-invoked non-modal dialogs with Previous/Next Step navigation preserve task focus in ways mainstream chat UIs do not; response-preference principles should be enforced in the system instruction (concise, GUI-element-specific wording, no visual-only descriptions, standard terminology); and uncertainty should be surfaced explicitly rather than guessed. The paper is also a useful reminder that accessibility-by-design still matters: AskEase works best on applications that follow SR conventions and degrades on poorly labeled UIs. Limitations: the 12 participants were self-selected, LLM-curious users rather than SR novices or LLM skeptics; the system supports NVDA on Windows only (no JAWS or VoiceOver); the privacy implications of streaming screenshots and SR traces to cloud models are flagged but not solved; and the evaluation excluded software that participants were already proficient with. Code is released at github.com/microsoft/AskEase.
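The lesson about enforcing response-preference principles in the system instruction can be sketched as follows. This is an illustration written for this summary and is not taken from the released repository: the principle wording paraphrases the list above, and the pattern list and function names (`build_system_instruction`, `find_visual_wording`) are hypothetical. The regex check is a crude heuristic for spotting sight- or mouse-dependent phrasing in a draft response, not a technique the paper describes.

```python
import re

# Paraphrased from the principles named in the summary.
PRINCIPLES = [
    "Keep each step concise.",
    "Name the specific GUI element involved.",
    "Do not rely on visual-only descriptions such as colour or position.",
    "Use the application's standard terminology.",
    "State uncertainty explicitly instead of guessing.",
]

# Hypothetical patterns for mouse- or sight-dependent wording.
VISUAL_PATTERNS = [
    r"\bclick\b",
    r"\bdrag\b",
    r"\b(top|bottom)[- ](left|right)\b",
    r"\b(red|green|blue|grey|gray) (icon|button)\b",
]


def build_system_instruction(principles):
    """Turn the principles into a numbered system instruction."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(principles, start=1))
    return "You are guiding a screen reader user. Follow these principles:\n" + numbered


def find_visual_wording(response):
    """Return matched phrases that suggest visually oriented guidance."""
    hits = []
    for pattern in VISUAL_PATTERNS:
        hits.extend(
            match.group(0)
            for match in re.finditer(pattern, response, flags=re.IGNORECASE)
        )
    return hits


instruction = build_system_instruction(PRINCIPLES)
flagged = find_visual_wording("Click the red icon in the top-right corner.")
clean = find_visual_wording("Press Alt+H, then F, then S to reach Font Size.")
```

A check like this could only catch obvious cases; the system instruction would remain the main mechanism for steering the wording of responses.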

Tags: accessibility · screen readers · AI · LLM · assistive technology · help-seeking · context engineering · human-AI interaction · blind and low vision · NVDA · RAG · multimodal AI