GuideMe: A VLM-Based System Assisting Independent Smartphone Learning for Older Adults

Kairong Fang, Jiesi Zhang, Shi-Ting Ni, Pan Hui, Yuyang Wang · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791448

Summary

GuideMe tackles a problem familiar to anyone who has watched an older relative struggle to learn a new smartphone app: the combined weight of declining vision, memory, and motor function plus unfamiliar terminology makes independent learning extremely difficult, while in-person help is often unavailable, rushed, or comes with psychological baggage (fear of bothering family, shame, loss of dignity). The authors begin with a formative study of 16 Chinese older adults (aged 55-86) using semi-structured interviews, task-driven contextual observation across six common apps, and eye-tracking. They deconstruct what makes effective in-person instruction work - describing the problem, confirming intent through counter-questions, and pointing directly at the UI element - and identify three specific obstacles older adults face when learning alone: problem communication, visual search, and cognitive load. From these findings the authors derive three design implications and build GuideMe, an Android application that combines screenshot capture (Media Projection API) and the Accessibility Service UI tree with a Vision-Language Model (GPT-5). When the user asks a question by voice, the VLM analyses the multimodal context, asks clarifying questions to confirm intent, locates the relevant UI element, and then produces a semi-transparent overlay that 'hollows out' the target element - simulating the deictic gesture of a human helper. A follow-up user study (N=18, mean age 63.6) compared GuideMe against an LLM-powered search engine (Baidu) and a confederate in-person tutor, using a within-subjects Latin-square design and measuring task load (NASA-TLX), usability (UEQ-S), and objective metrics from screen recordings and eye-tracking.
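The 'hollowed-out' overlay described above can be illustrated with a little geometry. The sketch below is not the authors' implementation (GuideMe is an Android application, and the summary does not say how it draws the overlay); it only shows one plausible way to dim everything except a target element, by covering the screen with up to four semi-transparent rectangles around the target's bounds. The names Rect and hollow_out and the padding parameter are hypothetical.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle in screen pixels (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def area(self) -> int:
        return max(0, self.right - self.left) * max(0, self.bottom - self.top)


def hollow_out(screen: Rect, target: Rect, padding: int = 0) -> List[Rect]:
    """Return the rectangles to dim so that only the target stays clear.

    Assumption: the overlay is drawn as separate dimmed bands around a hole.
    A real overlay could instead draw one full-screen layer and punch the
    hole out of it; the summary does not specify which approach is used.
    """
    # Grow the hole by the padding, then clamp it to the screen.
    hole = Rect(
        max(screen.left, target.left - padding),
        max(screen.top, target.top - padding),
        min(screen.right, target.right + padding),
        min(screen.bottom, target.bottom + padding),
    )
    if hole.area == 0:
        return [screen]  # target is off-screen or empty: dim everything

    bands = [
        Rect(screen.left, screen.top, screen.right, hole.top),        # above
        Rect(screen.left, hole.bottom, screen.right, screen.bottom),  # below
        Rect(screen.left, hole.top, hole.left, hole.bottom),          # left
        Rect(hole.right, hole.top, screen.right, hole.bottom),        # right
    ]
    # Drop empty bands, e.g. when the target touches a screen edge.
    return [band for band in bands if band.area > 0]


screen = Rect(0, 0, 1080, 1920)
button = Rect(100, 200, 300, 260)  # made-up bounds for illustration
dimmed = hollow_out(screen, button)
```

The four bands never overlap, so their total area equals the screen area minus the hole. A target in a screen corner yields only two bands.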

Key findings

GuideMe roughly matched in-person instruction and substantially outperformed the AI search engine baseline. Task completion time averaged 56.5s with GuideMe vs. 82.9s with search vs. 48.7s in-person. UI searching time (eye-tracking-derived) was 1.26s with GuideMe vs. 4.23s with search vs. 0.81s in-person - in-situ highlighting cut visual search time by roughly 70% relative to search and brought it close to the in-person level. Incorrect clicks averaged 0.11 with GuideMe, 2.94 with search, and 0.55 in-person. NASA-TLX mental demand was 1.72 (GuideMe), 5.33 (search), 1.44 (in-person); on most UEQ-S and NASA-TLX dimensions GuideMe showed no significant difference from in-person help and was significantly better than search (Friedman test with Wilcoxon signed-rank pairwise comparisons, Bonferroni-corrected). Asking time was longer with GuideMe (26.3s) than with the search engine (18.3s), because the clarifying-question loop forced users to refine vague initial queries - a deliberate trade-off that produced better downstream outcomes. Eye-tracking revealed an 'inattentional blindness' pattern: in the independent condition, several participants' gaze swept over the correct button without registering it. Qualitatively, three themes emerged: GuideMe scaffolded intent articulation, minimised visual search via its 'glowing spot' anchor, and acted as a socially-safe supplement when human help was unavailable. A minority raised privacy concerns about screen reading.
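To put the reported means on a common footing, the sketch below computes GuideMe's fractional reduction relative to the search baseline for each metric, and the per-comparison significance threshold implied by a Bonferroni correction over the three pairwise comparisons. The means are copied from the paragraph above. The family-wise alpha of 0.05 is an assumption, since this summary does not state the level the authors used, and the function names are hypothetical.

```python
# Condition means as reported in the key findings above.
MEANS = {
    "completion_time_s": {"guideme": 56.5, "search": 82.9, "in_person": 48.7},
    "ui_search_time_s": {"guideme": 1.26, "search": 4.23, "in_person": 0.81},
    "incorrect_clicks": {"guideme": 0.11, "search": 2.94, "in_person": 0.55},
    "mental_demand": {"guideme": 1.72, "search": 5.33, "in_person": 1.44},
}


def reduction_vs(metric: str, baseline: str, condition: str = "guideme") -> float:
    """Fractional reduction of `condition` relative to `baseline` (0.3 = 30% lower)."""
    row = MEANS[metric]
    return 1 - row[condition] / row[baseline]


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison threshold under a Bonferroni correction."""
    return alpha / n_comparisons


# Three conditions give three pairwise comparisons.
# ASSUMPTION: family-wise alpha = 0.05 (not stated in the summary).
threshold = bonferroni_alpha(0.05, 3)

for metric in MEANS:
    print(f"{metric}: {reduction_vs(metric, 'search'):.0%} lower than search")
```

Relative to search, GuideMe's means come out roughly 32% lower for completion time, 70% lower for UI searching time, 96% lower for incorrect clicks, and 68% lower for mental demand. These are ratios of means only and say nothing about variance or significance, which the paper addresses with the tests named above.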

Relevance

For practitioners, this paper is a concrete design case study in cognitive accessibility for a population that digital inclusion efforts consistently underserve. The contribution is not the VLM itself but the interaction pattern: clarifying questions to offload intent articulation, plus a visual anchor that functions as a machine-generated deictic gesture. That pattern generalises well beyond older adults to anyone with cognitive load constraints - people with aphasia, dementia, learning disabilities, or simply those encountering an unfamiliar interface under stress. The study also offers a usable pattern for reducing screen-search burden using standard Android primitives (Accessibility Service + overlay windows) that accessibility engineers already know. Limitations are significant: all participants were Chinese and most used only mainstream apps (WeChat, Taobao, Alipay); the VLM requires cloud processing, which raises privacy questions that the authors acknowledge but do not resolve; and the system depends on well-structured Accessibility trees, so poorly-built apps will degrade the experience.
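The dependence on well-structured Accessibility trees can be made concrete with a toy example. The sketch below is an assumption-laden simplification, not the authors' pipeline: Node is a hypothetical stand-in for an Android accessibility node (carrying only text, a content description, and screen bounds), and find_by_label is a hypothetical exact-match lookup. It shows why an unlabeled control gives the tree nothing to match against.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical, simplified stand-in for one accessibility-tree node."""
    text: Optional[str] = None
    content_description: Optional[str] = None
    bounds: Tuple[int, int, int, int] = (0, 0, 0, 0)  # left, top, right, bottom
    children: List["Node"] = field(default_factory=list)


def find_by_label(root: Node, label: str) -> Optional[Node]:
    """Depth-first search for the first node whose text or description matches."""
    wanted = label.strip().lower()
    stack = [root]
    while stack:
        node = stack.pop()
        for candidate in (node.text, node.content_description):
            if candidate and candidate.strip().lower() == wanted:
                return node
        # Push children in reverse so they are visited in document order.
        stack.extend(reversed(node.children))
    return None


# Made-up screen: one properly labeled button and one unlabeled icon.
tree = Node(children=[
    Node(text="Pay", bounds=(100, 200, 300, 260)),
    Node(bounds=(400, 200, 460, 260)),  # icon with no text or description
])

pay = find_by_label(tree, "Pay")
scan = find_by_label(tree, "Scan")  # the unlabeled icon cannot be matched
```

Here the labeled button resolves to bounds that an overlay could highlight, while the lookup for the unlabeled icon returns None. In such cases a system would have to fall back on the screenshot alone, which is the degradation the limitation describes.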

Tags: older adults · aging · cognitive accessibility · vision-language model · conversational agent · independent learning · smartphone accessibility · in-situ guidance · mobile accessibility · assistive technology · intent articulation · clarifying questions