Towards LLM-powered Assistive Drone for Blind and Low Vision Users

Yize Wei, Ibnu Taimiyyah Bin Adam, Hanjun Wu, Moritz Messerschmidt, Wei Tsang Ooi, Christophe Jouffrais, Suranga Nanayakkara · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791343

Summary

Wei and colleagues built and evaluated a voice-based assistive drone prototype for blind and low-vision (BLV) users that uses GPT-4o in two stages: first to generate step-by-step Python drone-control code from natural-language commands, then to interpret images captured by the drone's onboard camera and answer questions about the environment. The hardware combines a Tello EDU quadcopter (5 MP front-facing camera, 13-min flight time), a ThinkPad laptop running the system logic, and a thumb-sized wireless Flic button for push-to-talk input. The software chains Google Speech-to-Text, GPT-4o code generation, Tello SDK execution, and GPT-4o vision-based tasks, with Google Text-to-Speech delivering verbal feedback. The research has three phases: a formative study with nine BLV participants exploring envisioned use cases and preferred interaction modalities; a three-iteration participatory design process drawing on feedback from three BLV co-designers and five domain experts (in HCI, haptics, human-drone interaction, design, and audio); and a final user study in which six additional BLV participants used the prototype in a controlled indoor environment for three exploration tasks (object localization, object recognition, spatial orientation), assessed through task success ratings, SUS usability scores, and semi-structured interviews. The authors frame the work through Norman's execution gap, treating LLMs as a bridge for 'intention as action' that lets non-programmer end-users direct robots through natural-language commands rather than fixed command sets or joysticks.
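
The software chain described above can be sketched as a sequence of five stages. This is a minimal illustration only, not the authors' code: the `Stages` container, the `handle_query` function, and every stub below are hypothetical names, and the real services (speech-to-text, GPT-4o, the Tello SDK, text-to-speech) are replaced with placeholder lambdas so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stages:
    """Hypothetical container for the five pipeline stages."""
    transcribe: Callable[[bytes], str]      # speech-to-text
    generate_code: Callable[[str], str]     # LLM: command -> drone-control code
    execute: Callable[[str], bytes]         # run the code, return a captured image
    describe: Callable[[bytes, str], str]   # LLM vision: image + question -> answer
    speak: Callable[[str], None]            # text-to-speech


def handle_query(audio: bytes, stages: Stages) -> str:
    """Run one spoken query through the stages in order."""
    command = stages.transcribe(audio)
    code = stages.generate_code(command)
    image = stages.execute(code)
    answer = stages.describe(image, command)
    stages.speak(answer)
    return answer


# Stub stages (assumptions for illustration; no real API is called).
spoken: List[str] = []
stub = Stages(
    transcribe=lambda audio: audio.decode(),
    generate_code=lambda command: "move_forward(100)",
    execute=lambda code: b"image-bytes",
    describe=lambda image, question: "stub answer to: " + question,
    speak=spoken.append,
)

result = handle_query(b"find my keys", stub)
```

Passing the stages in as callables keeps each service swappable, which also makes it easy to test one stage at a time.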

Key findings

Formative findings: BLV participants wanted drones for tasks current assistive tech struggles with: locating misplaced objects in hard-to-reach places, mapping unfamiliar layouts (washroom, exit), previewing walking path conditions far ahead, and reading bus numbers at a distance. Voice emerged as the preferred input, but with strong emphasis on natural, flexible phrasing; rigid wake-words and assistant syntax were described as frustrating. Audio output had to be concise because BLV users depend on ambient hearing for safety.

Evaluation results: the prototype achieved a mean SUS of 73.3 (median 73.7, SD 14.9), with task success ratings of 4.0/5 for object localization, 3.8/5 for object recognition, and 3.2/5 for spatial orientation. Spatial orientation was hardest because LLMs struggle with spatial reasoning and because the drone camera is often oriented opposite to the user. Participants issued 71 queries (39% initial, 35% follow-up, 25% other), with 85% reaching the vision-based task stage. AI error rates across the pipeline: 5.6% speech-to-text, 7.0% code generation, 6.4% vision recognition (including hallucinations), and 6.4% partially correct answers.

Iterative design changes: replacing wake-word activation with the tactile Flic button, adding multi-threading to run API calls concurrently with drone movement, limiting response verbosity, and adding multi-turn conversation memory.

Participant concerns: propeller noise masking ambient safety cues, drones lacking the social recognition of white canes, privacy of camera capture, and battery life.
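
For readers unfamiliar with how a SUS figure such as the mean of 73.3 reported above is derived, the sketch below applies the standard scoring rule for the ten-item System Usability Scale. The summary does not say how the authors computed their scores, so this is an assumption about the usual procedure, and the example responses are invented for illustration.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one participant's ten SUS responses (each 1-5) on a 0-100 scale."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly ten items")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded: contribute (response - 1).
            total += response - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded: contribute (5 - response).
            total += 5 - response
    return total * 2.5


# Hypothetical participant: agrees with positive items, disagrees with negative ones.
example = sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])
print(example)  # 75.0
```

A study-level mean is then the average of the per-participant scores.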

Relevance

For accessibility practitioners, the paper is a working example of how LLMs lower the barrier to end-user programming of assistive robots ('intention as action'), and it puts concrete safety constraints into practice: predefined function libraries, waypoint-only movement, and verbal confirmation before execution. The formative study is a useful catalog of BLV-specific use cases that drones can address and that wearable cameras or smart canes cannot, namely egocentric views 'beyond one's immediate vicinity.' Three design takeaways generalize: use tactile push-to-talk instead of wake-words for BLV users; keep audio output short because hearing is a safety channel; and accept that LLM errors are inherent, so build verification and follow-up question flows. Limitations include the indoor-lab setting, short-term single-session observations, the six-participant evaluation sample, heavy reliance on constant network connectivity, and the lack of evaluation against camera-based assistive tools such as Be My Eyes or Seeing AI. The discussion also flags ethical and social barriers (propeller noise, privacy, legal ambiguity around civilian drones, lack of public recognition as an assistive device) that would likely dominate any real-world deployment beyond the lab.
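
One way to enforce a predefined function library on LLM-generated code is to parse the code and reject anything that is not a plain call to an allowed function. The sketch below uses Python's standard `ast` module for this. It is an illustration under stated assumptions, not the authors' mechanism: the function names in `ALLOWED_CALLS` are hypothetical, and a static check of this kind is not a complete sandbox.

```python
import ast

# Hypothetical library of permitted drone functions (names are assumptions).
ALLOWED_CALLS = {"takeoff", "land", "move_to", "rotate", "capture_image"}


def is_safe(source: str) -> bool:
    """Return True only if every call in `source` targets an allowed function."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        # Reject imports and attribute access such as os.system outright.
        if isinstance(node, (ast.Import, ast.ImportFrom, ast.Attribute)):
            return False
        if isinstance(node, ast.Call):
            # Only bare-name calls to whitelisted functions are accepted.
            if not (isinstance(node.func, ast.Name) and node.func.id in ALLOWED_CALLS):
                return False
    return True


generated_ok = "takeoff()\nmove_to(1, 0, 1)\ncapture_image()\nland()"
generated_bad = "import os\nos.system('echo unsafe')"

ok_result = is_safe(generated_ok)
bad_result = is_safe(generated_bad)
```

A check like this would sit between code generation and execution, alongside the verbal confirmation step, so that rejected code never reaches the drone.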

Tags: blind and low vision · assistive technology · drone · LLM · large language models · voice interaction · human-robot interaction · AI accessibility · participatory design · assistive robotics