EditScribe: Non-Visual Image Editing with Natural Language Verification Loops

Ruei-Che Chang, Yuxuan Liu, Lotus Zhang, Anhong Guo · 2024 · ASSETS '24: Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility · doi:10.1145/3663548.3675599

Summary

This paper introduces EditScribe, a prototype system that makes image editing accessible to blind and low vision (BLV) users through natural language interaction powered by large multimodal models (LMMs). Image editing is inherently visual and iterative: users need to see the current state of an image, make precise manipulations, and evaluate the results. This makes it one of the most challenging creative tasks to support non-visually. EditScribe addresses this through a structured workflow with three phases: comprehension, editing, and verification. In the comprehension phase, users receive AI-generated descriptions of the image at both the general and object levels, which help them build a mental model of the visual content. In the editing phase, users specify edits using open-ended natural language prompts (e.g., "remove the person on the left" or "change the sky to sunset colours"). The system processes these prompts through a pipeline that combines GPT-4V for understanding and DALL-E for image generation. In the verification phase, which is the paper's core contribution, EditScribe provides four types of feedback: (1) a summary of visual changes between the before and after images, (2) an AI judgement on whether the edit matches the user's intent, (3) an updated general description, and (4) updated object-level descriptions. Users can ask follow-up questions to probe specific aspects of the edit before deciding to accept, undo, or refine it. The system was evaluated with ten BLV participants who performed a series of image editing tasks.
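
The edit, verify, and decide cycle described above can be sketched as a small control loop. This is an illustrative sketch only, not the authors' implementation: every name in it (Feedback, Decision, Session, edit_with_verification, and the stub functions) is hypothetical, the model calls are replaced by string-building stubs rather than real GPT-4V or DALL-E requests, and the choice to retry a refined prompt from the pre-edit image is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Feedback:
    """The four verification feedback types named in the summary."""
    change_summary: str             # what changed between before and after
    ai_judgement: str               # does the edit appear to match the prompt?
    general_description: str        # updated description of the whole image
    object_descriptions: List[str]  # updated per-object descriptions


@dataclass
class Decision:
    """The user's response to the feedback (hypothetical structure)."""
    action: str                     # "accept", "undo", or "refine"
    refined_prompt: Optional[str] = None


@dataclass
class Session:
    """Accepted image states; strings stand in for real image data."""
    history: List[str] = field(default_factory=list)

    @property
    def current(self) -> str:
        return self.history[-1]


def edit_with_verification(
    session: Session,
    prompt: str,
    apply_edit: Callable[[str, str], str],
    verify: Callable[[str, str, str], Feedback],
    decide: Callable[[Feedback], Decision],
    max_rounds: int = 5,
) -> int:
    """Run the loop until the user accepts or undoes; return rounds used."""
    for rounds in range(1, max_rounds + 1):
        before = session.current
        after = apply_edit(before, prompt)        # tentative edit
        feedback = verify(before, after, prompt)  # four feedback types
        decision = decide(feedback)
        if decision.action == "accept":
            session.history.append(after)         # commit the edit
            return rounds
        if decision.action == "undo":
            return rounds                         # keep the pre-edit image
        # "refine": assumption, retry from the pre-edit image with a new prompt
        prompt = decision.refined_prompt or prompt
    return max_rounds


# --- Stubs standing in for the model pipeline and the user (all hypothetical) ---
def stub_apply_edit(image: str, prompt: str) -> str:
    return f"{image} -> [{prompt}]"


def stub_verify(before: str, after: str, prompt: str) -> Feedback:
    return Feedback(
        change_summary=f"Changed '{before}' into '{after}'.",
        ai_judgement=f"The edit appears to follow the prompt: {prompt}",
        general_description=f"Image state: {after}",
        object_descriptions=[],
    )


def scripted_user(decisions: List[Decision]) -> Callable[[Feedback], Decision]:
    remaining = iter(decisions)
    return lambda feedback: next(remaining)


session = Session(history=["beach.png"])
rounds = edit_with_verification(
    session,
    "remove the person on the left",
    stub_apply_edit,
    stub_verify,
    scripted_user([
        Decision("refine", "remove the person on the far left"),
        Decision("accept"),
    ]),
)
print(rounds, session.current)
```

In this sketch an edit stays tentative until the user accepts it, so undo simply discards the candidate image. A real system would likely also let users ask follow-up questions between verification and decision, which is omitted here.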

Key findings

All ten BLV participants successfully completed image editing tasks using EditScribe, including object removal, colour changes, style transfers, and adding new elements. The verification feedback was critical to participant confidence: without it, participants had no way to assess whether edits matched their intent. The four feedback types served complementary roles: change summaries helped users quickly understand what changed, AI judgements provided a binary assessment of success, and the updated general and object-level descriptions let users build a comprehensive mental model of the edited image. Participants developed distinct prompting strategies: some gave highly specific instructions ("move the cat 2 inches to the right"), while others used more abstract descriptions ("make it feel warmer"). The study revealed that spatial editing (repositioning, resizing) was more difficult to specify and verify through language than semantic editing (colour changes, style transfers, object removal). Participants expressed a strong desire for undo/redo functionality and the ability to make incremental refinements. Trust calibration was a recurring theme: participants needed to develop appropriate trust in AI-generated feedback, and some expressed concern about AI hallucinations in the verification descriptions. The iterative edit-verify-refine loop was essential, with participants averaging two to three verification cycles per edit.

Relevance

EditScribe represents a significant advance in making creative visual tools accessible to BLV users, a population historically excluded from image editing and visual authoring. The natural language verification loop paradigm has implications beyond image editing: any visual creative task (graphic design, video editing, data visualization authoring) could benefit from similar feedback mechanisms. For accessibility practitioners, the work demonstrates that AI-powered description and verification can bridge the gap between visual output and non-visual comprehension. It also highlights important limitations: spatial reasoning through language remains difficult, AI descriptions can hallucinate or omit details, and building trust in AI feedback requires careful design. The finding that BLV users want creative control, not just accessibility accommodations, reinforces the principle that accessible tools should empower creative agency rather than merely provide functional access. The verification loop concept also has broader relevance for any scenario where users cannot directly perceive the output of their actions.

Tags: blind and low vision · image editing · generative AI · large multimodal models · natural language interaction · creativity support · visual authoring · non-visual interaction