StepWrite: Adaptive Planning for Speech-Driven Text Generation

Hamza El Alaoui, Atieh Taheri, Yi-Hao Peng, Jeffrey P. Bigham · 2025 · Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST '25) · doi:10.1145/3746059.3747610

Summary

This paper introduces StepWrite, an LLM-powered, voice-based writing system for structured, hands-free and eyes-free composition of longer-form text. Speech-to-text tools handle short dictation well, but composing a structured email or a detailed response requires planning, context tracking, and revision, capabilities that conventional dictation and voice assistants lack. StepWrite addresses this by decomposing writing into manageable subtasks through an adaptive question-and-answer dialogue. Grounded in pedagogical scaffolding principles, the system asks focused, contextually aware questions one at a time (e.g., "When would you like to meet him?", "Should I include your address?") and builds up the content incrementally instead of expecting users to dictate a complete message.

The system architecture is modular. Audio processing with noise filtering and voice activity detection feeds into OpenAI Whisper for transcription. An LLM pipeline then generates adaptive questions (using GPT-4.1-mini), classifies tone (14 categories, including formal, empathetic, urgent, and apologetic), generates the text, and runs multi-pass fact checking (using GPT-4o-mini) to verify that the draft is consistent with the Q&A inputs. Users navigate with voice commands ("skip question", "go back", "modify answer", "finish"), which are recognised through fuzzy token-level matching with a 0.85 cosine similarity threshold to prevent accidental triggers.
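The command-matching step can be illustrated with a minimal sketch. The command list and the 0.85 threshold come from the summary above; everything else (the bag-of-words token vectors, the tokeniser, and the function names `tokens`, `cosine`, and `match_command`) is an assumption made for illustration and may differ from how StepWrite actually computes similarity.

```python
import math
import re
from collections import Counter

# Commands and threshold as reported in the summary; the matching logic
# below is a hypothetical reconstruction, not the authors' implementation.
COMMANDS = ["skip question", "go back", "modify answer", "finish"]
THRESHOLD = 0.85


def tokens(text):
    """Lower-case the text and count its word tokens (bag of words)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_command(utterance, commands=COMMANDS, threshold=THRESHOLD):
    """Return the best-matching command, or None if below the threshold."""
    vec = tokens(utterance)
    best, best_score = None, 0.0
    for command in commands:
        score = cosine(vec, tokens(command))
        if score > best_score:
            best, best_score = command, score
    return best if best_score >= threshold else None


if __name__ == "__main__":
    print(match_command("Skip question"))  # skip question
    print(match_command("I want to go back to the office tomorrow"))  # None
```

Under these assumptions, an answer that merely contains the words "go back" scores about 0.43 against the "go back" command and is treated as ordinary dictation. This is the kind of accidental trigger a high threshold is meant to prevent. The same strictness also rejects "please skip question" (about 0.82), so a real system would likely need a more forgiving token representation than plain word counts.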

Key findings

In a within-subjects study with 25 participants completing email writing and reply tasks under both stationary and mobile (walking) conditions, StepWrite significantly outperformed Microsoft Word dictation and ChatGPT Advanced Voice Mode (AVM) across multiple dimensions. StepWrite drafts required approximately 86% fewer word-level edits than dictation and 45% fewer than ChatGPT AVM. StepWrite achieved the lowest NASA-TLX workload score (M=16.8 vs ChatGPT 22.5 vs Dictation 49.2, p<.001), with participants reporting minimal effort and frustration. System Usability Scale (SUS) scores were 80.0 (StepWrite) and 83.2 (ChatGPT AVM), both "excellent", versus 60.0 for Dictation (below the 68 benchmark). StepWrite earned the highest emotional engagement scores across all five dimensions (engagement, enjoyment, motivation, stress reduction, creativity). Critically, independent annotators classified 77.9% of StepWrite's questions as "necessary" (an Essential Question Fraction of 0.779), meaning that roughly four of every five questions directly enabled content that survived into users' final revised texts. The tone classifier achieved 91.7% overall accuracy across 14 tone categories. StepWrite did take longer overall (Write: M=248s vs ChatGPT 150s vs Dictation 154s), but it front-loaded effort into drafting while minimising revision time (~62s vs Dictation's >100s). Semantic diversity analysis showed that StepWrite drafts required virtually no meaning-level revision (M=0.011), confirming tight alignment with user intent from the first draft.
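Two of the metrics above lend themselves to a short worked sketch. The code below assumes that "word-level edits" can be approximated by a Levenshtein distance computed over whitespace-separated words, and that the Essential Question Fraction is the share of questions labelled "necessary". The function names, the label strings, and the toy inputs are hypothetical; the paper's exact procedures may differ.

```python
def word_edit_distance(draft, final):
    """Word-level Levenshtein distance between a draft and its revision.

    Assumption: edits are counted as word insertions, deletions, and
    substitutions; the study's exact edit metric may be defined differently.
    """
    a, b = draft.split(), final.split()
    prev = list(range(len(b) + 1))  # distances from an empty draft prefix
    for i, word_a in enumerate(a, 1):
        curr = [i]
        for j, word_b in enumerate(b, 1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def essential_question_fraction(labels):
    """Share of questions annotated as 'necessary' (hypothetical label)."""
    if not labels:
        raise ValueError("at least one annotated question is required")
    return sum(1 for label in labels if label == "necessary") / len(labels)


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the study.
    print(word_edit_distance("meet me at noon on Friday",
                             "meet me at 1 pm on Friday"))  # 2
    print(essential_question_fraction(
        ["necessary", "necessary", "necessary", "necessary", "unnecessary"]))  # 0.8
```

Read this way, the reported Essential Question Fraction of 0.779 corresponds to a little under the 4-in-5 ratio of the toy example.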

Relevance

This work has significant accessibility implications beyond its primary multitasking framing. A manual wheelchair user in the study noted that with one hand always steering, "it's hard to stop and type on my phone, so I'd use StepWrite for everyday emails or school papers." A participant with a learning disability described the incremental questions as "superior" and said they "let me build a more complete answer." The authors explicitly connect the approach to research showing that structured, step-wise writing interventions benefit people with intellectual and developmental disabilities. For accessibility practitioners, StepWrite demonstrates how adaptive scaffolding (breaking complex cognitive tasks into guided steps) can reduce working memory demands while preserving user agency and authorship. The system's voice-only workflow removes the need for precise finger input and sustained visual focus, making it promising for people with motor impairments, low vision, or learning disabilities. The six design implications for voice-first authoring tools (adaptive over static prompts, process transparency, preference adaptation, lightweight escape hatches, minimising redundant effort, and modality switching support) provide a practical framework for building accessible voice-driven interfaces.

Tags: voice interface · speech-to-text · hands-free interaction · eyes-free interaction · large language models · text composition · cognitive load · adaptive planning · writing assistance · motor impairment · multitasking