Mapping Movies: A Mind-Map Approach to Aphasia-Friendly Video

Shayan Bali, Alexandre Nevsky, Filip Bircanin, Timothy Neate · 2026 · Extended Abstracts of the 2026 CHI Conference on Human Factors in Computing Systems (CHI EA '26) · doi:10.1145/3772363.3798702

Summary

Bali, Nevsky, Bircanin, and Neate (King's College London) target an under-served corner of media accessibility: viewers with aphasia and other complex communication needs (CCNs), for whom subtitles, audio description, and existing simplified-media interventions do not substantially lower comprehension load. They propose VideoMind, a web-based AI tool that converts any input video into an interactive mind-map summary. The backend chains five components: Whisper (automatic speech recognition) for transcription; a Llama-3.1-8B-Instruct model for emoji prediction, keyphrase extraction, and bullet-wise summarisation; a Llama-3.2-1B variant for topic modelling; named-entity recognition for surfacing people, places, and dates; and CLIP-ViT-B/32 for extracting a representative keyframe per segment. The frontend presents the result as a mind map organised chronologically and thematically: a central topic node, one child node per video segment (with keyframe, short summary, keywords, emojis, and play/listen buttons), and navigation synchronised with video playback. Three views are offered: Split View (video + map), Video Summary (video + live transcription with surrounding context), and a full-screen Mind Map View. Design choices follow the aphasia-friendly design literature: reduced visual clutter, stepped revelation of interface elements, consistent icon-text pairing, large-font readability, and text-to-speech (TTS) to reduce reading demands. Two people with aphasia who are experienced prototype co-designers (7+ years of feedback practice) trialled the tool across news, historical, and documentary clips in one-to-one think-aloud sessions, followed by a structured questionnaire and interview.
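To make the frontend structure concrete, the following is a minimal Python sketch of a mind map whose segment nodes stay synchronised with playback. It is not the authors' implementation: the class names, field names, and the `active_index` helper are all hypothetical, and it assumes segments are sorted by start time and do not overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SegmentNode:
    """One child node of the mind map (all field names are illustrative)."""
    start: float                      # segment start, in seconds
    end: float                        # segment end, in seconds
    summary: str                      # short summary shown on the node
    keywords: List[str] = field(default_factory=list)
    emojis: List[str] = field(default_factory=list)
    keyframe: str = ""                # path or URL of the representative frame


@dataclass
class MindMap:
    topic: str                        # central topic node
    segments: List[SegmentNode]       # assumed sorted by start, non-overlapping

    def active_index(self, t: float) -> Optional[int]:
        """Return the index of the segment containing playback time t, else None."""
        starts = [s.start for s in self.segments]
        i = bisect_right(starts, t) - 1   # last segment starting at or before t
        if i >= 0 and t < self.segments[i].end:
            return i
        return None


demo = MindMap(
    topic="Evening news",
    segments=[
        SegmentNode(0.0, 42.0, "Headlines"),
        SegmentNode(42.0, 95.5, "Weather"),
    ],
)
print(demo.active_index(50.0))  # index of the node to highlight at 50 s
```

A player would call something like `active_index` on each time-update event and highlight the returned node; a binary search keeps the lookup cheap even for long videos with many segments.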

Key findings

Thematic analysis surfaced two main findings. First, 'Selective Navigation for Focused Engagement': the participants with aphasia did not passively consume videos start-to-finish but instead used the mind map to triage content, skip fatigue-inducing segments, and jump directly to items of interest without scrubbing. P1 described wanting to 'jump to the best part of the news, cut the crap out and just listen to what I want'; both participants found keyframe-and-summary previews a useful decision aid ('quickly look at the map'). Chunking the video into manageable, revisitable units was considered as important as the summaries themselves, because it let participants treat segments as independent tasks to be returned to across days. Second, 'Managing Overload through Simplified Interfaces and Controls': the Split View proved cognitively taxing, with participants reporting 'too much information' (P2) and 'I'm looking at this, this, this, this' (P1). Participants proposed on-demand overlays rather than simultaneous presentation, an interaction that automatically focuses the active node, and a 'Shazam-like' quick lookup for instant context about a confusing segment. The authors synthesise these into two design implications: 'restricted control' as the key benefit (an overview-first navigation model that lets users decide what to watch, in what order, and in what dose) and 'adaptive disclosure' (multimodal summarisation materials that appear on demand rather than always-on, to avoid introducing new barriers through interface density).
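The overview-first navigation described above can be illustrated with a short Python sketch: the viewer picks nodes on the map, in any order, and the player receives only those time ranges. The function name `build_playlist` and the tuple layout are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Range = Tuple[float, float]  # (start, end) of a segment, in seconds


def build_playlist(segments: Sequence[Range], selected: Sequence[int]) -> Tuple[List[Range], float]:
    """Return the chosen segments in the order the viewer picked them,
    plus the total viewing time in seconds."""
    ranges = [segments[i] for i in selected]
    total = sum(end - start for start, end in ranges)
    return ranges, total


# Hypothetical three-segment clip; the viewer wants the last segment first,
# then the opening, and skips the middle segment entirely.
segments = [(0.0, 40.0), (40.0, 100.0), (100.0, 130.0)]
ranges, total = build_playlist(segments, [2, 0])
print(ranges, total)
```

Preserving the selection order, rather than sorting by time, reflects the point that viewers decide what to watch, in what order, and in what dose.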

Relevance

This is an early-stage prototype paper (N=2), so the contribution is design-space shaping rather than validated impact. The design moves are nonetheless instructive for practitioners working on media accessibility beyond the standard interventions for deaf and hard-of-hearing (DHH) and blind and low-vision (BLV) audiences: people with aphasia, intellectual disability, early-stage dementia, ADHD, and autism share the pattern of needing restructured rather than merely annotated content, and an overview-first interactive summarisation layer is a direction that mainstream video features such as Amazon X-Ray currently do not pursue. The critique of the side-by-side 'split view' as cognitively exhausting, despite its being an obvious engineering choice, is a useful caution against layering additional modalities on top of video without controls for when and how they appear. The pipeline itself is buildable with open-weight models (Whisper, Llama 3.1/3.2, CLIP) running locally, so the technical bar for experimentation is lower than with proprietary frontier APIs. Limitations are openly acknowledged: a tiny sample, expert co-designers rather than typical users, no longitudinal use, no measurement of comprehension gain beyond self-report, and dependence on metadata-model quality, since unreliable summaries risk misleading rather than helping viewers with aphasia.

Tags: aphasia · complex communication needs · video accessibility · cognitive accessibility · mind maps · artificial intelligence · media accessibility · text-to-speech · visualisation