Beyond Descriptions: A Generative Scene2Audio Framework for Blind and Low-Vision Users to Experience Vista Landscapes

Chitralekha Gupta, Jing Peng, Ashwin Ram, Shreyas Sridhar, Christophe Jouffrais, Suranga Nanayakkara · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791655

Summary

Scene description apps (Seeing AI, Be My AI, Envision) do a reasonable job of telling blind and low-vision (BLV) users what is in front of them, but they are built for utility ('there is a chair'). They fall short on the aesthetic, leisurely, and emotional experience of distant landscapes: a mountain vista, a beach at sunset, an urban skyline. Gupta and colleagues argue that these 'Vista spaces' (Montello's far-field psychological space, apprehended from a single vantage point without locomotion) matter for well-being and cognitive engagement, not just orientation, and that the transactional speech-only approach imposes cognitive load while stripping the experience of its gist. The paper proposes Scene2Audio, a two-stage generative framework that turns a Vista image into a layered non-verbal soundscape. Stage 1 (Salient Objects Identification) uses GPT-4 to extract sonic objects from the image as noun+verb action phrases (e.g. 'cows mooing', 'leaves rustling', 'church bell ringing') and labels each as discrete or continuous. Stage 2 (Audio Scene Composition) feeds each phrase into a text-to-audio model (AudioGen), applies psychoacoustic layering (discrete foreground sounds weighted at 0.8, continuous background sounds at 0.2, following Foley conventions), uses an onset-detection method to minimise jarring repetitive events, and mixes the layers into a scene-level clip. Three studies evaluate it: (1) a listening test with 21 sighted participants benchmarking Scene2Audio against Im2Wav and Im2text2audio on image-to-sound matching accuracy; (2) a controlled lab study with 11 BLV participants comparing four audio modes: Speech-only, Audio-only, Overlay (speech and non-verbal audio played simultaneously), and Overlay-Concat (per-object speech and non-verbal audio, concatenated); (3) a week-long in-the-wild deployment of the Sonic Vista mobile app, in which 7 BLV participants submitted 77 photos.
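The Stage 2 layering step can be illustrated with a minimal sketch. It is not the authors' implementation. Only the 0.8 and 0.2 weights and the discrete/continuous labels come from the summary above. The function names (mix_scene, tone), the peak normalisation, and the sine tones standing in for AudioGen clips are all assumptions made for illustration, and the onset-detection step is omitted.

```python
import math

# Weights reported in the summary: discrete sounds sit in the foreground,
# continuous sounds in the background.
FOREGROUND_GAIN = 0.8
BACKGROUND_GAIN = 0.2


def mix_scene(layers):
    """Mix (samples, kind) layers into one clip.

    kind is 'discrete' or 'continuous'; samples are floats in [-1, 1].
    Hypothetical helper: shorter layers are treated as silent once they end.
    """
    if not layers:
        return []
    length = max(len(samples) for samples, _ in layers)
    mix = [0.0] * length
    for samples, kind in layers:
        gain = FOREGROUND_GAIN if kind == "discrete" else BACKGROUND_GAIN
        for i, sample in enumerate(samples):
            mix[i] += gain * sample
    # Assumption: rescale only if the summed layers would clip.
    peak = max((abs(s) for s in mix), default=0.0)
    if peak > 1.0:
        mix = [s / peak for s in mix]
    return mix


def tone(freq_hz, seconds, sample_rate=8000):
    """Synthetic sine tone used here as a stand-in for a generated clip."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


bell = tone(880.0, 0.5)    # stand-in for 'church bell ringing' (discrete)
leaves = tone(220.0, 1.0)  # stand-in for 'leaves rustling' (continuous)
scene = mix_scene([(bell, "discrete"), (leaves, "continuous")])
```

With one discrete and one continuous layer the weights sum to 1.0, so the mix stays within range without rescaling. Several simultaneous foreground sounds could exceed it, which is why the sketch includes a normalisation guard.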

Key findings

In the sighted-listener benchmark, Scene2Audio averaged 62.4% scene-matching accuracy versus 29.6% for Im2text2audio and 17.3% for Im2Wav (p<.05). The largest gains came on nature scenes (seabeach 76%, countryside 81%, park 90%); performance was weaker on urban scenes (street: 24% versus 71% for Im2Wav), where inanimate objects produced ambiguous cues. In the lab study, Overlay was the most preferred mode (mean rank 3.36 out of 4), significantly better than Speech-only (2.23), Audio-only (1.68), and Overlay-Concat (2.73). Overlay scored highest on comprehension (6.23/7 self-rated), immersion (experienced realism 5.18, significantly higher than Speech-only's 3.77, p<.05), and engagement (enjoyability 5.50 versus 3.91 for Audio-only, p<.01), while NASA-TLX cognitive load for Overlay (mental demand 2.41, effort 2.27) was no higher than for Speech-only and significantly lower than for Audio-only. Audio-only was confusing and hard to imagine without speech anchoring; Speech-only was clear but 'feels stationary' and lacked immersion. The in-the-wild results flipped the preference: outdoor mobile users preferred the more detailed Overlay-Concat mode (rated highest for 'clearest info' and 'most preferred' across indoor and outdoor scenes), because real-world users already know roughly what is around them and want confirmation and detail rather than brevity. Four qualitative themes emerged: a preference for rich descriptions in the wild; sound effects balancing immersion and precision (with hallucinations, such as a rooster for a red panda, being jarring); context-dependent usage (relaxation versus utility mode); and requests for lower latency, OCR, and wearable form factors. Latency averaged 15.9 s end-to-end, too slow for time-sensitive tasks like reading a bus number.
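As a quick sanity check on the reported preference ranks, the short sketch below uses only the four figures quoted above. It assumes each participant ranked all four modes, with ties sharing an averaged rank, and that 4 denotes the most preferred mode; neither detail is stated explicitly in this summary. Under those assumptions the mean ranks must sum to 1 + 2 + 3 + 4 = 10, up to rounding.

```python
# Mean preference ranks as quoted in the summary (assumed: 4 = most preferred).
mean_ranks = {
    "Overlay": 3.36,
    "Overlay-Concat": 2.73,
    "Speech-only": 2.23,
    "Audio-only": 1.68,
}

n_modes = len(mean_ranks)
# If every participant ranks all modes, each ranking sums to 1 + 2 + ... + n.
expected_total = n_modes * (n_modes + 1) / 2
reported_total = sum(mean_ranks.values())

# Each figure is rounded to two decimals, so allow 0.005 of error per mode.
is_consistent = abs(reported_total - expected_total) < 0.005 * n_modes

# Modes ordered from most to least preferred.
ordering = sorted(mean_ranks, key=mean_ranks.get, reverse=True)
```

The reported values pass this check, and the resulting order places Overlay-Concat second, ahead of Speech-only and Audio-only.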

Relevance

For accessibility practitioners, this is an important reframing: BLV users need leisure, aesthetic, and emotional engagement with their environments, not only wayfinding and object identification. The paper demonstrates that non-verbal audio is a viable complement to speech descriptions, not a replacement for them, and that the best combination mode shifts with context (brief Overlay for lab or unfamiliar scenes, detailed Overlay-Concat for familiar real-world scenes). The design principle that discrete foreground sounds should be sparse and continuous background sounds should be de-emphasised (psychoacoustic foreground/background weighting) transfers directly to other sonified interfaces. The limitations are substantial. The samples are small: only 11 lab participants and 7 in-the-wild participants (all from one Singapore organisation), all of whom were instructed to imagine a hilltop/rooftop vantage, which may not match their prior spatial experience. Scene2Audio inherits audio-generation hallucinations (P4's 'rooster for red panda'), making it unsafe for navigation, hazard awareness, or decision-making; the authors explicitly caution that it is a relaxation-and-exploration tool. The high latency and the mismatch between lab and real-world preferences show that accessible AI features designed in controlled settings routinely need re-validation in deployment. Urban scene performance is weak, and spatialisation was deliberately excluded. This is a promising direction rather than a shippable feature.

Tags: blind and low vision · sonification · spatial audio · generative AI · psychoacoustics · scene description · image description · mobile accessibility · assistive technology · aesthetic experience