Toward 3D Scene Understanding via Audio-description: Kinect-iPad Fusion for the Visually Impaired

Juan Diego Gomez, Sinan Mohammed, Guido Bologna, Thierry Pun · 2011 · Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS 2011) · doi:10.1145/2049536.2049613

Summary

This demonstration paper presents a computer-vision-based framework that combines a Microsoft Kinect 3D depth sensor with an iPad touchscreen to enable visually impaired users to understand the spatial layout of indoor scenes through audio. The system works in several stages. First, the Kinect captures colour and depth data of the environment, and a calibration process aligns the depth and colour images to achieve pixel-level correspondence. Objects are then segmented from the scene using depth-based layering: the Kinect's range is partitioned into multiple layers, surfaces within each layer are labelled as objects, and background layers are filtered out. This approach achieves precise real-time segmentation regardless of colour or lighting conditions, provided objects do not occlude each other. Segmented objects are described by geometric feature vectors (perimeter, area, eccentricity, major axis, bounding box size) and classified using a multi-layer artificial neural network trained offline on in-situ images. Once recognised, each object class is assigned a distinct instrument sound, and the objects' positions are mapped onto the iPad as a proportional top-view of the real scene. Users explore the scene by touching the iPad: finger contact triggers the corresponding object's sound, and spatial virtual sound sources create the illusion that the sound originates from the object's real-world location.
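The depth-layering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the layer boundaries, the 4-connected flood fill, and the `background_from` cutoff are all assumptions chosen for illustration, and only two of the listed features (area and bounding box) are computed.

```python
from collections import deque


def segment_by_depth_layers(depth, near, far, n_layers, background_from):
    """Label connected surfaces within each depth layer (illustrative only).

    depth: 2D list of depth readings in millimetres (0 means no reading).
    Layers with index >= background_from are treated as background and dropped.
    Returns a list of dicts holding the layer index, pixel area and bounding box.
    """
    rows, cols = len(depth), len(depth[0])
    step = (far - near) / n_layers

    def layer_of(d):
        # Readings outside [near, far) belong to no layer.
        if d < near or d >= far:
            return None
        return int((d - near) / step)

    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            layer = layer_of(depth[r][c])
            if seen[r][c] or layer is None or layer >= background_from:
                continue
            # Breadth-first flood fill over 4-connected pixels in the same layer.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and layer_of(depth[ny][nx]) == layer):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            objects.append({
                "layer": layer,
                "area": len(pixels),
                "bbox": (min(ys), min(xs), max(ys), max(xs)),
            })
    return objects


# Toy 4x5 depth map: a near object, a mid-range object, and a far wall.
toy_depth = [
    [2400, 2400, 2400, 2400, 2400],
    [2400,  800,  800, 2400, 2400],
    [2400,  800,  800, 2400, 1300],
    [2400, 2400, 2400, 2400, 1300],
]
toy_objects = segment_by_depth_layers(
    toy_depth, near=500, far=2500, n_layers=4, background_from=3)
```

Because the labelling looks only at depth, two objects that touch within the same layer would merge into one region, which is consistent with the non-occlusion condition noted above.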

Key findings

The system enables blindfolded users to build a mental occupancy grid of the environment through tactile-auditory exploration. In the evaluation, participants first explored the iPad representation while blindfolded, then attempted to physically reconstruct the scene by placing objects back on the table. Comparing photographs of the original and reconstructed scenes showed promising results: users could accurately identify which objects were present and where they were located. The framework addresses a fundamental information bandwidth challenge: vision processes approximately 10^6 bits per second while audition handles only about 10^4 bits per second, a hundredfold gap, so not all visual information can be encoded into sound. The system manages this by focusing on object identity and location rather than attempting full visual encoding. The depth-based segmentation approach is notably robust to lighting and colour variations, and the perspective-invariant analysis enables accurate top-view generation from a single frontal Kinect image. The multi-touch iPad interface provides an intuitive tactile interaction model where the tablet surface acts as a miniature representation of the physical space.
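One way to picture the top-view generation is a pinhole back-projection followed by uniform scaling onto the screen. The sketch below is an assumption-laden illustration, not the paper's method: the focal length `fx`, principal point `cx`, the 1024 x 768 screen size, and both function names are hypothetical placeholders.

```python
def pixel_to_floor_plane(u, depth_m, fx=580.0, cx=320.0):
    """Back-project image column u at depth Z (metres) to a lateral offset X.

    Uses the pinhole relation X = (u - cx) * Z / fx. The values of fx and cx
    are assumed, illustrative intrinsics, not calibrated ones.
    """
    x_m = (u - cx) * depth_m / fx
    return x_m, depth_m


def floor_plane_to_tablet(x_m, z_m, scene_width_m, z_near_m, z_far_m,
                          screen_w=1024, screen_h=768):
    """Map scene metres to screen points with a single scale factor.

    One factor for both axes keeps the top view proportional to the scene.
    Far objects land near the top edge, and X = 0 lands at the horizontal centre.
    """
    scene_depth_m = z_far_m - z_near_m
    scale = min(screen_w / scene_width_m, screen_h / scene_depth_m)
    sx = screen_w / 2 + x_m * scale
    sy = (z_far_m - z_m) * scale
    return sx, sy


# An object seen at the image centre, 1.5 m away, in a 2 m x 2 m scene.
x_m, z_m = pixel_to_floor_plane(u=320, depth_m=1.5)
screen_xy = floor_plane_to_tablet(x_m, z_m, scene_width_m=2.0,
                                  z_near_m=0.5, z_far_m=2.5)
```

Dividing out the depth in the back-projection is what removes the perspective effect: an object's lateral position on the screen depends on its metric offset, not on how large it appears in the frontal image.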

Relevance

This work demonstrates an early approach to combining low-cost consumer depth-sensing technology with touchscreen devices to create accessible spatial representations for people who are blind. The concept of translating 3D scene understanding into a tactile-auditory interface has broad implications for indoor navigation, workspace awareness, and environmental orientation. For accessibility practitioners, the system illustrates several important design principles: using distinct audio cues (instrument sounds) for object categorisation, spatial audio for conveying position, and a tangible touchscreen as a proxy for physical space exploration. The limitation to simple, non-occluding scenes and the requirement for pre-trained object recognition constrain its practical use, but the underlying approach — fusing depth sensing, object recognition, and sonification into an exploratory tactile interface — foreshadows more sophisticated systems now emerging with improved computer vision and machine learning capabilities.
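The three design principles named above (distinct cues per class, spatial audio for position, the touchscreen as a proxy for space) can be combined in a small sketch. Everything here is hypothetical: the class-to-instrument table, the 40-point touch radius, and the use of constant-power stereo panning as a simple stand-in for the spatial virtual sound sources the paper describes.

```python
import math

# Hypothetical pairings; the source does not list which instrument goes with which class.
INSTRUMENT_FOR_CLASS = {"cup": "piano", "bottle": "flute", "box": "guitar"}


def object_under_finger(touch, objects, radius=40.0):
    """Return the nearest object within `radius` screen points of the touch, or None."""
    best, best_dist = None, radius
    for obj in objects:
        dist = math.hypot(touch[0] - obj["x"], touch[1] - obj["y"])
        if dist <= best_dist:
            best, best_dist = obj, dist
    return best


def stereo_gains(x, screen_w=1024):
    """Constant-power pan: left screen edge is fully left, right edge fully right."""
    pan = min(max(x / screen_w, 0.0), 1.0)
    angle = pan * math.pi / 2
    return math.cos(angle), math.sin(angle)


def sound_for_touch(touch, objects):
    """Pick the instrument and left/right gains for the touched object, if any."""
    obj = object_under_finger(touch, objects)
    if obj is None:
        return None
    left, right = stereo_gains(obj["x"])
    return {"instrument": INSTRUMENT_FOR_CLASS[obj["label"]],
            "left": left, "right": right}


# Two recognised objects placed on the top view, then a touch near the first one.
scene = [{"label": "cup", "x": 512, "y": 384},
         {"label": "bottle", "x": 896, "y": 192}]
cue = sound_for_touch((520, 380), scene)
```

Stereo panning conveys only left-right position; a fuller spatialisation would also need to encode distance, which this sketch leaves out.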

Tags: sonification · visual substitution · Kinect · computer vision · blindness · object recognition · spatial audio · tactile exploration · electronic travel aid · sensory substitution