Eyes on the Palm: Investigating a Ring-Shaped Camera for Seamless Accessible Tactile Exploration

Ayaka Tsutsui, Xiyue Wang, Hironobu Takagi, Yoichi Ochiai, Chieko Asakawa · 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26) · doi:10.1145/3772318.3791659

Summary

Tsutsui and colleagues ask how the form factor of a camera-based assistive device shapes the way blind and low-vision (BLV) users coordinate their hands during tactile exploration of real museum exhibits. Smartphone apps such as Seeing AI and Be My AI are designed around a take-a-picture workflow that forces users to lift the device with both hands, which interrupts the very tactile engagement that underpins how BLV visitors understand exhibits. The authors prototype a ring-shaped camera worn on the proximal phalanx of the non-dominant index finger: a 6.4 g ring module containing a 920x736 miniature camera, vibration actuator, and button, tethered to a wrist module with a Zynq UltraScale+ MPSoC that streams frames over Wi-Fi. Because the camera faces outward from the palm, the user can turn their hand to 'look' at something without detaching from the exhibit. The work proceeds through a formative study (13 BLV participants to refine wearing position), a Wizard-of-Oz study (Study 1, 11 BLV participants) comparing the ring camera with an iPhone 15 Pro across four life-sized International Space Station exhibits at Miraikan, and a functional-system evaluation (Study 2, 6 BLV participants) running real-time YOLOv8 object detection, Depth Anything V2 distance estimation, and GPT-4o scene descriptions triggered by single, double, and triple button presses. Grounded in Guiard's theory of asymmetric bimanual action and Dual Coding Theory, the authors analyse how exploration, inquiry, and processing phases alternate across the two devices, using Raw-TLX workload ratings, seven-point usability Likert items, and coded observation of hand-use strategies.
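The press-count triggering described above can be illustrated with a minimal sketch. The summary does not say which press count maps to which function, nor how presses are grouped in time, so the `PRESS_ACTIONS` mapping, the `max_gap` threshold of 0.4 s, and both function names are assumptions made purely for illustration; this is not the authors' firmware.

```python
# Hypothetical mapping: the source names the three functions and the three
# press counts, but not which count triggers which function.
PRESS_ACTIONS = {
    1: "object_detection",     # e.g. YOLOv8
    2: "distance_estimation",  # e.g. Depth Anything V2
    3: "scene_description",    # e.g. GPT-4o
}


def group_presses(timestamps, max_gap=0.4):
    """Split press timestamps (seconds) into bursts.

    Two presses belong to the same burst when they are at most
    `max_gap` seconds apart (an assumed threshold).
    """
    bursts = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


def actions_for(timestamps, max_gap=0.4):
    """Translate raw press timestamps into a list of action names."""
    return [
        PRESS_ACTIONS.get(len(burst), "ignored")
        for burst in group_presses(timestamps, max_gap)
    ]


if __name__ == "__main__":
    # One single press, then a double press, then a triple press.
    print(actions_for([0.0, 1.0, 1.2, 3.0, 3.2, 3.4]))
```

Grouping by inter-press gap keeps the input one-handed and eyes-free, which fits the summary's point that the non-dominant hand should stay available for the exhibit.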

Key findings

In Study 1, the ring camera produced significantly lower perceived workload than the smartphone (Raw-TLX median 35 vs 52; W = 0, p < .001) and was rated higher on ease of use, usefulness, enjoyment, and willingness to reuse (all p < .05). Item-identification accuracy was also higher with the ring (89.9% vs 83.6%; t(10) = 2.38, p < .05). The behavioural analysis is the central contribution: with the smartphone, both hands left the exhibit to hold and aim the device in 79% of inquiry-phase trials, removing the tactile anchor; with the ring camera, at least one hand stayed anchored on the exhibit in 78.9% of inquiry trials, preserving the spatial reference frame. During the processing phase (listening to audio descriptions), at least one hand remained anchored in 94% of ring-camera trials, versus roughly 19% of smartphone trials. In Study 2, the live system achieved 88.3% YOLO detection accuracy and 73.6% accuracy on GPT-4o-generated descriptions; participants identified 75-100% of the 16 target items in the private-room exhibit and rated the system 6-7 on most usability items. The remaining failures clustered around misrecognition of visually similar objects (audio panel vs computer, comb vs zipper), blur caused by hand tremor, and distance estimates that did not match users' own perception. Participants strongly preferred button input to voice commands, citing privacy in public settings, and asked for confidence scores, re-scanning prompts, and training data co-designed with BLV users rather than collected from sighted ones.
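As a sanity check on the reported workload statistic, the sketch below enumerates the exact null distribution of the Wilcoxon signed-rank statistic. It assumes that all 11 Study 1 participants contributed a nonzero, untied difference and that W is the smaller of the two signed-rank sums; neither detail is stated in this summary, and the function name `exact_wilcoxon_p` is illustrative only.

```python
from itertools import product


def exact_wilcoxon_p(n, w_observed):
    """Exact two-sided p-value for a Wilcoxon signed-rank statistic.

    Assumes n nonzero, untied paired differences, so the ranks are 1..n
    and every sign pattern is equally likely under the null hypothesis.
    W is taken as min(W+, W-).
    """
    ranks = range(1, n + 1)
    total = n * (n + 1) // 2
    hits = 0
    for signs in product((0, 1), repeat=n):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(w_plus, total - w_plus) <= w_observed:
            hits += 1
    return hits / 2 ** n


if __name__ == "__main__":
    # Only the two extreme sign patterns give W = 0, so p = 2 / 2**11.
    print(exact_wilcoxon_p(11, 0))  # 0.0009765625
```

Under these assumptions, W = 0 means every participant rated the ring camera as less demanding than the smartphone, and the smallest attainable two-sided p-value with 11 pairs falls just below .001, which is consistent with the reported result.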

Relevance

For accessibility practitioners and researchers, this study reframes the evaluation of camera-based visual access tools. The dominant framing has been recognition accuracy or response latency; Tsutsui et al. demonstrate that the form factor itself is a first-order accessibility variable because it determines whether the user can preserve the tactile anchor that carries their spatial understanding of the object being queried. The design implications are directly reusable: immediate haptic feedback when an object is centred in frame, button-triggered explanation rather than always-on narration, a button-triggered stop to cut off overlong descriptions, a global overview mode to scaffold mental models, and distance feedback for depth-rich settings. The work also offers a practical demonstration that today's off-the-shelf pipelines (fine-tuned YOLOv8, Depth Anything V2, GPT-4o) can be chained into a live wearable device, while showing that domain-specific training data and images captured from BLV users' hand viewpoints are needed for reliable performance. Limitations include a small sample (17 participants across two studies), a single thematic domain (ISS exhibits), and evaluation limited to single sessions. The paper complements the recent wave of smart-glass and finger-worn work and provides unusually strong behavioural evidence for why wearable position matters, not just whether wearable cameras help.
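The first design implication, haptic feedback when an object is centred in frame, could be approximated as follows. This is a sketch under stated assumptions: the frame size reuses the 920x736 camera resolution mentioned in the summary, while the `(x_min, y_min, x_max, y_max)` box format, the `tolerance` value, and the function name `is_centred` are hypothetical and not taken from the paper.

```python
FRAME_W, FRAME_H = 920, 736  # camera resolution given in the summary


def is_centred(box, tolerance=0.15):
    """Return True when a detection box is roughly centred in the frame.

    `box` is (x_min, y_min, x_max, y_max) in pixels (assumed format).
    `tolerance` is the allowed offset of the box centre from the frame
    centre, as a fraction of frame width and height (assumed value).
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    return (
        abs(cx - FRAME_W / 2) <= tolerance * FRAME_W
        and abs(cy - FRAME_H / 2) <= tolerance * FRAME_H
    )


if __name__ == "__main__":
    # A box near the middle would trigger the vibration actuator ...
    print(is_centred((400, 300, 520, 440)))  # True
    # ... while a box in the top-left corner would not.
    print(is_centred((0, 0, 100, 100)))      # False
```

A check of this kind would let the vibration actuator confirm aim without any audio, so the user's anchored hand and attention stay on the exhibit.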

Tags: wearable technology · assistive technology · blindness and low vision · visual impairment · tactile exploration · bimanual interaction · museum accessibility · computer vision · image description · human-computer interaction