Face Tracking User Interfaces Using Vision-Based Consumer Devices
Norman H. Villaroman · 2012 · Proceedings of the 14th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '12) · doi:10.1145/2384916.2385002
Summary
This short paper investigates the design and implementation of face tracking user interfaces built with consumer-grade vision and depth-sensing devices, specifically the Microsoft Kinect. The research addresses the needs of users who cannot effectively use traditional hand-operated input devices such as a mouse and keyboard but retain sufficient head and facial control. The author identifies a core challenge: natural face movements span only a small region of a camera sensor, yet modern applications use high-resolution screens, creating an accuracy gap that is compounded by inherent imprecision in consumer-grade detection and tracking algorithms.
The study establishes explicit design requirements for a usable face tracking interface, including complete user independence after initial setup, robustness to natural variance in face position, automatic or minimal calibration, control via facial movements alone (no torso movement required), and accuracy sufficient to avoid frustrating repeated attempts at simple input operations.
The implementation uses a C++ application running on Windows 7 with a Kinect depth sensor. An on-screen keyboard provides text input, and directional face-point positioning is used for cursor-like control, with dwell-time activation for selections. The sides of the screen serve as calibrating edges to accommodate varying face positions. Multiple design options were implemented and evaluated to address these requirements.
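One way to read the screen-edge calibration described above is as a sliding window over sensor coordinates: when the face point moves past the currently mapped range, the cursor stays pinned to the screen edge and the range shifts with it. The sketch below illustrates that reading in Python for a single axis. It is not the paper's C++ implementation; the class name, the `span` parameter, and the sliding-window rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class EdgeCalibratedAxis:
    """Maps one sensor-space coordinate to a screen pixel along one axis.

    Hypothetical helper: the names and the sliding-window rule are
    illustrative assumptions, not taken from the paper.
    """

    screen_size: int      # screen extent in pixels along this axis
    span: float           # assumed comfortable face-movement range, in sensor units
    origin: float = 0.0   # sensor value currently mapped to screen pixel 0

    def to_screen(self, sensor_value: float) -> int:
        # If the face point leaves the mapped window, slide the window so
        # that the nearest screen edge stays anchored to the face point.
        if sensor_value < self.origin:
            self.origin = sensor_value
        elif sensor_value > self.origin + self.span:
            self.origin = sensor_value - self.span

        # Linear map from the window [origin, origin + span] to pixels.
        fraction = (sensor_value - self.origin) / self.span
        return round(fraction * (self.screen_size - 1))
```

With hypothetical numbers, a 40-unit face range mapped onto a 1920-pixel axis means each sensor unit moves the cursor by roughly 48 pixels, which shows why small tracking errors become large cursor jumps. A second instance would handle the vertical axis.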
Key findings
The research demonstrates that consumer-grade depth sensors like the Kinect can serve as the basis for a hands-free computer interface controlled entirely by natural face movements. Key design findings include the importance of screen-edge calibration to handle the inherent imprecision of face tracking at high screen resolutions. The dwell-time approach, in which the user holds the face point in a screen area for a few seconds, proved a viable mechanism for triggering actions such as toggling cursor control or activating an on-screen keyboard. The study identifies accuracy as the critical make-or-break factor: erratic responses from imprecise tracking can quickly render the interface unusable. The literature review conducted as part of the research found that, although related work by Varona et al. existed, methodologies and results differed substantially, and significant unresolved challenges remained across the field. The paper notes that usability testing was planned but not yet completed at the time of publication.
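The dwell-time mechanism can be illustrated with a small state machine that fires once when the face point has stayed on the same target long enough. This is a minimal Python sketch under stated assumptions, not the paper's code: the class name, the string target labels, and the 2-second default are hypothetical, and the paper says only that the dwell lasts "a few seconds".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DwellSelector:
    """Fires a selection after the pointer rests on one target long enough.

    Hypothetical helper: the names and the default threshold are
    illustrative assumptions, not values from the paper.
    """

    dwell_seconds: float = 2.0
    _target: Optional[str] = None   # target currently under the face point
    _entered_at: float = 0.0        # timestamp when that target was entered
    _fired: bool = False            # whether this dwell has already triggered

    def update(self, target: Optional[str], now: float) -> Optional[str]:
        """Feed the target under the face point; return it when the dwell completes."""
        if target != self._target:
            # The pointer moved to a different target (or off all targets):
            # restart the timer.
            self._target = target
            self._entered_at = now
            self._fired = False
            return None

        held_long_enough = now - self._entered_at >= self.dwell_seconds
        if target is not None and not self._fired and held_long_enough:
            # Fire once per dwell so a long hold does not repeat the action.
            self._fired = True
            return target
        return None
```

A caller would invoke `update` once per tracked frame, passing the on-screen region under the face point (or `None`) and a monotonic timestamp such as `time.monotonic()`. Firing only once per dwell is a design choice made here to avoid repeated key presses while the user holds still; leaving and re-entering the target starts a new dwell.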
Relevance
This early exploration of consumer depth sensors for accessible input is historically significant because it appeared when the Kinect was newly available and showed promise for affordable assistive technology. The design requirements articulated in the paper (independence, robustness, minimal calibration, and accuracy) remain directly relevant to anyone developing head- or face-controlled interfaces today. Although the Kinect has been discontinued, the same principles apply to modern webcam-based face tracking systems and emerging AI-powered head tracking solutions. The work highlights a persistent tension in assistive input: consumer hardware is affordable and non-intrusive, but achieving the accuracy needed for reliable daily use remains challenging. Practitioners developing alternative input methods should consider the explicit usability criteria this paper defines as a baseline checklist.
Tags: face tracking · head tracking · assistive technology · alternative input · depth sensing · Kinect · motor disabilities · perceptual user interface