Never-ending Learning of User Interfaces

Jason Wu, Rebecca Krosnick, Eldon Schoop, Amanda Swearngin, Jeffrey P. Bigham, Jeffrey Nichols · 2023 · Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST) · doi:10.1145/3586183.3606824

Summary

This paper introduces the Never-ending UI Learner, an automated system that continuously crawls real mobile applications to learn semantic properties of user interfaces. It addresses a basic limitation of current approaches to UI understanding: most machine learning models are trained on static screenshots labeled by human annotators, a process that is expensive, slow, and error-prone. For example, an annotator asked whether a UI element is tappable must guess from visual cues in the screenshot, with no way to interact with the element. The Never-ending UI Learner instead installs real apps from the iOS App Store onto remotely controlled devices and crawls them by performing actual interactions (taps, drags, and swipes) and observing what happens. Heuristics that compare before-and-after screenshots turn these observations into labels, producing more reliable ground truth than human annotation of static images. The system uses a distributed coordinator-worker architecture with 40 to 100 crawler workers connected to real mobile devices via VNC. Notably, the crawler deliberately avoids the accessibility API and interacts visually, as a sighted user would, which makes the approach platform-agnostic and lets it learn from the same interface a user sees. Over the course of the experiments, the system crawled for more than 5,000 device-hours and performed over half a million actions across 6,000 apps, a scale an order of magnitude larger than existing human-annotated UI datasets.
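The idea of labeling by interaction can be illustrated with a minimal sketch. It is not the authors' implementation: the screenshot format (rows of grayscale pixel values), the function names, and both thresholds are assumptions chosen for illustration, and the paper's actual heuristics may differ considerably.

```python
from typing import List

# Hypothetical screenshot format: rows of grayscale pixel values in 0-255.
Screenshot = List[List[int]]


def changed_fraction(before: Screenshot, after: Screenshot, pixel_tol: int = 8) -> float:
    """Return the fraction of pixels whose intensity changed by more than pixel_tol."""
    same_shape = len(before) == len(after) and all(
        len(row_b) == len(row_a) for row_b, row_a in zip(before, after)
    )
    if not same_shape:
        raise ValueError("screenshots must have identical dimensions")
    total = sum(len(row) for row in before)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for row_b, row_a in zip(before, after)
        for pixel_b, pixel_a in zip(row_b, row_a)
        if abs(pixel_b - pixel_a) > pixel_tol
    )
    return changed / total


def label_tappable(before: Screenshot, after: Screenshot, min_changed: float = 0.05) -> bool:
    """Label a tapped element as tappable if the screen visibly changed after the tap.

    min_changed is an illustrative threshold, not a value taken from the paper.
    """
    return changed_fraction(before, after) >= min_changed


# Usage: a 10x10 screen where a tap repaints the top two rows.
before_tap = [[0] * 10 for _ in range(10)]
after_tap = [[255] * 10 for _ in range(2)] + [[0] * 10 for _ in range(8)]
tap_label = label_tappable(before_tap, after_tap)
```

A real crawler would also need to handle animations, loading delays, and content that changes without any user input, which this sketch ignores.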

Key findings

The system trained three computer vision models entirely or partially from crawler-generated data. The tappability model achieved an F1 score of 0.860 after five crawl epochs, comparable to models trained on human-annotated data (F1 = 0.81), with the added advantage that its labels are validated against actual interaction outcomes. The draggability model reached an F1 score of 0.794; because no labeled draggability datasets existed previously, it is the first dataset and model for this interaction type. Fine-tuning with crawler-mined examples improved the screen similarity model from an F1 score of 0.636 to 0.663. When models trained on human-annotated data were evaluated on crawler-labeled data, performance dropped markedly (F1 = 0.60), which suggests that the human annotations were less reliable than the interaction-based labels. The authors also found that random crawling was a surprisingly strong strategy that often outperformed uncertainty sampling, while a hybrid approach worked best for draggability prediction.
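For readers unfamiliar with the terms used here, the sketch below shows how an F1 score is computed for binary labels and how random, uncertainty-based, and hybrid selection of elements to interact with might differ. The function names, the "closest to 0.5" definition of uncertainty, and the `random_share` parameter are assumptions for illustration; the summary does not specify how the paper defines its hybrid strategy.

```python
import random
from typing import List, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_random(n_candidates: int, k: int, rng: random.Random) -> List[int]:
    """Choose up to k candidate elements uniformly at random."""
    return rng.sample(range(n_candidates), min(k, n_candidates))


def pick_uncertain(probs: Sequence[float], k: int) -> List[int]:
    """Choose the k elements whose predicted probability is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


def pick_hybrid(probs: Sequence[float], k: int, rng: random.Random,
                random_share: float = 0.5) -> List[int]:
    """Hypothetical hybrid: fill part of the budget by uncertainty, the rest at random."""
    n_random = int(k * random_share)
    chosen = pick_uncertain(probs, k - n_random)
    already = set(chosen)
    remaining = [i for i in range(len(probs)) if i not in already]
    chosen += rng.sample(remaining, min(n_random, len(remaining)))
    return chosen


# Usage: model confidence that each of four on-screen elements is tappable.
tappable_probs = [0.9, 0.55, 0.1, 0.45]
most_uncertain = pick_uncertain(tappable_probs, 2)
hybrid_choice = pick_hybrid(tappable_probs, 3, random.Random(0))
example_f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

Uncertainty sampling concentrates interactions on elements the model is unsure about, whereas random selection covers the screen more evenly, which may be one reason random crawling remains competitive.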

Relevance

This research has significant implications for accessibility. Accurate tappability and draggability prediction can directly improve screen reader support by identifying interactive elements that lack proper accessibility metadata, a pervasive problem in mobile apps. The system could automatically detect UI elements that look tappable but are not (or vice versa), helping to identify accessibility barriers at scale without manual auditing. Because learning never ends, models can keep adapting to evolving app design trends, which keeps accessibility tools current. Although the current implementation focuses on iOS, the visual interaction approach is platform-agnostic and could extend to Android and web interfaces. The work also shows that human annotation of UI properties can be unreliable, reinforcing the need for interaction-based testing in accessibility evaluation.

Tags: machine learning · mobile accessibility · UI semantics · automated crawling · tappability · screen readers · computer vision