WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics

Jason Wu, Siyan Wang, Siman Shen, Yi-Hao Peng, Jeffrey Nichols, Jeffrey P. Bigham · 2023 · Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems · doi:10.1145/3544548.3581158

Summary

This paper introduces WebUI, a large-scale dataset of approximately 400,000 web pages automatically crawled and paired with visual, semantic, and stylistic metadata extracted from the browser engine. The dataset addresses a critical bottleneck in UI understanding research: existing datasets such as Rico (72K Android screens) are expensive to create, rely heavily on manual annotation, and become outdated as design trends evolve. WebUI exploits the fact that web pages are authored in HTML, which naturally exposes semantic information through tags, ARIA attributes, and the browser-generated accessibility tree, providing a rich, automatically extractable source of UI annotations. Each web page was rendered on six simulated devices (four desktop resolutions, a smartphone, and a tablet) to capture responsive layout variations, and the collected metadata includes the accessibility tree, computed styles, and bounding boxes for each element. The dataset is an order of magnitude larger than existing UI datasets and cost only about $500 to collect over three months. The authors demonstrate that this web data can improve the performance of visual UI understanding models in the mobile domain through three transfer learning strategies: inductive transfer learning for element detection, semi-supervised self-training for screen classification, and unsupervised domain adaptation for screen similarity.
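
The semi-supervised self-training strategy can be illustrated with a generic pseudo-labeling step. The sketch below is not the authors' implementation: the `pseudo_label` function, the 0.9 confidence threshold, and the toy probabilities are all assumptions chosen for illustration. A teacher model trained on the small labeled set scores unlabeled screens, and only confident predictions are kept as additional training examples for a student model.

```python
from typing import List, Sequence, Tuple


def pseudo_label(
    probs: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[int, int]]:
    """Return (sample_index, predicted_class) for confident predictions.

    probs[i] holds a teacher model's class probabilities for the i-th
    unlabeled sample.
    """
    selected = []
    for i, p in enumerate(probs):
        # Pick the class with the highest predicted probability.
        best_class = max(range(len(p)), key=lambda c: p[c])
        # Keep the sample only if the teacher is confident enough.
        if p[best_class] >= threshold:
            selected.append((i, best_class))
    return selected


# Toy teacher outputs for three unlabeled screens and three classes.
teacher_probs = [
    [0.95, 0.03, 0.02],  # confident: kept with class 0
    [0.40, 0.35, 0.25],  # uncertain: dropped
    [0.05, 0.02, 0.93],  # confident: kept with class 2
]
pseudo_labels = pseudo_label(teacher_probs)
```

In a full pipeline the selected pairs would be merged with the labeled data before training the student, and the cycle could be repeated with the student acting as the next teacher.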

Key findings

Pre-training element detection models on WebUI before fine-tuning on mobile data improved mean average precision (mAP) from 0.77 to 0.81, with the largest WebUI split (350K) achieving the best results. For screen classification, incorporating unlabeled WebUI data through self-training improved accuracy by 5% over the baseline trained only on the small Enrico dataset (1,460 samples). Screen similarity models trained on web data achieved an F1 score of 0.95 and could be applied directly to mobile app screens without any mobile training data, demonstrating strong zero-shot cross-domain transfer. Models trained only on web data could detect familiar element types (text, images, buttons) on Android screens with reasonable accuracy, suggesting shared visual patterns across platforms. The study also found that one-third of interactive web elements were smaller than the 44×44 CSS pixel target size that WCAG recommends, and that class-balanced resampling of the training data consistently improved performance across all three tasks.
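
The target-size finding amounts to a simple check over element bounding boxes. The following is a minimal sketch, assuming each interactive element is available as a width and height in CSS pixels; the `Box` type, the function names, and the sample boxes are hypothetical and do not reflect the dataset's actual schema. An element is treated as undersized if either dimension is under 44 pixels, which is one reasonable reading of the criterion rather than the authors' stated method.

```python
from dataclasses import dataclass
from typing import Iterable

MIN_TARGET_PX = 44.0  # 44x44 target size cited in the summary


@dataclass(frozen=True)
class Box:
    """Rendered size of one interactive element (hypothetical schema)."""
    width: float
    height: float


def is_below_target_size(box: Box, minimum: float = MIN_TARGET_PX) -> bool:
    # Undersized if either dimension falls short of the minimum.
    return box.width < minimum or box.height < minimum


def undersized_fraction(boxes: Iterable[Box]) -> float:
    """Fraction of elements smaller than the target size (0.0 if empty)."""
    boxes = list(boxes)
    if not boxes:
        return 0.0
    return sum(is_below_target_size(b) for b in boxes) / len(boxes)


# Made-up element sizes for illustration only.
sample = [Box(120, 48), Box(30, 30), Box(44, 44), Box(200, 20)]
fraction = undersized_fraction(sample)
```

Here two of the four made-up elements fall short, so `fraction` is 0.5; an element exactly 44 pixels on each side passes.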

Relevance

This work has direct implications for accessibility tool development. The accessibility tree, the core semantic data source for WebUI, is the same structure that screen readers rely on, which ties this research closely to assistive technology infrastructure. By demonstrating that web semantics can transfer to mobile UI understanding, the paper opens pathways for building cross-platform accessibility evaluation tools that leverage the web's richer semantic annotations to improve detection of UI elements in less-annotated mobile environments. The finding that models can identify clickable elements, headings, and navigation landmarks from visual appearance alone could enable accessibility auditing tools that work from screenshots, which would be useful for evaluating apps whose source code or accessibility APIs are unavailable. The dataset's automatic collection approach also means it can be continuously refreshed, keeping pace with evolving web design practices.

Tags: machine learning · computer vision · UI modeling · web semantics · transfer learning · datasets · mobile accessibility · accessibility metadata

Standards referenced: WCAG · ARIA