Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots

Jason Wu, Xiaoyi Zhang, Jeff Nichols, Jeffrey P. Bigham · 2021 · The 34th Annual ACM Symposium on User Interface Software and Technology (UIST) · doi:10.1145/3472749.3474763

Summary

This paper introduces screen parsing, the task of predicting UI elements and their hierarchical relationships from a screenshot alone. Prior element-detection work could find individual UI elements on a screen but produced flat lists with no information about how the elements relate to each other. Screen parsing goes further by reconstructing the full UI hierarchy: the tree structure that describes how elements are grouped into containers, how containers nest within each other, and the semantic relationships between components. This structural information is critical for screen reader navigation, where users need to understand not just what elements exist but how they are organized. The system works in three stages: (i) a Faster-RCNN object detection model identifies UI elements and their types, (ii) a stack-based transition parser built on a bidirectional LSTM encoder-decoder architecture predicts the hierarchical relationships between those elements, and (iii) a Deep Averaging Network classifier labels element groups (e.g., Tab Bar, Table, Collection, Toolbar). The approach draws on techniques from natural language processing, adapting dependency parsing methods to work with the spatial layout of UIs rather than sequential text.
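To make the idea of a stack-based transition parser concrete, the sketch below shows how a sequence of actions can turn a flat list of detected elements into a tree. It is a simplified illustration, not the paper's transition system: the action names (SHIFT, OPEN:<label>, CLOSE), the Node class, and the helper functions are all hypothetical, and the action sequence is supplied by hand, whereas in the paper it is predicted by the learned model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node in the reconstructed UI tree (hypothetical structure)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def build_tree(elements: List[str], actions: List[str]) -> Node:
    """Apply a sequence of parser actions to a flat list of detected elements.

    Assumed action set (for illustration only):
      SHIFT        attach the next detected element to the open container
      OPEN:<label> start a new container under the open container
      CLOSE        finish the container on top of the stack
    """
    root = Node("Root")
    stack = [root]            # containers that are currently open
    queue = list(elements)    # detected elements not yet placed in the tree

    for action in actions:
        if action == "SHIFT":
            if not queue:
                raise ValueError("SHIFT with no elements left")
            stack[-1].children.append(Node(queue.pop(0)))
        elif action.startswith("OPEN:"):
            group = Node(action.split(":", 1)[1])
            stack[-1].children.append(group)
            stack.append(group)
        elif action == "CLOSE":
            if len(stack) == 1:
                raise ValueError("CLOSE would remove the root")
            stack.pop()
        else:
            raise ValueError(f"unknown action: {action}")

    if queue or len(stack) != 1:
        raise ValueError("action sequence did not consume all elements and containers")
    return root


def to_sexpr(node: Node) -> str:
    """Render the tree as a compact bracketed string."""
    if not node.children:
        return node.label
    inner = " ".join(to_sexpr(child) for child in node.children)
    return f"({node.label} {inner})"


# Flat detector output, in reading order (made-up example).
elements = ["Text", "Button", "Icon", "Icon"]
actions = ["OPEN:Toolbar", "SHIFT", "SHIFT", "CLOSE",
           "OPEN:TabBar", "SHIFT", "SHIFT", "CLOSE"]

tree = build_tree(elements, actions)
print(to_sexpr(tree))  # (Root (Toolbar Text Button) (TabBar Icon Icon))
```

The point of the sketch is the division of labor: the detector supplies the flat element list, and the parser's only job is to choose the next action at each step, which is what makes techniques from transition-based dependency parsing applicable.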

Key findings

The screen parsing system outperformed existing baselines by up to 23%, depending on the metric, in evaluations on both the AMP (130K iOS screens) and RICO (80K Android screens) datasets. On AMP, the system achieved an F1 score of 0.60 for overall tree structure and 0.67 for leaf-level groupings, with a container match score of 0.63. A key technical contribution was the dynamic oracle training procedure, which exposes the model to all optimal action sequences rather than just one canonical sequence, significantly improving performance over the static oracle approach. The paper demonstrates three downstream applications: (i) UI similarity search using structural embeddings that are invariant to visual theme changes; (ii) accessibility enhancement, where screen parsing produces better element groupings than the heuristic-based Screen Recognition and so reduces the number of screen reader swipes needed for navigation; and (iii) code generation from screenshots that produces responsive SwiftUI code using relative positioning rather than absolute coordinates.
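For readers unfamiliar with F1 as a structural metric, the sketch below computes it over two sets of parent-child edges. This is an assumption made for illustration: the summary does not specify exactly which tree components the paper compares, and the edge names and the f1_score helper are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (parent, child); hypothetical representation


def f1_score(predicted: Set[Edge], reference: Set[Edge]) -> float:
    """Harmonic mean of precision and recall over matching edges."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the prediction attaches "title" to the wrong parent.
reference = {("root", "toolbar"), ("toolbar", "back"),
             ("toolbar", "title"), ("root", "list")}
predicted = {("root", "toolbar"), ("toolbar", "back"),
             ("root", "title"), ("root", "list")}

score = f1_score(predicted, reference)
print(round(score, 2))  # 0.75
```

Here three of the four predicted edges match the reference, so precision and recall are both 0.75 and the F1 score is 0.75. A single misplaced element lowers the score without zeroing it, which is why an F1-style measure suits partially correct hierarchies.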

Relevance

Screen parsing directly addresses one of the most persistent problems in mobile accessibility: apps that lack proper accessibility metadata. When developers fail to expose view hierarchies or accessibility labels, screen readers like VoiceOver and TalkBack cannot provide meaningful navigation. Screen parsing offers a vision-based fallback that can reconstruct the structural information screen readers need (element groupings, navigation order, and container relationships) purely from what is visible on screen. The accessibility enhancement demonstration shows a concrete benefit: better element groupings mean fewer swipes for screen reader users to reach their target content. The approach complements existing systems like Screen Recognition, which uses heuristics to group detected elements, by providing a learned model with global context that overcomes the limitations of local pattern-matching heuristics. The work is foundational for building accessibility repair tools that can automatically generate missing metadata for any app.

Tags: screen readers · mobile accessibility · computer vision · UI semantics · machine learning · accessibility metadata · reverse engineering