Judge: Effective State Abstraction for Guiding Automated Web GUI Testing

Chenxu Liu, Junheng Wang, Wei Yang, Ying Zhang, Tao Xie · 2026 · ACM Transactions on Software Engineering and Methodology · doi:10.1145/3736162

Summary

This paper presents Judge, a state abstraction approach for automated web GUI testing (AWGT). AWGT tools explore web applications by performing GUI actions and building a state model that guides further exploration, with the aim of maximising code coverage within a fixed time budget. Effective state abstraction groups pages that behave the same from a testing perspective into a single state, which keeps testing tools from getting stuck in loops or re-exploring the same areas. The core challenge Judge addresses is that existing approaches rely either on DOM-based or visual similarity metrics with predefined thresholds or on learning-based classifiers, and both struggle because functionally identical pages can differ substantially in structure. The paper identifies four causes: dynamically loaded data (text and images that change between visits); dynamically generated HTML attributes, such as unique IDs generated by frameworks like React or Ember.js; dynamic element reordering, where DOM elements appear in different positions because of asynchronous loading; and extendable elements, where lists and grids vary in the number of items displayed. Judge uses a merge-and-classify strategy. In the merge phase, a tree-hash-based algorithm traverses the DOM, strips text content and HTML attributes, and collapses sibling elements that share identical subtree structures into one. In the classify phase, the simplified DOMs are converted to vector embeddings by a contrastive learning model with BERT or Longformer encoders, and the embeddings are classified by an SVM. The paper cites the W3C HTML Living Standard and WCAG 2.1 semantic tagging conventions as rationale for relying on HTML tag structure. Evaluated on datasets with approximately 100,000 page pairs, Judge consistently outperforms 13 baseline approaches.
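
The merge phase described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Node` class, the `merge` and `to_string` helpers, the use of SHA-256, and the choice to collapse all same-hash siblings (rather than only adjacent ones) are assumptions made for illustration. The sketch only shows the general idea of hashing tag structure bottom-up, dropping text and attributes, and keeping one representative per group of structurally identical siblings.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node used only for this illustration."""
    tag: str
    children: list[Node] = field(default_factory=list)
    attrs: dict = field(default_factory=dict)
    text: str = ""


def merge(node: Node) -> tuple[Node, str]:
    """Return a simplified copy of `node` and a hash of its tag structure.

    Text and attributes are dropped. Among siblings whose simplified
    subtrees hash identically, only the first one is kept (an assumption).
    """
    kept_children: list[Node] = []
    kept_hashes: list[str] = []
    seen: set[str] = set()
    for child in node.children:
        simplified, child_hash = merge(child)
        if child_hash in seen:
            continue  # structurally identical sibling already kept
        seen.add(child_hash)
        kept_children.append(simplified)
        kept_hashes.append(child_hash)
    # The hash depends only on the tag and the kept children's hashes.
    signature = node.tag + "(" + ",".join(kept_hashes) + ")"
    digest = hashlib.sha256(signature.encode("utf-8")).hexdigest()
    return Node(node.tag, kept_children), digest


def to_string(node: Node) -> str:
    """Serialise a simplified tree as bare tags."""
    inner = "".join(to_string(c) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


# Two lists that differ in length, text, and generated IDs.
short_list = Node("ul", [Node("li", [Node("a", text="x")]) for _ in range(2)])
long_list = Node(
    "ul",
    [Node("li", [Node("a", text="y", attrs={"id": "ember42"})]) for _ in range(5)],
)

print(merge(short_list)[1] == merge(long_list)[1])  # True
print(to_string(merge(long_list)[0]))  # <ul><li><a></a></li></ul>
```

Because the hash is computed after the children have been merged, a list with two items and a list with five items reduce to the same simplified tree, which is how a sketch of this kind would absorb the "extendable elements" and "dynamically generated attributes" cases listed above. Element reordering is not handled by this sketch.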

Key findings

Judge outperforms all 13 baseline approaches on three manually labelled datasets in macro-averaged F1 score, with improvements ranging from 8.95% to 28.90% over threshold-based baselines and 8.95% over WebEmbed, the strongest deep-learning baseline. On the held-out SS test set, improvements over baselines range from 6.92% to 99.32%. Applied on its own as a preprocessing step to existing baselines, the structure-merging algorithm improved their F1 scores by an average of 11.4% to 17.0% for threshold-based methods and by 4.8% for a GNN approach. In AWGT guidance experiments on six open-source web applications, Judge improved JavaScript branch coverage by an average of 2.62% to 14.12% compared with the five most effective competing approaches. The structure-merging algorithm reduced average DOM length by 94.1% on the training dataset, which makes the embedding step far more efficient. Judge was 99.5% and 97.1% faster than the Levenshtein and RTED algorithms, respectively. Ablation studies confirmed that the structure-merging phase and the contrastive learning phase each contribute meaningfully. Judge is the only approach tested that achieves stable, high coverage across all six web applications, demonstrating strong generalisability.
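
For readers unfamiliar with the headline metric, the following sketch shows how a macro-averaged F1 score is computed in general: F1 is calculated per class and the per-class scores are averaged with equal weight, so a rare class counts as much as a common one. The `macro_f1` function and the example labels (1 for "same state", 0 for "different state") are illustrative assumptions, not the paper's evaluation code or data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four page pairs.
truth = [1, 1, 1, 0]
predicted = [1, 1, 0, 0]
print(round(macro_f1(truth, predicted), 4))  # 0.7333
```

In this example the "different state" class scores 2/3 and the "same state" class scores 0.8, giving a macro average of about 0.7333.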

Relevance

Although the paper focuses on software testing rather than on accessibility, Judge has meaningful implications for accessibility evaluation at scale. Automated web GUI testing tools explore web applications for defects, and the same crawling infrastructure can be configured to check for accessibility violations during exploration. Judge's more accurate state abstraction keeps crawlers from looping and gives broader application coverage within a fixed time budget, which in turn allows accessibility checks to reach more application states and so to find more issues. The paper explicitly references WCAG 2.1 and W3C HTML semantics as a design rationale, noting that semantic HTML tags carry behavioural meaning that structural analysis can exploit. For accessibility practitioners considering automated testing pipelines, Judge illustrates that the quality of state modelling is a critical upstream factor in how thoroughly a tool can audit a web application. Limitations include reduced effectiveness on canvas-heavy or heavily obfuscated frontends, and datasets drawn primarily from open-source applications, which may not fully represent enterprise accessibility contexts.

Tags: automated testing · web accessibility · GUI testing · DOM · machine learning · contrastive learning · state abstraction · accessibility evaluation

Standards referenced: WCAG 2.1 · W3C HTML Living Standard