UIClip: A Data-driven Model for Assessing User Interface Design

Jason Wu, Yi-Hao Peng, Xin Yue Amanda Li, Amanda Swearngin, Jeffrey P. Bigham, Jeffrey Nichols · 2024 · Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST '24) · doi:10.1145/3654777.3676408

Summary

This paper introduces UIClip, a computational model that scores UI design quality and visual relevance from a screenshot and a natural language description. UIClip builds on OpenAI's CLIP B/32 architecture (151 million parameters) and is fine-tuned on a new large-scale dataset of 2.3 million UI screenshots paired with quality-annotated descriptions. The training data comes from synthetic augmentation: the authors crawled nearly 300,000 web pages, took screenshots across desktop, tablet, and mobile viewports, then systematically degraded design quality using "jitter functions", JavaScript snippets that introduce controlled design defects based on the CRAP visual design principles (Contrast, Repetition, Alignment, Proximity). The jitter functions include color swaps, color noise, font size swaps, text contrast reduction, background color changes, spacing noise, complexity reduction (removing elements), and layout modifications. Each original UI was paired with its degraded versions, so every pair carries a known ranking for training. To supplement the synthetic data, the authors collected ratings from 12 designers, who compared approximately 1,200 pairs of real-world UI screenshots from the VINS dataset and LLM-generated UIs. The result is the BetterApp dataset of 892 quality-ranked pairs with designer-written captions and CRAP principle annotations.
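
The jitter idea can be illustrated with a small sketch. This is not the authors' code: the paper's jitter functions are JavaScript snippets applied to live web pages, whereas the Python below works on a hypothetical, simplified style dictionary. The names `reduce_text_contrast` and `make_ranked_pair`, the RGB representation, and the `strength` parameter are all assumptions made for illustration.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]


def reduce_text_contrast(style: Dict[str, RGB], strength: float = 0.8) -> Dict[str, RGB]:
    """Return a copy of `style` with the text color blended toward the background.

    Hypothetical stand-in for a "text contrast reduction" jitter:
    strength=0 leaves the text unchanged, strength=1 makes it match the background.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    text, background = style["color"], style["background"]
    blended = tuple(round(t + (b - t) * strength) for t, b in zip(text, background))
    return {**style, "color": blended}


def make_ranked_pair(style: Dict[str, RGB]) -> Dict[str, Dict[str, RGB]]:
    """Pair an original style with its degraded copy; the original is the preferred one."""
    return {"preferred": style, "rejected": reduce_text_contrast(style)}


# Black text on a white background, then the same UI with washed-out text.
original = {"color": (0, 0, 0), "background": (255, 255, 255)}
pair = make_ranked_pair(original)
```

Because the defect is introduced deliberately, the ranking within each pair is known without any human labeling, which is what lets this kind of data be generated at scale.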

Key findings

UIClip achieved 75.12% average overall accuracy in identifying the preferred UI from a pair, substantially outperforming all baselines, including the proprietary LVLMs GPT-4V (52.9%), Claude-3-Opus (60.3%), and Gemini-1.0-Pro (54.6%) and open models such as LLaVA-1.6-13B (51.4%). UIClip was particularly strong at detecting design defects in web pages (87.1% accuracy on JitterWeb test data). For design suggestion generation, UIClip achieved the highest macro-F1 scores across the four CRAP principles. Notably, all the large vision-language models (LVLMs) performed poorly on design quality assessment, around the level of random guessing (50% for a pairwise choice), despite being orders of magnitude larger than UIClip. GPT-4V also declined to respond to roughly 10% of examples. A qualitative analysis found that current LLMs produce "realistic-sounding but inaccurate reasoning" when assessing UI quality: for example, GPT-4V erroneously judged a jittered screenshot with overflowing text to be better designed than the original. The paper demonstrates three downstream applications: quality-aware UI code generation (ranking multiple LLM-generated implementations by design quality), UI design tips generation (uploading a screenshot to receive actionable design warnings), and quality-aware UI example retrieval (searching for well-designed UI examples by description).
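
The pairwise accuracy figures above can be read as the share of pairs in which a model gives the human-preferred UI the higher score. The sketch below shows that calculation under this assumption. The function name `pairwise_accuracy` and the scores are made up for illustration and are not taken from the paper's data or evaluation code.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of pairs where the preferred UI scores strictly higher than the rejected one.

    Each item is (score_of_preferred_ui, score_of_rejected_ui).
    """
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


# Illustrative scores only: the model ranks three of the four pairs correctly.
scores = [(0.9, 0.2), (0.4, 0.6), (0.7, 0.1), (0.5, 0.3)]
accuracy = pairwise_accuracy(scores)  # 3 correct out of 4 pairs
```

A scorer that picks at random would land near 0.5 on this metric, which is the reference point for the LVLM results.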

Relevance

Although UIClip targets visual design quality rather than accessibility specifically, it has significant implications for accessibility practitioners. Several of the design defects the model detects (poor text contrast, readability issues, inconsistent font sizing, layout problems) overlap directly with accessibility violations. The model's ability to flag contrast and readability problems automatically and at scale could complement traditional accessibility checkers, which focus on code-level compliance rather than visual design quality. The finding that even the most capable LVLMs fail at UI quality assessment is important context for anyone relying on AI for accessibility evaluation: visual design judgment remains challenging for general-purpose models and requires domain-specific training. The CRAP principles framework used in the paper (contrast, repetition, alignment, proximity) aligns with accessibility principles around visual clarity and organisation. The jitter function approach to synthetically generating design defects could be adapted to create accessibility-focused training datasets. For organisations using LLM-based code generation for UIs, UIClip's "best-of-n" selection approach, which generates multiple versions and selects the highest-quality one, offers a practical strategy for improving design quality and, potentially, the accessibility compliance of generated interfaces.
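
The "best-of-n" selection step is simple enough to sketch. The code below assumes some scoring function that returns a higher number for a better design. Here that function is a stub backed by hard-coded values, not UIClip itself, and the names `best_of_n`, `candidates`, and `stub_scores` are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score_fn: Callable[[T], float]) -> T:
    """Return the candidate with the highest score (the first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)


# Hypothetical file names for three generated implementations of the same UI.
candidates = ["layout_a.html", "layout_b.html", "layout_c.html"]

# Stub scores standing in for a design-quality model's output.
stub_scores = {"layout_a.html": 0.41, "layout_b.html": 0.78, "layout_c.html": 0.63}

best = best_of_n(candidates, stub_scores.__getitem__)
```

In practice the scorer would render each candidate to a screenshot and score it against the description. The selection logic stays the same, so a different or accessibility-focused scorer could be swapped in without changing it.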

Tags: UI design · machine learning · design quality assessment · computer vision · CLIP · visual design · design tools · user interface · automated evaluation