Transforming Web Pages to Become Standard-Compliant through Reverse Engineering

Benfeng Chen, Vincent Y. Shen · 2006 · Proceedings of the 2006 International Cross-Disciplinary Workshop on Web Accessibility (W4A) · doi:10.1145/1133219.1133223

Summary

This paper addresses the problem of transforming legacy, non-standards-compliant web pages into valid HTML with proper separation of content and presentation — a fundamental requirement for web accessibility. In 2006, over 95% of web pages failed W3C validation, largely because developers used HTML TABLE elements for page layout rather than CSS, and mixed presentational markup (FONT, B, I, U elements) with content. The authors present PURE (Page clean-Up through Reverse Engineering), a tool that takes a novel approach: rather than trying to fix invalid HTML source code directly (as HTML Tidy does, often breaking page appearance), PURE starts from the browser's rendered output and works backward. The process has three stages: preprocessing (rendering the page in a browser engine and analyzing the DOM tree structure), layout reconstruction (segmenting the page into "primary boxes" and rebuilding the layout using CSS positioning with nested DIV elements and floats), and box filling (traversing each primary box's DOM subtree to generate valid HTML while converting presentational elements to CSS). PURE uses a recursive segmentation algorithm that splits the page horizontally into rows and vertically into columns, mirroring how developers manually construct CSS layouts. The tool was implemented in C++ using Internet Explorer's MSHTML rendering engine.
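The recursive row/column segmentation described above can be illustrated with a small sketch. This is not PURE's actual implementation (which was written in C++ against MSHTML); it is a generic recursive cut over rendered bounding boxes. The `Box` type, the `split_groups` and `segment` functions, and the sample coordinates are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Hypothetical rendered rectangle, in pixels, for one primary box."""
    name: str
    left: int
    top: int
    right: int
    bottom: int

def split_groups(boxes, horizontal):
    """Group boxes that overlap along one axis; a gap starts a new group.

    horizontal=True projects boxes onto the y axis (yielding rows);
    horizontal=False projects onto the x axis (yielding columns).
    """
    if horizontal:
        span = lambda b: (b.top, b.bottom)
    else:
        span = lambda b: (b.left, b.right)
    ordered = sorted(boxes, key=span)
    groups, current = [], [ordered[0]]
    end = span(ordered[0])[1]
    for b in ordered[1:]:
        start, stop = span(b)
        if start >= end:
            # No overlap with the running group: a cut line fits here.
            groups.append(current)
            current, end = [b], stop
        else:
            current.append(b)
            end = max(end, stop)
    groups.append(current)
    return groups

def segment(boxes, horizontal=True):
    """Return a nested tree: a Box leaf, or (kind, children)."""
    if len(boxes) == 1:
        return boxes[0]
    groups = split_groups(boxes, horizontal)
    if len(groups) == 1:
        # No cut in this direction, so try the other one.
        horizontal = not horizontal
        groups = split_groups(boxes, horizontal)
        if len(groups) == 1:
            # Boxes overlap on both axes; leave them unsplit.
            return ("overlap", list(boxes))
    kind = "rows" if horizontal else "cols"
    # Alternate the cut direction at each level of the recursion.
    return (kind, [segment(g, not horizontal) for g in groups])

# Assumed sample page: header, sidebar + main content, footer.
header = Box("header", 0, 0, 800, 100)
sidebar = Box("sidebar", 0, 100, 200, 500)
main = Box("main", 200, 100, 800, 500)
footer = Box("footer", 0, 500, 800, 560)

layout = segment([footer, main, header, sidebar])
print(layout)
```

For this sample page the tree is three rows, with the middle row split into two columns. In a CSS reconstruction of the kind the summary describes, a "rows" node would correspond to stacked DIV elements and a "cols" node to DIVs floated side by side.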

Key findings

PURE was tested against the homepages of 440 of the top 500 websites (60 could not be saved locally due to server dependencies). Of these, 224 pages (51%) were successfully transformed with at least 80% visual similarity to the original, the threshold for acceptable output. Among the successful pages, 70 (16% of the 440 tested) achieved 100% similarity, 51 (12%) achieved 90%, and 103 (23%) achieved 80%. The remaining 216 pages (49%) failed, primarily because of inconsistencies between IE's MSHTML rendering engine and the W3C's DOM tree specification rather than fundamental flaws in the approach. The tool demonstrated that layout tables, used by approximately 470 of the top 500 sites at the time, could be automatically replaced with CSS-based layouts. The authors also noted that the reverse engineering approach could be extended to generate mobile-compatible versions of web pages by rearranging primary boxes to fit smaller screens.
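As a quick consistency check on the figures above, the following sketch recomputes the totals and rounded percentages using only numbers stated in this summary. The variable names are illustrative, and every share is taken relative to the 440 pages tested.

```python
# Figures taken from the summary text.
top_sites = 500
not_saved = 60
tested = top_sites - not_saved

# Pages per visual-similarity level (percent similarity -> page count).
by_similarity = {100: 70, 90: 51, 80: 103}

successes = sum(by_similarity.values())
failures = tested - successes

def share(count):
    """Percentage of all tested pages, rounded to a whole number."""
    return round(100 * count / tested)

print(f"tested={tested} successes={successes} ({share(successes)}%) "
      f"failures={failures} ({share(failures)}%)")
for level, count in by_similarity.items():
    print(f"{level}% similarity: {count} pages ({share(count)}%)")
```

The three per-level shares (16%, 12%, 23%) add up to the 51% overall success rate, which confirms they are fractions of all tested pages rather than of the successful subset.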

Relevance

While the specific technology landscape has changed dramatically since 2006 (CSS layout is now standard practice, and layout tables are largely a legacy concern), this paper captures an important moment in web accessibility history when the vast majority of the web was built with inaccessible markup patterns. The core principle remains relevant: separating content from presentation is foundational to accessibility, enabling screen readers to interpret page structure, allowing content to reflow for different devices and zoom levels, and supporting user stylesheets for personalization. The reverse engineering approach — working from rendered output rather than source code — is an interesting technique that has parallels in modern accessibility remediation tools, overlay technologies, and browser extensions that attempt to improve the accessibility of pages they cannot modify at the source. The paper also illustrates how deeply intertwined web standards compliance and accessibility are: fixing validation issues is often a prerequisite for, rather than separate from, making content accessible.

Tags: web standards · HTML validation · CSS · automated remediation · layout tables · reverse engineering · separation of concerns

Standards referenced: WCAG 1.0 · HTML 4.01 · CSS 2 · XHTML