SmartWrap: Seeing Datasets with the Crowd's Eyes
Steven Gardiner, Anthony Tomasic, John Zimmerman · 2015 · Proceedings of the 12th International Web for All Conference (W4A 2015) · doi:10.1145/2745555.2746652
Summary
This paper presents SmartWrap, a Firefox extension that lets sighted web users, including nonprogrammers, create reusable "wrappers" that extract the semantic structure of visually presented datasets on web pages, making them navigable by screen readers. The core problem is that web datasets (product listings, search results, book catalogs, etc.) are often laid out with visual CSS techniques rather than semantic HTML table markup, which makes them extremely difficult for screen reader users to understand and navigate. Screen readers attach special table navigation semantics to the HTML <table> tag, but empirically <table> correlates poorly with actual tabular data: it is overused for visual layout and underused for actual datasets. SmartWrap addresses this by building on a familiar task, manually scraping web data into a spreadsheet. Users drag content from a web page into a spreadsheet-like sidebar, filling in just two example rows. SmartWrap then uses programming-by-example techniques to generalize these examples into a reusable wrapper that extracts the full dataset and can be applied to other pages built from the same template. The tool was designed with careful attention to nonprogrammer cognition: it highlights only selectable HTML elements with blue boxes, uses concrete drag-and-drop interactions rather than abstract schema definitions, and lets users verify results by previewing the extracted table.
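The generalization from two example rows can be illustrated with a small sketch. The Python below is not SmartWrap's actual algorithm, which this summary does not describe in detail. It assumes a simplified model in which each example cell is identified by its chain of child indexes from the page root, and the two example rows differ at exactly one step (the position of the record inside its container). The page snippet and the names element_at, generalize, and extract are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: a dataset laid out with <div>/<span>, not <table>.
PAGE = """
<body>
  <h1>Books</h1>
  <div id="results">
    <div class="item"><span>Dune</span><span>Frank Herbert</span><span>$9.99</span></div>
    <div class="item"><span>Emma</span><span>Jane Austen</span><span>$7.50</span></div>
    <div class="item"><span>Ulysses</span><span>James Joyce</span><span>$11.00</span></div>
  </div>
</body>
"""


def element_at(node, path):
    """Follow a tuple of child indexes from node down to a descendant."""
    for index in path:
        node = node[index]
    return node


def generalize(row1, row2):
    """Infer (container_path, column_paths) from two example rows.

    Each row is a list of index paths, one per column. Assumption: the two
    paths for a column differ at exactly one step, the record's position.
    """
    container = None
    columns = []
    for p1, p2 in zip(row1, row2):
        if len(p1) != len(p2):
            raise ValueError("example paths must have the same depth")
        diffs = [i for i, (a, b) in enumerate(zip(p1, p2)) if a != b]
        if len(diffs) != 1:
            raise ValueError("examples must differ at exactly one step")
        k = diffs[0]
        if container is None:
            container = p1[:k]
        elif container != p1[:k]:
            raise ValueError("columns disagree on the record container")
        columns.append(p1[k + 1:])  # path from a record down to the cell
    return container, columns


def extract(root, wrapper):
    """Apply the wrapper to every child of the record container."""
    container_path, columns = wrapper
    rows = []
    for record in element_at(root, container_path):
        try:
            rows.append([element_at(record, c).text for c in columns])
        except IndexError:
            continue  # skip children that do not fit the template
    return rows


root = ET.fromstring(PAGE.strip())
# The user "drags" title, author, and price for the first two records.
row1 = [(1, 0, 0), (1, 0, 1), (1, 0, 2)]
row2 = [(1, 1, 0), (1, 1, 1), (1, 1, 2)]
wrapper = generalize(row1, row2)
rows = extract(root, wrapper)
for row in rows:
    print(row)
```

Two examples are enough here because they pin down which step of the path varies from record to record; the third record is recovered without the user ever selecting it, and the same wrapper could be reapplied to any page with the same structure.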
Key findings
A lab study with 30 participants of diverse technical backgrounds showed an 88% overall task success rate, with 72% of wrappers completed in under 5 minutes and 94% in under 10 minutes. Programmers had a higher success rate (97%) than nonprogrammers (83%), and the gap widened for complex tasks involving nesting and multiple datasets. A larger Mechanical Turk study, in which 241 workers completed 4,133 tasks, found that 45% of submitted wrappers were high quality, 38% plausible, and 17% incorrect. Cost estimates for crowdsourcing wrappers were modest: approximately $0.45 for a table-layout dataset, $[?].00 for a list, and $[?].05 for a grid (to achieve 95% confidence in at least one high-quality wrapper). The researchers estimated that 58-62% of web datasets could be successfully wrapped by crowdworkers using the tool. Notably, 427 voluntary (unpaid) uses of the tool were observed during the MTurk deployment, spanning 194 different web domains, which suggests that users were motivated either by a desire to improve accessibility or by their own scraping needs. An important finding was that the HTML <table> tag poorly represents actual dataset presence: Turkers were less likely to identify table-layout pages as non-datasets, and more likely to misidentify list pages as non-datasets.
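The "95% confidence in at least one high-quality wrapper" criterion can be made concrete with a short calculation. The sketch below is an assumption-laden illustration, not the paper's own cost model: it treats submissions as independent, uses the overall 45% high-quality rate reported above (the per-layout rates behind the dollar figures are not given in this summary), and the per-task payment is a hypothetical value.

```python
import math


def wrappers_needed(p_high_quality, confidence=0.95):
    """Smallest n such that 1 - (1 - p)^n >= confidence.

    Assumes each submitted wrapper is independently high quality
    with probability p_high_quality.
    """
    if not 0 < p_high_quality < 1:
        raise ValueError("p_high_quality must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_high_quality))


def estimated_cost(n_wrappers, price_per_task):
    """Total payment for n_wrappers submissions at a flat per-task price."""
    return n_wrappers * price_per_task


n = wrappers_needed(0.45)  # overall high-quality rate from the MTurk study
p_at_least_one = 1 - (1 - 0.45) ** n
HYPOTHETICAL_PRICE = 0.10  # dollars per task; not a figure from the paper
cost = estimated_cost(n, HYPOTHETICAL_PRICE)
print(n, round(p_at_least_one, 3), round(cost, 2))
```

With a 45% high-quality rate, five submissions fall just short of the 95% threshold, so six are needed under these assumptions. A layout with a lower high-quality rate needs more submissions, which is consistent with the cost differing by layout type.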
Relevance
SmartWrap represents a creative approach to a persistent accessibility problem: the gap between how data is visually presented on the web and what screen readers can convey about that data's structure. Rather than waiting for web authors to add proper semantic markup (which decades of advocacy have failed to achieve at scale), the system enlists the crowd of sighted users to supply the missing semantics. This crowdsourcing model, in which accessibility improvements are produced as a byproduct of a task users already want to do (scraping data), is a pragmatic strategy that avoids relying solely on developer goodwill. For accessibility practitioners, the paper quantifies a problem that is often discussed only anecdotally: visual layout and semantic markup are deeply misaligned on the real web. The finding that voluntary contributions occurred without solicitation supports the broader idea that accessibility improvements can be crowdsourced if the barrier to contribution is low enough. A key limitation is that the system had not yet been deployed to actual screen reader users at the time of publication.
Tags: crowdsourcing · screen readers · web accessibility · semantic annotation · web scraping · end user programming · data tables · nonvisual accessibility · wrapper construction