Automated accessibility testing

Five proof-of-concept demonstrations of AI-driven accessibility testing — the territory commercial automated scanners cannot reach. Each demo applies Claude Opus to a specific class of accessibility issue, with the exact prompt, the captured Claude response, and an honest discussion of what worked and what did not. Open-source GPL-3.0 at github.com/bobdodd/automated-testing (the project was originally named a11yAuto; renamed in May 2026 to make the content more legible).

What this page is — and what it isn’t

These are exploratory experiments, not commercial-tool replacements. The page demonstrates that the AI-driven approach works for some categories of accessibility issue that classic automated tooling cannot find — and is honest about where the approach is currently fragile. The strongest reading of this page is “Bob explores AI-for-accessibility carefully, with worked examples, and tells you where the approach succeeds and where it fails.” The strongest reading is not “Bob has built AI accessibility tools that work in production.” That distinction matters, both for the integrity of the Oracle frame and for not getting caught overclaiming.

Each demo follows the same shape: what the test attempts to detect; which WCAG success criteria are involved; the prompt verbatim from the upstream repository; the captured Claude response that the prompt produced; and a discussion section where the PoC honesty actually lives — what worked, what didn’t, and what would need to be done before this could be used in real audit work.

The test page for every demo is InaccessibilityMatters, a deliberately-broken site purpose-built to exercise specific accessibility failures. It currently lives at bobd69.sg-host.com pending migration off SiteGround to inaccessibility-matters.a11ybob.com. The prompts here all use the rendered HTML of that page plus, where relevant, a screenshot.

The captured prompt-and-response on each demo is always visible without any input. The plan calls for an optional bring-your-own-key panel that lets visitors re-run a demo against a different URL with their own Anthropic API key — that panel is on the roadmap; the captured demos are the substance the page exists for.

Relationship to the rest of the tooling

Three different approaches to accessibility testing, all Bob-owned, all open-source, each with its own scope and limits:

Carnforth — a runtime Chrome extension that tests the rendered DOM against the live page. Browser-side, no network round-trips, written in plain JavaScript.
automated-testing (this page) — AI-driven testing applied to specific classes of issue commercial tools cannot reach. Five PoCs, explicitly experimental, with discussion of where each is fragile.
Paradise — source-level multi-model analysis. Parses HTML, JavaScript, and CSS into separate semantic models; reasons over the integrated structure. The most technically substantive of the three; sits at the opposite end of the spectrum from automated-testing in terms of methodology.

All three connect to the longer arc: the 2029 framework is the theoretical end of the research line whose practical present-day end the automated-testing PoCs probe. The PoCs identify where AI testing can plausibly add value beyond commercial tools, and where the approach is currently fragile. The 2029 work would build out the formal structure that makes this kind of testing systematic rather than ad-hoc.

Demo 1: Language detection from text content

What this test attempts to detect

The dominant language of a web page, detected from the actual text content rather than from the HTML lang attribute, compared against the lang attribute to see if the declaration is correct. Also detects words in languages other than the dominant one, and checks whether they are correctly wrapped in their own lang attributes.

Commercial automated tools do not test for this; they check whether lang is present, not whether it is accurate. A page declared as French but written entirely in English passes their checks. This demo catches that case.

WCAG criteria

3.1.1 Language of Page (Level A)
3.1.2 Language of Parts (Level AA)

The prompt

Based on the textContent of each element, what is the dominant language of the following HTML? Does this match the language defined for the page in the HTML? Are there any words not in the dominant language? If so, do they have the required lang attribute applied?

Followed by the rendered HTML of the test page. Full prompt — HTML included — lives in the upstream repository.

Claude’s response

Based on the textContent of the elements, the dominant language of the HTML page appears to be English.
However, this does not match the language defined for the page in the HTML tag, which is set to "fr" (French).

There are a few words and phrases that are not in English:

1. "Español" in the language switcher
2. "Français" in the language switcher

These non-English words do not have the required `lang` attribute applied to them.
To properly indicate the language of these words, you should wrap them in an appropriate element
(e.g., `<span>`) and add the `lang` attribute with the corresponding language code. For example:

```html
<a href="#"><span lang="es">Español</span></a>
<a href="#"><span lang="fr">Français</span></a>
```

By adding the `lang` attribute, you provide a clear indication of the language used
for those specific words, which is important for accessibility and proper language identification.

Discussion

The goal is to test a page for WCAG 3.1.1 (Language of Page) and 3.1.2 (Language of Parts) by detecting the dominant language from the actual text rather than from the syntax of the lang attribute — going beyond the classic automated testing currently available.

The prompt queries the embedded HTML of a web page. That HTML is, in this case, copied by inspecting the rendered web page using the Chrome dev tools and copying the element.

Claude’s HTML knowledge is used to identify the text content of the page and identify the dominant language of that text. On this page that is English. The declared language of the page is fr. Claude is asked to compare the two and report the mismatch; it does. Claude is also asked to identify any words not in the dominant language — the language switcher contains French and Spanish — and to check whether those words are wrapped in their own lang attributes. They are not, and Claude reports that as well, with a remediation example.

The most reliable of the five demos. Language detection is something LLMs do well; the comparison-with-declared-attribute is mechanical; the remediation suggestion is sound. This is the cleanest example of AI testing finding something commercial tools currently cannot.

Demo 2: Headings that should be there, and headings that have gone missing

What this test attempts to detect

Two complementary failure modes. (a) Apparent headings on the rendered page that are not marked up as headings in the HTML. (b) Marked-up headings in the HTML that are not visible on the rendered page — either deliberately hidden, or obscured by other content. Classic automated tools test the semantics that are there; they do not find the semantics that should be there but aren’t, nor the markup that is overridden by layout.

The test combines image recognition (what looks like a heading on the page) with HTML parsing (what is actually marked up). The mismatches are the accessibility failures.

WCAG criteria

1.3.1 Info and Relationships (Level A) — unmarked-up headings, headings out of order
2.4.7 Focus Visible (Level AA) — marked-up headings obscured by other content
4.1.2 Name, Role, Value (Level A) — broken WAI-ARIA heading-level attributes (a follow-on test)

The prompt

The image is a web screenshot. The following content is the HTML source code for the image. Using only the image, identify each probable heading in the image and the hierarchy based on font name, size, weight, color and prominence of text, even if the heading is quite short or a misplaced link. Do however exclude those that appear to be items in menus, menu bars, pop-up menus, and tab headings (a tab heading may look like a coloured rectangle containing the text in a list of other text, or it may look like text with an icon with some space around it above or to one side of it). Using only the HTML, identify all of the headings declared in the HTML as <hx> headings or role of heading (excluding any that appear to be on non-visible modal dialogs) and their text. Match the text of the image headings with the headings in the HTML. Identify headings in the HTML not visible in the image. For HTML headings that are not matched, inspect the headings to identify if the HTML heading is deliberately visually hidden, or more likely obscured by other content. Identify headings in the image that are not marked up in the HTML. Report all of the identified content as a single JSON object. Name the array of image headings 'imageHeadings'. Name the array of HTML headings 'htmlHeadings'. Name the array of matched headings 'matchedHeadings'. If the matched headings differ in some way e.g. in capitalization then note this. Name the array of unmatched HTML headings 'unmatchedHtmlHeadings'. Name the array of obscured HTML headings 'obscuredHtmlHeadings'. Name the array of unmatched image headings 'unmatchedImageHeadings'. For each unmatched image heading provide an object with attributes 'text' and 'level' to describe the properties; in this case also identify the XPath to the equivalent element in the HTML where text was used. I do not require any other output other than the JSON. Respond ONLY with the stringified JSON and no other text.

Claude’s response

{
    "imageHeadings": [
        { "text": "WELCOME TO INACCESSIBILITY MATTERS", "level": 1 },
        { "text": "Click to open popup", "level": 2 },
        { "text": "Learn more", "level": 2 },
        { "text": "to learn more about us", "level": 2 },
        { "text": "Explore our accessibility resources", "level": 2 },
        { "text": "Get in touch with our accessibility experts", "level": 2 },
        { "text": "Subscribe to Our Newsletter", "level": 3 },
        { "text": "Contact Us", "level": 2 }
    ],
    "htmlHeadings": [
        { "text": "Send us a message", "level": 2 },
        { "text": "Welcome to Inaccessibility Matters", "level": 1 },
        { "text": "Subscribe to Our Newsletter", "level": 3 }
    ],
    "matchedHeadings": [
        {
            "imageText": "WELCOME TO INACCESSIBILITY MATTERS",
            "htmlText": "Welcome to Inaccessibility Matters",
            "note": "Text differs in capitalization"
        },
        {
            "imageText": "Subscribe to Our Newsletter",
            "htmlText": "Subscribe to Our Newsletter"
        }
    ],
    "unmatchedHtmlHeadings": [
        { "text": "Send us a message", "level": 2 }
    ],
    "obscuredHtmlHeadings": [],
    "unmatchedImageHeadings": [
        { "text": "Click to open popup", "level": 2, "xpath": "/html/body/div[3]" },
        { "text": "Learn more", "level": 2, "xpath": "/html/body/a[1]" },
        { "text": "to learn more about us", "level": 2, "xpath": "/html/body/a[1]" },
        { "text": "Explore our accessibility resources", "level": 2, "xpath": "/html/body/a[2]" },
        { "text": "Get in touch with our accessibility experts", "level": 2, "xpath": "/html/body/a[3]" },
        { "text": "Contact Us", "level": 2, "xpath": "/html/body/footer/button" }
    ]
}

Discussion

This example mixes image recognition with source-code parsing. The goal is to find all of the likely headings on the page — irrespective of whether they have <hx> tags or aria-level attributes — check whether the heading order makes sense, and indirectly catch content that is obscured.

The page is inspected at the rendered-content level for clues as to what the headings may be, based on text size, font, prominence, colour, and position; the HTML markup is inspected for explicit heading levels. The two sets of findings are then compared.

The prompt is designed to report in JSON so that the result can be used programmatically to create a report, and to potentially “fix” the page via an overlay-style annotation.

The output finds not just the likely headings but the likely hierarchy based on position and styling on the page; levels 1, 2, and 3 are all identified. In comparison, there are only three headings discovered in the HTML and they appear to be out of order (level 2, then 1, then 3).

The prompt also matches the headings, and the JSON shows that two of the HTML headings match those discovered from the screenshot, with one varying in capitalisation (which means CSS is being used to capitalise the text). The third HTML heading is unmatched (“Send us a message”) because it is hidden by the floating cookie notice at the bottom of the page. That indicates we would need to test the focus order of the page to ensure keyboard focus cannot be lost under that notice — a possible issue for SC 2.4.7 Focus Visible. All of the unmatched headings found in the screenshot fail WCAG 1.3.1 Info and Relationships.

An open question about prompt scope. How much of this work should actually be in the prompt at all? Coding and testing of page HTML is effectively “free” in that we can run the test multiple times at no further cost than writing it. Prompts, on the other hand, are costly: there is a per-token charge for input and output on every run, and AI prompts are slower than algorithms running locally on a desktop. The parts of the prompt that work on the screenshot are the parts that should be in the prompt; the rest is just a programming exercise — perhaps a complex one, but entirely possible.

Demo 3: Positioning of cookie notices and language switchers

What this test attempts to detect

Cookie notices and language switchers are accessibility-critical content because of when they need to be reached, not because of what they contain. The user must be aware of a cookie notice before they interact with anything that sets cookies; a user who arrives on a page in a language they don’t read must be able to find the language switcher without navigating the page in the wrong language first. This demo identifies whether those two elements exist, how many tab presses it takes to reach each, and whether a screen-magnifier user at 400% magnification would be aware of them without vertical scrolling.

WCAG criteria

5.2.4 Accessibility-supported (conformance) — the site must support the assistive technologies WCAG names; failing to do so fails conformance
4.1.2 Name, Role, Value (Level A) — syntax and role issues found in passing

The prompt

Using the source code below and the screenshot of that rendered page, identify whether there is a cookie notice and/or a language switcher. How many times must the user tab from the beginning of the page to reach each of those elements (assuming that it is possible at all)? Where in the viewport are those elements and would a screen-magnifier user on 400% or greater magnification be aware of them without scrolling down the page (assuming the magnifier is currently top-left)?

Claude’s response (second of three sets)

Three response sets were captured at different temperature / token-count settings. The second set is reproduced below as the most detailed; the upstream repository carries all three.

Number of tokens: 4000
Temperature: 1

Based on the updated instructions and the provided source code and screenshot, here are my findings:

1. Cookie Notice:
   - There is a cookie notice present at the bottom of the page.
   - The cookie notice is not visible in the provided screenshot without scrolling down, so a screen-magnifier user at 400% or greater magnification would not be aware of it without scrolling.
   - To reach the cookie notice from the beginning of the page using the tab key, the user would need to tab through the following elements in order: skip link, hamburger menu button, "Click to open popup" button, "Contact Us" button, email input field, subscribe button, and finally the "Accept" button in the cookie notice. This requires a total of 7 tab presses.

2. Language Switcher:
   - There is a language switcher present on the page, located in the top-right corner.
   - However, the language switcher is marked as `aria-hidden="true"`, which means it is hidden from assistive technologies and not accessible via keyboard navigation.
   - The language switcher links also have `rol="presentation"` (which should be `role="presentation"`) and `pointer-events: none;`, further indicating that they are not interactive or accessible.
   - Since the language switcher is not accessible, it cannot be reached by tabbing from the beginning of the page.

To summarize, the cookie notice is present but requires scrolling and 7 tab presses to reach from the beginning of the page, while the language switcher is present but inaccessible to keyboard and assistive technology users.

Discussion

The demo looks at accessibility issues related to positioning of content. The issues sit in section 5.2.4 of WCAG — accessibility-supported, dealing with conformance claims. 5.2.4 in a nutshell says that even if all the WCAG Success Criteria have been met, a website does not conform to WCAG if it does not support regular assistive technology. That is usually taken to mean screen readers and screen magnifiers as a bare minimum. If important or legally necessary content is not easily findable because of design decisions that impact assistive technology, a site has failed conformance — and by inference, legislation that mandates it.

The prompt provides the source code of the InaccessibilityMatters page and a screenshot of the rendered page, asking first for detection of the two elements. For those found, it asks how many tab presses it takes from the top of the page to reach them — an accessibility issue because the user shouldn’t interact with content that modifies cookies before being aware that the site uses them.

Claude finds that the page contains both a cookie notice and a language switcher. The cookie notice is at the bottom of the page, and is not considered visible at 400% magnification with the magnifier at the top of the page. The claim is from Claude, and further investigation is needed to know what tests Claude used to arrive at it.

Three response sets are captured, with Claude at different token-budget and temperature settings. They give similar results; the second (more tokens, more creativity) provides a more detailed response. The third set has Claude access only the source code without the screenshot. Claude itself notes the limit: “Based on the provided styles, the cookie notice has a fixed position at the bottom of the page with a white background colour and small font size. However, without seeing the rendered page, it’s difficult to determine if a screen-magnifier user at 400% or greater magnification would be aware of it without scrolling.” That answer is the right kind of honest. The screenshot is doing real work for parts of this test that the source alone cannot answer.

Looking at the second response, Claude identifies the explicit tab order on the page followed by the user tabbing to the cookie notice, naming the interactive elements as it goes; it requires eight tabs to reach the accept button. Claude also identifies that the language switcher is aria-hidden and has a misspelled rol="presentation" attribute (which should be role="presentation"); correctly identifies the language switcher as keyboard-unreachable; and surfaces both conformance issues under WCAG 5.2.4 and syntax issues under WCAG 4.1.2.

Demo 4: Tab-order vs reading-order divergence and missing aria-expanded

What this test attempts to detect

Two interactive-content failure modes: content that’s present in the HTML reading order but unreachable via the tab order; and disclosure widgets (drop-down menus, expandable panels) that fail to declare their state via aria-expanded.

WCAG criteria

1.3.1 Info and Relationships (Level A)
2.4.3 Focus Order (Level A)
4.1.2 Name, Role, Value (Level A)

The prompt

The image is a web screenshot. The following content is the HTML source code for the image. For currently visible content, does the page tab order match the HTML reading order? If not then where does it diverge and what is the XPath at the point of divergence? Identify any interactive element on the page that is not keyboard accessible and give the XPath and reason. Are there any interactive elements on the page where aria-hidden has been applied? If so please give the XPath of the interactive elements affected. Looking at both the screenshot and HTML, identify any apparent pop-up/drop down menus or disclosure widgets (at any responsive breakpoint) that does not utilize aria-expanded, but exclude buttons that open modal dialogs. Give all answers in a single JSON object. I do not require any other output other than the JSON.

Claude’s response

{
  "tabOrder": "The tab order does not match the HTML reading order. It diverges at the language switcher links, which are not keyboard accessible due to the 'pointer-events: none' style. The XPath at the point of divergence is /html/body/div[2].",
  "nonAccessibleInteractiveElements": [
    { "xpath": "/html/body/div[2]/a[1]", "reason": "Pointer events disabled" },
    { "xpath": "/html/body/div[2]/a[2]", "reason": "Pointer events disabled" },
    { "xpath": "/html/body/div[2]/a[3]", "reason": "Pointer events disabled" }
  ],
  "ariaHiddenInteractiveElements": [
    "/html/body/div[2]/a[1]",
    "/html/body/div[2]/a[2]",
    "/html/body/div[2]/a[3]"
  ],
  "missingAriaExpanded": [
    "/html/body/header/nav/nav/div"
  ]
}

Discussion

The demo takes aim at accessibility issues in interactive content. It looks for unreachable content by comparing reading order with tab order, and explicitly looks for use of aria-hidden on interactive elements. It also looks for drop-down menus and disclosure widgets to see if they at least use aria-expanded (the test is not currently strong enough to know whether it is used correctly).

The prompt provided the source code of InaccessibilityMatters and a screenshot of the rendered page (to potentially help with reading order and visibility). The prompt requests a reply in JSON, but not as precisely as it should for tab order — visible in the response, which does not properly separate out related information for programmatic use.

What Claude does find is that there is unreachable content in the reading order. pointer-events: none is reported as the reason — I’m not sold on that argument; I expected it to report on the aria-hidden attribute. So: right answer but wrong reason, I think. That said, the response does also recognise the elements as aria-hidden.

Claude finds one drop-down menu (the hamburger menu) and correctly detects that aria-expanded is missing.

The same prompt run without the screenshot received the same response. It would be interesting to see whether larger, more complex pages would benefit from the screenshot being provided, or whether this is a case of AI simply creating and running algorithms that could be hand-coded.

Demo 5: Modal dialogs — detection, focus management, escape, inert background

This is the most fragile of the five. The detection of modal dialogs relies heavily on Claude inferring intent from the page, and the follow-on tests for focus management, escape handling, and background inertness depend on that detection being right in the first place. The upstream README flags this prominently; this page does the same.

What this test attempts to detect

Modal dialogs and the click handlers that open them. For each found, the test asks: does focus move to the modal when it opens? Does focus return to the opening button when the modal closes? Does the modal start with an <h1> or <h2> heading? Does it have a role of dialog or alertdialog, or use the native <dialog> element? Is there a close button? Does Escape close the dialog? Is the background actually inert?

None of the issues found by this demo are currently available in commercial automated test tools.

WCAG criteria

2.1.1 Keyboard (Level A) — Escape-to-close
2.4.3 Focus Order (Level A) — focus management on open and close
4.1.2 Name, Role, Value (Level A) — role="dialog" / alertdialog
1.3.1 Info and Relationships (Level A) — modal heading

The prompt

The image is a web screenshot. The following content is the HTML source code for the image. For currently visible content, identify any interactive element, or non-interactive element with click handler that appears to open a modal dialog with the XPath of the button and the XPath of the modal dialog were it to open. For each click handler found that opens a modal, does focus move to the modal dialog when the modal dialog opens or does it remain on the button? (test for this by checking for the javaScript focus() method). Does focus return to the button when the dialog closes? Does the modal dialog start with a heading, and is it either an <h1> or <h2> and if so, which one? Does the modal dialog have a role of dialog or alertdialog or use the <dialog> element? Does the modal dialog have a close button? Does the modal dialog close if escape is pressed? Is the modal dialog truly modal, is the rest of the page inert? Give all answers in a single JSON object. I do not require any other output other than the JSON.

Claude’s response

{
  "clickHandlers": [
    {
      "button": "//*[@id='absolutePoistion']",
      "modal": null,
      "focusMovesToModal": false,
      "focusReturnsToButton": false,
      "modalHeading": null,
      "modalHeadingLevel": null,
      "modalRole": false,
      "modalCloseButton": false,
      "modalClosesOnEscape": false,
      "trulyModal": false
    },
    {
      "button": "//*[@id='openModal']",
      "modal": "//*[@id='modal']",
      "focusMovesToModal": false,
      "focusReturnsToButton": true,
      "modalHeading": true,
      "modalHeadingLevel": "h2",
      "modalRole": false,
      "modalCloseButton": true,
      "modalClosesOnEscape": false,
      "trulyModal": false
    }
  ]
}

Discussion

The demo searches for buttons and click handlers that look like they may open modal dialogs. For each one found it tests how the dialog is opened and closed, the role given, whether the modal is truly modal (the background behind it is inert), and whether the dialog has a heading. The tests are based on the current state of the page, which we get from the pasted dev-tools page content. The results are requested in JSON form, partly for readability and partly to allow programmatic inspection.

Claude finds two potential modals. One is a real modal on the page; the other is a JavaScript alert() call, which is also a modal. The test did not handle the alert() case correctly — the modal is identified but the modal properties are not properly inspected.

Looking at the “real” modal, a full set of tests are reported. Claude correctly reports that focus is not moved to the modal when it opens (the test checks for the focus() method), but that focus does correctly return to the opening element (a button) when the modal is closed from its close button.

Claude also detects that the modal has an <h2> heading and a close button; identifies that the role has not been set to dialog; finds no Escape-based close handler; and reports that the background to the modal is not inert.

The risk with this prompt is that too much trust is being placed in Claude without validating the response. Claude’s response on this test page is near perfect — but I wonder how well it would do on content out in the wild. The detection step is where most of the inference happens, and detection failures will silently invalidate every downstream test. This is the demo most in need of structured validation against a larger, more varied corpus before being relied on.

Where this work points

The five demos taken together are an honest map of where AI-driven accessibility testing currently sits relative to commercial tools. Some categories of issue — language detection, heading inference, positioning-and-magnifier reasoning — the AI handles cleanly and produces auditable output. Others — modal dialog detection, the meaning of obscure attribute combinations — the AI attempts but cannot yet be trusted without structural validation.

The 2029 framework is the theoretical end of the research line whose practical present-day end these PoCs probe. The PoCs identify where AI can add value beyond commercial tools today; the 2029 work would build out the formal multi-agent structure that makes this kind of testing systematic rather than ad-hoc.

The captured demos here are the substance the page exists for. A bring-your-own-key panel that lets visitors re-run a demo against a different URL with their own Anthropic API key is on the roadmap; the honest demonstrations come first.