A routing map for testing WCAG 2.1.1, 2.1.2 & 1.3.1

What this is

Automated tools catch part of WCAG and people handle the rest — that part is well known. This is an honest routing map of which part is which, for the three criteria a from-scratch automation effort tends to get stuck on (2.1.1, 2.1.2, 1.3.1): what a deterministic tool can own, what an AI model can usefully propose for a human to confirm, and what stays a human call. Every placement is tagged by how well today's evidence backs it. No grand claim — a map you can act on, with the receipts and the gaps marked.

The three criteria

2.1.1 Keyboard — “All functionality of the content is operable through a keyboard interface without requiring specific timings for individual keystrokes”¹.
2.1.2 No Keyboard Trap — once focus enters a component, “…then focus can be moved away from that component using only a keyboard interface…”².
1.3.1 Info & Relationships — “Information, structure, and relationships conveyed through presentation can be programmatically determined or are available in text”³.

The routing map

This map is the brief's synthesis from the cited sources (not a quotation); the tags rate how well current evidence backs each placement.
Criterion / layer	A deterministic tool can own	AI proposes → human confirms	Human-led
2.1.2 No Keyboard Trap	Deterministically automatable — but only with a focus-driving / AT-driver harness (tab through, confirm you can't get stuck), not the static axe-core / Equal Access engines, which catch only DOM-inferable traps.		Unusual plug-in / embedded boundaries.
2.1.1 Keyboard	Confirm an element is reachable, and — by driving it at runtime — activatable; static engines catch only DOM-inferable problems.	Enumerate a custom widget's expected keys (per its role) for a harness or human to verify.	Whether all functionality has a keyboard path and honours the role-appropriate key contract — a generic script can't know the intended role.
Companions — 2.4.7 Focus Visible, 2.4.3 Focus Order, 4.1.2 Name/Role/Value	A focus style is declared (2.4.7); a name/role/value is present (4.1.2).		Whether the focus indicator is actually visible (2.4.7); whether focus order is meaningful (2.4.3); whether the accessible name is the right name.
1.3.1 — markup present	A heading is marked up as a heading, a label is tied to its input, a cell has a header.
1.3.1 — relationship correct	Check that a `<th>` / `headers`/`id` association exists and is referentially valid — but not whether it scopes the right cells.	Propose the likely-correct association from context.	Confirm it.
1.3.1 — intent matches the visual meaning	Out of reach for a rule engine.	Propose structure from the rendered layout — unvalidated (see Research directions).	Decide.

✓ proven (tool docs / standards) ◐ partial / case-by-case ⚠ promising but unvalidated ✗ not feasible for a rule engine – not applicable
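As an illustration of the deterministic cell in the "relationship correct" row, the sketch below checks only that every `headers` token on a table cell resolves to the `id` of some `<th>`. It is a minimal example using Python's standard `html.parser`, not the method of axe-core or Equal Access. The names `HeaderRefCollector` and `dangling_header_refs` are hypothetical, and the sketch assumes, as a simplification, that the whole document is one table scope. It says nothing about whether a resolved header scopes the right cells, which stays with the proposer and the human.

```python
from html.parser import HTMLParser


class HeaderRefCollector(HTMLParser):
    """Collect the ids of <th> cells and every token used in a headers attribute."""

    def __init__(self):
        super().__init__()
        self.th_ids = set()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "th" and attr.get("id"):
            self.th_ids.add(attr["id"])
        if tag in ("td", "th") and attr.get("headers"):
            # headers is a space-separated list of ids
            self.refs.extend(attr["headers"].split())


def dangling_header_refs(html):
    """Return the headers tokens that do not resolve to any <th> id.

    Simplifying assumption: the document is treated as a single table scope.
    """
    collector = HeaderRefCollector()
    collector.feed(html)
    collector.close()
    return sorted(set(collector.refs) - collector.th_ids)


sample = (
    '<table><tr><th id="q1">Q1</th></tr>'
    '<tr><td headers="q1 q2">10</td></tr></table>'
)
print(dangling_header_refs(sample))  # ['q2']
```

An empty result means only that the references are valid, not that the association is the correct one.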

Reading the map

Three honest distinctions hold it together:

2.1.2 is not 2.1.1. Detecting a keyboard trap is close to deterministic — given a harness that actually drives focus; a static scan alone can't. But “all functionality works by keyboard” is harder: a script can confirm an element is reachable, yet it can't know a custom widget's intended interaction model — composite components manage focus with a roving tabindex (“When using roving tabindex to manage focus in a composite UI component…”⁸), and which arrow / Escape keys should do what differs by role (menu vs grid vs tree). Those patterns come from the ARIA Authoring Practices Guide, which is non-normative: diverging from them isn't automatically a 2.1.1 failure — 2.1.1 fails only when missing keys leave functionality keyboard-inoperable. Keyboard operability is also necessary, not sufficient: pair it with focus order (2.4.3), visible focus (2.4.7) and name/role/value (4.1.2).
1.3.1 is three layers, not two. A tool can verify a structural element is present; whether the relationship is correct (does this header scope the right cells?) is partly machine-checkable at best, and whether the markup matches the intended visual meaning is a judgment. Layers two and three are most of 1.3.1's real failures — so the genuinely-deterministic share of 1.3.1 is modest.
“Own” doesn't mean “complete.” Even where a deterministic tool owns a check, it finds issues only where a rule fires: “many accessibility problems can only be discovered through manual testing”⁴, and “Absence of detected errors does not indicate that a page is accessible or conformant.”⁵
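The first distinction, that trap detection needs something to actually drive focus, can be sketched as a small loop. This is a minimal illustration under stated assumptions: `is_trapped` and `make_fake_page` are hypothetical names, and the fake page stands in for a real harness whose `press_tab` callable would send a Tab keystroke to a browser and report which element now has focus. The loop assumes focus already sits inside the component when it starts.

```python
from typing import Callable, Iterable, List


def is_trapped(press_tab: Callable[[], str],
               component_ids: Iterable[str],
               max_presses: int = 50) -> bool:
    """Press Tab repeatedly; report True if focus never leaves the component.

    press_tab is assumed to send one Tab keystroke and return the id of the
    element that holds focus afterwards.
    """
    inside = set(component_ids)
    for _ in range(max_presses):
        if press_tab() not in inside:
            return False  # focus escaped, so Tab alone is not trapped
    return True  # press budget exhausted without leaving the component


def make_fake_page(tab_order: List[str], start: str) -> Callable[[], str]:
    """Hypothetical stand-in for a browser: Tab moves to the next id, wrapping."""
    state = {"index": tab_order.index(start)}

    def press_tab() -> str:
        state["index"] = (state["index"] + 1) % len(tab_order)
        return tab_order[state["index"]]

    return press_tab


modal = {"modal-ok", "modal-cancel"}
leaky = make_fake_page(["link", "modal-ok", "modal-cancel", "footer"], "modal-ok")
stuck = make_fake_page(["modal-ok", "modal-cancel"], "modal-ok")
print(is_trapped(leaky, modal))  # False
print(is_trapped(stuck, modal))  # True
```

A `True` result is a candidate for review rather than a verdict: 2.1.2 also allows other exit methods when the user is advised of them, so a fuller harness would try Shift+Tab and any documented exit key before reporting a trap.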

Where AI actually helps — and where it only looks like it does

An accepted assist — with the output still unvalidated

Once a deterministic tool localises an issue, an AI model can draft a candidate fix — alt-text, a corrected label, a clearer error message. This is an accepted human-in-the-loop workflow, but a sound workflow is not the same as a correct output: a human must confirm each draft is both accurate and contextually appropriate before it ships. Alt-text is the weakest case — AI descriptions are context-dependent and often plausible but wrong, the same silent-failure mode flagged below. AI is the assistant on a localised finding, never the detector of record and never the final word.

Promising but unproven: AI judging structure

The tempting move is a vision-language model proposing 1.3.1 structure from the rendered page. The capability to read UI structure exists — ScreenAI is “a vision-language model that specializes in UI and infographics understanding”¹² — but that is extraction, not adjudication, and no source here validates a model's accuracy at judging whether markup matches intended meaning. AI's semantic output also fails silently (a confident, wrong alt text slips past a glance). So treat this as a research direction to pilot with measurement, not a tool to trust: take a labelled set of pages where structure does and doesn't match the visual meaning, have the model propose, and measure agreement against expert raters. Until that exists, it stays tagged ⚠.
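One way to score the pilot described above is a chance-corrected agreement statistic. The sketch below computes Cohen's kappa between the model's labels and one expert's labels (or an expert consensus label) for the same pages. Kappa is offered here as one reasonable choice, not something the cited sources prescribe; the function name and the label values `"match"` / `"mismatch"` are hypothetical, and the sample data is made up purely to show the arithmetic.

```python
from collections import Counter


def cohens_kappa(model_labels, expert_labels):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    if not model_labels or len(model_labels) != len(expert_labels):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(model_labels)
    # Share of pages where the model and the expert gave the same label.
    observed = sum(m == e for m, e in zip(model_labels, expert_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    model_freq = Counter(model_labels)
    expert_freq = Counter(expert_labels)
    expected = sum(model_freq[k] * expert_freq[k] for k in model_freq) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


# Illustrative, made-up labels for eight pages.
model = ["match", "match", "mismatch", "mismatch", "match", "mismatch", "match", "match"]
expert = ["match", "mismatch", "mismatch", "mismatch", "match", "mismatch", "match", "match"]
print(cohens_kappa(model, expert))  # 0.75
```

Here the raters agree on 7 of 8 pages (0.875) while chance agreement is 0.5, giving kappa 0.75. With several expert raters, the same pilot would also need the experts' agreement with each other as a baseline.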

What the data does — and doesn't — say. The systematic review centres on text and structure (“most studies apply LLMs to text-centric and structurally explicit accessibility tasks, with WCAG serving as the primary reference framework and limited consideration of cognitive accessibility guidelines (COGA)”⁶), and its issue table records studies touching alt-text most, then contrast and name/role/value, and — notably — keyboard (2.1.1) and heading structure (1.3.1) too.⁷ Read that as research attention spanning these criteria (so this isn't a text-only frontier) — not as evidence AI succeeds at them: the studies' actual results weren't verified here, and efficacy is precisely the open gap. Treat the counts as approximate rank-order (the review is even internally inconsistent on its own total), and note the same table logs hallucinated image descriptions in several studies — the silent-failure risk, in the data.

The tools, honestly bounded

You build on a mature stack, not a blank page — but each layer has edges worth stating:

Deterministic engines — axe-core and its test-runner integrations, IBM Equal Access (“tools to automate accessibility checking from a browser or in a continuous development/build environment”⁹), Playwright to drive the keyboard. Excellent where a rule fires; not a completeness guarantee (above).
Driving the screen reader — Guidepup (“Screen reader driver for test automation”¹¹) and the W3C AT Driver (“AT Driver defines a protocol for introspection and remote control of assistive technology software”¹⁰). This captures what a screen reader announces — genuinely on-point for the behavioural criteria — but judging whether the announcement is adequate is itself a human call, and the layer is emerging, not turnkey (AT Driver is a draft; Guidepup is platform-bound).
The human-led floor — W3C's evaluation methodology (WCAG-EM, a Working-Group Note) notes that although “most accessibility checks are not fully automatable, evaluation tools can significantly assist evaluators”¹³: human-led, tool-assisted.

Research directions & open gaps

The 1.3.1-intent experiment above is the missing measurement: no published accuracy or inter-rater validation exists for an AI judging whether structure matches meaning.
Screen-reader-driver automation is emerging; coverage across assistive-technology / browser combinations is uneven.
Cognitive accessibility (COGA) is barely covered in the LLM literature.⁶
No controlled study measures how much manual time any of this actually saves.

Limitations & scope

Scoped to the WCAG 2.1 success criteria cited here; if your target is WCAG 2.2, re-check the specifics against that version.
The routing tags are this brief's assessment from the cited sources, not a measurement.
Several sources are preprints; tool capabilities are from the vendors' and standards' own docs (none benchmarked here); the study counts are approximate. State of the field as of June 2026.

Evidence register

Every quotation was re-checked, verbatim, against its captured source at build time. Each source is labelled by type; “preprint” means not yet peer-reviewed.

1. “All functionality of the content is operable through a keyboard interface without requiring specific timings for individual keystrokes” — WCAG 2.1 — SC 2.1.1 Keyboard (Level A) Normative (WCAG)
2. “then focus can be moved away from that component using only a keyboard interface” — WCAG 2.1 — SC 2.1.2 No Keyboard Trap (Level A) Normative (WCAG)
3. “Information, structure, and relationships conveyed through presentation can be programmatically determined or are available in text” — WCAG 2.1 — SC 1.3.1 Info and Relationships (Level A) Normative (WCAG)
4. “many accessibility problems can only be discovered through manual testing” — Playwright — Accessibility testing (docs) Tool documentation
5. “Absence of detected errors does not indicate that a page is accessible or conformant.” — WebAIM Million 2026 Empirical report
6. “most studies apply LLMs to text-centric and structurally explicit accessibility tasks, with WCAG serving as the primary reference framework and limited consideration of cognitive accessibility guidelines (COGA)” — LLMs for Web Accessibility: a Systematic Literature Review (2026) Peer-reviewed
7. “Keyboard navigation / tabindex issues” — LLMs for Web Accessibility: Systematic Review — issue-frequency table Peer-reviewed
8. “When using roving tabindex to manage focus in a composite UI component” — ARIA Authoring Practices Guide — Developing a Keyboard Interface W3C standards / methodology
9. “tools to automate accessibility checking from a browser or in a continuous development/build environment” — IBM Equal Access Accessibility Checker Tool documentation
10. “AT Driver defines a protocol for introspection and remote control of assistive technology software” — W3C AT Driver (draft protocol) W3C standards / methodology
11. “Screen reader driver for test automation” — Guidepup — screen reader driver for test automation Tool documentation
12. “a vision-language model that specializes in UI and infographics understanding” — ScreenAI: A Vision-Language Model for UI and Infographics Understanding Preprint (not peer-reviewed)
13. “most accessibility checks are not fully automatable, evaluation tools can significantly assist evaluators” — W3C WCAG-EM 1.0 (Evaluation Methodology, Working-Group Note) W3C standards / methodology