Coverage Gap Tracking Governance requires that organisations systematically identify where their evaluations fail to cover important behaviours, user populations, deployment contexts, or risk categories. Coverage gaps are not merely the absence of tests — they are the absence of assurance. This dimension mandates a formal, continuous process for mapping evaluation coverage against the full space of agent capabilities and risks, detecting gaps, prioritising remediation, and tracking closure. The goal is not perfect coverage (which is infeasible) but transparent, risk-prioritised coverage with documented justification for any accepted gaps.
Scenario A — Demographic Coverage Gap Creates Bias Blind Spot: A customer-facing agent for a retail bank is evaluated against a library of 320 scenarios covering product recommendations, complaint handling, and regulatory disclosure. However, 94% of test inputs use standard British English. When the agent is deployed to branches serving communities where English is a second language, it produces significantly degraded responses: misinterpreting questions with non-standard grammar, failing to detect complaint intent when expressed indirectly, and providing disclosures at a reading level inappropriate for the audience. A coverage gap analysis would have revealed that only 6% of scenarios represented non-native English input, despite 31% of the customer base being non-native English speakers.
What went wrong: No systematic mapping existed between the evaluation scenario distribution and the actual user population demographics. The coverage gap was invisible because no one measured coverage against a defined coverage target. Consequence: 340 customer complaints over two months, regulatory inquiry into fair treatment of customers, remediation cost of £180,000 for scenario library expansion and agent retraining, and reputational damage in affected communities.
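A minimal sketch of the distribution check that would have surfaced this gap before deployment, assuming each scenario and each customer record carries a segment label. The labels, the counts, and the ten-point tolerance are all illustrative:

```python
from collections import Counter

def demographic_gap_report(scenario_tags, population_tags, tolerance=0.10):
    """Compare the demographic distribution of evaluation scenarios against
    the actual user population and flag under-represented segments."""
    scenario_counts = Counter(scenario_tags)
    population_counts = Counter(population_tags)
    n_scen, n_pop = len(scenario_tags), len(population_tags)

    gaps = []
    for segment, pop_count in population_counts.items():
        expected = pop_count / n_pop                       # share of real users
        actual = scenario_counts.get(segment, 0) / n_scen  # share of scenarios
        if expected - actual > tolerance:
            gaps.append((segment, actual, expected))
    return gaps

# Scenario A's numbers: roughly 6% of 320 scenarios, versus 31% of customers,
# involve non-native English speakers.
scenarios = ["esl"] * 19 + ["native_english"] * 301
population = ["esl"] * 31 + ["native_english"] * 69
for segment, actual, expected in demographic_gap_report(scenarios, population):
    print(f"GAP: '{segment}' is {actual:.0%} of scenarios but {expected:.0%} of users")
```

The check is deliberately crude: it measures coverage against a defined target (the population distribution), which is exactly what Scenario A's programme never did.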
Scenario B — Capability Gap Emerges After Model Update: An enterprise workflow agent is upgraded from one foundation model version to another. The scenario library, built for the previous version, covers 12 capability areas. The new model introduces two additional capabilities: multi-step reasoning chains and tool-use orchestration. No scenarios exist for these new capabilities. The organisation runs its existing evaluation and achieves a 96.3% pass rate. Three weeks after deployment, the agent begins producing incorrect outputs in multi-step reasoning tasks — a failure mode that was never tested because the coverage matrix had not been updated to include the new capabilities.
What went wrong: The coverage matrix was static and did not evolve with the agent's capabilities. The model update introduced new capabilities without triggering a coverage gap analysis. The 96.3% score created false assurance by measuring coverage against an outdated capability map. Consequence: 23 incorrect workflow decisions over three weeks before detection, £67,000 in rework costs, and loss of user trust requiring a six-month rebuilding programme.
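A sketch of the reassessment trigger (requirement 4.8) that Scenario B lacked: on every model or configuration change, diff the agent's declared capabilities against the capability dimension of the coverage matrix. The capability names are illustrative:

```python
def untested_capabilities(agent_capabilities, matrix_capabilities):
    """Return capabilities the agent now has but the coverage matrix does
    not include. Each one is an untested blind spot."""
    return sorted(set(agent_capabilities) - set(matrix_capabilities))

# Scenario B in miniature: the matrix covers 12 capability areas; the model
# update silently adds two more.
matrix_capabilities = {f"capability_{i}" for i in range(1, 13)}
agent_capabilities = matrix_capabilities | {
    "multi_step_reasoning",
    "tool_use_orchestration",
}

gaps = untested_capabilities(agent_capabilities, matrix_capabilities)
if gaps:
    print("Model change introduced untested capabilities; reassess coverage:")
    for capability in gaps:
        print(f"  - {capability}")
```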
Scenario C — Regulatory Coverage Gap Discovered During Audit: A public sector agent providing benefits eligibility guidance is audited by the Information Commissioner's Office (ICO). The auditor asks for evidence that the agent has been tested against each of the eight data protection principles. The organisation's scenario library contains 150 scenarios, but mapping them against the eight principles reveals that Principle 6 (rights of data subjects) has only 2 scenarios and Principle 8 (international transfers) has zero. The organisation cannot demonstrate that it has evaluated the agent's compliance with two fundamental data protection principles. The ICO issues an enforcement notice requiring remediation within 90 days.
What went wrong: Scenarios were developed based on functional requirements and known risks, but no systematic mapping against regulatory requirements was maintained. The coverage gap in regulatory compliance testing was invisible until an external audit exposed it. Consequence: ICO enforcement notice, 90-day remediation deadline, £45,000 in emergency consulting costs, and the agent is restricted to supervised operation until remediation is verified.
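A sketch of the per-requirement tally that would have surfaced the gap before the audit rather than during it. The principle labels and scenario counts are illustrative:

```python
from collections import Counter

def regulatory_coverage(scenarios, requirements):
    """Count scenarios tagged against each regulatory requirement; zero
    counts are exactly the gaps an auditor finds first."""
    counts = Counter(tag for s in scenarios for tag in s.get("reg_tags", []))
    return {req: counts.get(req, 0) for req in requirements}

principles = [f"principle_{i}" for i in range(1, 9)]
# Scenario C in miniature: principles 1-5 and 7 are covered, principle 6
# has only two scenarios, and principle 8 has none.
scenarios = [{"reg_tags": [f"principle_{i}"]}
             for i in (1, 2, 3, 4, 5, 7) for _ in range(20)]
scenarios += [{"reg_tags": ["principle_6"]}] * 2

for req, count in regulatory_coverage(scenarios, principles).items():
    if count == 0:
        print(f"CRITICAL GAP: no scenarios for {req}")
    elif count < 3:
        print(f"THIN COVERAGE: {req} has {count} scenario(s)")
```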
Scope: This dimension applies to all AI agent deployments that undergo evaluation or testing. The coverage gap tracking requirement applies regardless of whether the evaluation is manual, automated, or a combination. It extends to all dimensions of coverage: functional capabilities, risk categories, user populations, deployment contexts, regulatory requirements, and adversarial vectors. The scope includes pre-deployment evaluation, post-deployment monitoring, and ongoing compliance certification. It does not prescribe what level of coverage is sufficient — that is a risk-based decision for each organisation — but it requires that coverage levels be measured, gaps be identified, and decisions about acceptable gaps be documented and justified.
4.1. A conforming system MUST maintain a coverage matrix that maps evaluation scenarios against at least four dimensions: agent capabilities, risk categories, user populations or segments, and regulatory requirements.
4.2. A conforming system MUST calculate and record coverage density for each cell in the coverage matrix at least quarterly, where a cell's coverage density is the number of active, reproducible scenarios mapped to that cell (a minimal sketch follows this list).
4.3. A conforming system MUST flag any coverage matrix cell with zero scenarios as a critical gap requiring immediate remediation or documented risk acceptance.
4.4. A conforming system MUST update the coverage matrix within 30 days of any material change to the agent's capabilities, deployment context, user population, or applicable regulatory requirements.
4.5. A conforming system MUST document a risk-based justification for any coverage gap that is accepted rather than remediated, including the residual risk, the acceptance authority, and the review date.
4.6. A conforming system MUST track coverage gap remediation to closure with target dates, assigned owners, and completion evidence.
4.7. A conforming system SHOULD automate coverage gap detection by integrating the scenario library metadata with the coverage matrix, flagging cells that fall below a defined threshold (e.g., fewer than 3 scenarios).
4.8. A conforming system SHOULD trigger a coverage gap reassessment automatically when the agent's model, configuration, or deployment scope changes.
4.9. A conforming system SHOULD include adversarial coverage as a dimension in the matrix, mapping red-team scenarios against known attack categories (e.g., prompt injection, data extraction, privilege escalation).
4.10. A conforming system MAY implement predictive gap detection using production telemetry to identify capability areas where the agent encounters inputs that differ significantly from any scenario in the library.
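The following sketch illustrates 4.1 to 4.3 and 4.7 together, assuming each scenario carries metadata tags for every matrix dimension. The dimension values and tag layout are illustrative, and the threshold of 3 is simply the example given in 4.7:

```python
from collections import defaultdict
from itertools import product

# The four mandatory dimensions from 4.1, with illustrative values.
DIMENSIONS = {
    "capability": ["product_recommendation", "complaint_handling", "disclosure"],
    "risk": ["financial_harm", "unfair_treatment"],
    "population": ["native_english", "esl"],
    "regulation": ["fca_conduct", "uk_gdpr"],
}

def build_matrix(scenarios):
    """Map each active, reproducible scenario into the matrix cell named by
    its metadata tags. A cell is a tuple with one value per dimension."""
    matrix = defaultdict(list)
    for s in scenarios:
        if not (s.get("active") and s.get("reproducible")):
            continue  # 4.2 counts only active, reproducible scenarios
        cell = tuple(s["tags"][d] for d in DIMENSIONS)
        matrix[cell].append(s["id"])
    return matrix

def gap_report(matrix, threshold=3):
    """4.2: density is the scenario count per cell. 4.3: zero-density cells
    are critical gaps. 4.7: cells below the threshold are thin coverage."""
    critical, thin = [], []
    for cell in product(*DIMENSIONS.values()):
        density = len(matrix.get(cell, []))
        if density == 0:
            critical.append(cell)
        elif density < threshold:
            thin.append((cell, density))
    return critical, thin
```

Because the cell keys and the enumeration both iterate DIMENSIONS in the same insertion order, density is a simple count per tuple key; any scenario-library format that stores per-scenario metadata can feed the same structure.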
Evaluation coverage is not binary — it exists on a spectrum from zero coverage (no testing at all) to comprehensive coverage (every meaningful combination of capability, context, and risk has been evaluated). Most organisations operate somewhere in the middle, but without formal gap tracking, they do not know where on the spectrum they sit, which areas are well-covered, and which areas are dangerously exposed.
Coverage gaps are particularly insidious because they are invisible by default. A test suite that passes 100% of its scenarios provides maximum confidence in the scenarios it tests — and zero information about the scenarios it does not test. An evaluation programme that achieves a 98% pass rate might cover only 40% of the agent's actual operating conditions. The 98% is real but misleading; it measures quality within the evaluated space, not the size of the evaluated space relative to the total operating space.
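To make the arithmetic concrete (the 98% and 40% figures are the ones used above, and behaviour in the untested 60% of the operating space is simply unknown):

```python
pass_rate_in_tested_space = 0.98
coverage_fraction = 0.40

# The evaluation bounds overall reliability only loosely: everything in the
# untested region could fail, or everything could pass.
worst_case = pass_rate_in_tested_space * coverage_fraction   # 0.392
best_case = worst_case + (1 - coverage_fraction)             # 0.992
print(f"Overall reliability lies somewhere in [{worst_case:.1%}, {best_case:.1%}]")
```

An interval nearly sixty points wide is all that a 98% pass rate licenses at 40% coverage; narrowing it requires widening coverage, not improving the pass rate.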
The coverage matrix approach addresses this by making the total operating space visible. By mapping scenarios against capabilities, risks, populations, and regulations, the matrix reveals both what is tested and what is not. A cell with zero scenarios is an explicit acknowledgement of a blind spot. A cell with one scenario is a single-point-of-assurance dependency — if that scenario is poorly specified, the entire coverage for that cell collapses.
The requirement for documented risk acceptance of gaps (4.5) is crucial. Not every gap can or should be filled — resources are finite and some combinations may be genuinely low-risk. But the decision to accept a gap must be conscious, risk-informed, and attributable. "We didn't know this gap existed" is a governance failure. "We identified this gap, assessed the residual risk as low because of mitigating controls X and Y, and the risk owner accepted it on this date" is a governance success, even if the gap remains open.
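A sketch of the record that 4.5 requires, capturing residual risk, acceptance authority, and review date. The field names and example values are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class GapAcceptance:
    """Documented risk acceptance for a coverage gap (requirement 4.5)."""
    cell: tuple                 # the coverage matrix cell left uncovered
    residual_risk: str          # the assessed residual risk, with reasoning
    mitigating_controls: list   # controls that justify the assessment
    accepted_by: str            # the acceptance authority
    accepted_on: date
    review_date: date           # acceptance expires; the gap is re-examined

acceptance = GapAcceptance(
    cell=("tool_use_orchestration", "financial_harm", "esl", "uk_gdpr"),
    residual_risk="low: capability disabled in production configuration",
    mitigating_controls=["feature flag off", "human review of tool calls"],
    accepted_by="Head of Model Risk",
    accepted_on=date(2025, 3, 1),
    review_date=date(2025, 9, 1),
)
```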
The material change trigger (4.4) reflects the reality that coverage gaps are not static. Every model update, capability expansion, new user population, or regulatory change can create new gaps in previously adequate coverage. Without a trigger mechanism, coverage degrades silently between review cycles.
Effective coverage gap tracking requires three components: a defined coverage space (what should be covered), a mapping from scenarios to that space (what is covered), and a gap analysis process (what is not covered and what to do about it).
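Expressed as a sketch, the three components reduce to two sets and a difference; the function and parameter names are illustrative:

```python
from itertools import product

def analyse_gaps(coverage_space, scenario_map):
    """coverage_space: dict of dimension name -> values (what should be covered).
    scenario_map: dict of cell tuple -> scenario ids (what is covered).
    Returns the uncovered cells; deciding what to do about each one is the
    remediate-or-accept decision governed by 4.3 and 4.5."""
    required = set(product(*coverage_space.values()))
    covered = {cell for cell, ids in scenario_map.items() if ids}
    return required - covered
```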
Recommended patterns:
- Derive the matrix dimensions from authoritative sources: the agent's declared capability list, the organisation's risk register, actual user population demographics, and the applicable regulatory obligations, rather than from the scenarios that happen to exist.
- Tag every scenario with matrix metadata at creation time, so that coverage recalculation is an automated query rather than a periodic archaeology exercise.
- Treat model, configuration, and deployment-scope changes as coverage events that trigger reassessment (4.8), not merely as release events.
- Weight scenario distribution against real population and traffic distributions; uniform spread across cells is not the goal when the operating distribution is not uniform.
Anti-patterns to avoid:
- Reading a high pass rate as high coverage. Scenario B's 96.3% pass rate measured quality within an outdated capability map, not coverage of the deployed agent.
- Building scenarios from functional requirements alone while regulatory and demographic dimensions go unmapped (Scenarios A and C).
- Leaving the coverage matrix static between annual reviews while the agent's capabilities and user base change underneath it.
- Counting scenarios instead of measuring their distribution. Scenario A's 320 scenarios meant little when 94% of them tested a single linguistic population.
Financial Services. Coverage matrices must include regulatory-specific dimensions: FCA conduct rules, MiFID II suitability requirements, AML/KYC obligations, and DORA ICT testing requirements. Coverage gaps in regulatory dimensions should be treated as critical no matter how rarely the associated capability is exercised, because regulatory compliance is assessed per requirement, not weighted by usage frequency.
Healthcare. Coverage must include clinical safety dimensions: medication safety, diagnostic accuracy, triage appropriateness, and patient communication clarity. Coverage gaps in clinical safety dimensions must be escalated to clinical governance rather than technical governance, as the risk assessment requires clinical expertise.
Public Sector. Coverage must include human rights and equality dimensions: coverage across protected characteristics (age, disability, gender, race, religion, sexual orientation), coverage of vulnerable populations, and coverage of accessibility requirements. The Equality Act 2010 and the Public Sector Equality Duty create specific coverage obligations that must be reflected in the matrix.
Basic Implementation — The organisation maintains a coverage matrix mapping scenarios against capabilities, risk categories, user populations, and regulatory requirements. Coverage density is calculated quarterly. Zero-coverage cells are flagged. Gap remediation is tracked with owners and target dates. Accepted gaps have documented justifications. This level meets the minimum mandatory requirements but gap detection and remediation are largely manual processes.
Intermediate Implementation — Coverage gap detection is automated: scenario library changes trigger matrix recalculation, and threshold breaches generate alerts. Agent configuration changes trigger reassessment workflows. Coverage reports are generated automatically and distributed to governance stakeholders. Adversarial coverage is included as a matrix dimension. Gap remediation follows a defined workflow with SLA targets (e.g., critical gaps addressed within 14 days, low gaps within 60 days).
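A sketch of the SLA check, using the example targets quoted above (14 days for critical gaps, 60 for low). The register layout is illustrative:

```python
from datetime import date, timedelta

SLA_DAYS = {"critical": 14, "low": 60}  # example targets from the text

def overdue_gaps(gap_register, today=None):
    """Return open gaps whose remediation deadline has passed. Each entry:
    {"cell": ..., "severity": ..., "opened": date, "closed": bool}."""
    today = today or date.today()
    overdue = []
    for gap in gap_register:
        if gap["closed"]:
            continue
        deadline = gap["opened"] + timedelta(days=SLA_DAYS[gap["severity"]])
        if today > deadline:
            overdue.append((gap["cell"], gap["severity"], deadline))
    return overdue
```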
Advanced Implementation — All intermediate capabilities plus: predictive gap detection uses production telemetry to identify capability areas where observed inputs diverge from scenario coverage. The coverage matrix supports intersectional analysis across multiple dimensions simultaneously (for example, a specific capability, for a specific population, under a specific regulatory requirement). Coverage trends are tracked over time and correlated with incident rates to validate that coverage improvements reduce real-world failures. External benchmarking compares coverage levels against industry peers or standards bodies.
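A sketch of the predictive detection idea, assuming scenario inputs and production inputs can both be embedded as vectors. The embedding step, array shapes, and distance threshold are assumptions rather than prescriptions:

```python
import numpy as np

def novel_inputs(production_embs, scenario_embs, threshold=0.5):
    """Flag production inputs whose nearest scenario is farther than
    `threshold`: regions of operating space the library never reaches.

    production_embs: (n, d) array of embedded production inputs.
    scenario_embs: (m, d) array of embedded scenario inputs.
    """
    # Pairwise Euclidean distances, shape (n, m).
    dists = np.linalg.norm(
        production_embs[:, None, :] - scenario_embs[None, :, :], axis=-1
    )
    nearest = dists.min(axis=1)            # distance to closest scenario
    return np.nonzero(nearest > threshold)[0]  # indices of novel inputs
```

Clusters of flagged inputs indicate a candidate new row in the capability or population dimension, feeding the material change trigger in 4.4.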
Required artefacts: the coverage matrix covering all mandatory dimensions (4.1); quarterly coverage density records (4.2); the gap register, including flagged critical gaps (4.3); risk acceptance documentation for accepted gaps (4.5); and remediation tracking records with owners, target dates, and completion evidence (4.6).
Retention requirements:
Access requirements:
Test 8.1: Matrix Completeness. Verify that the coverage matrix includes all four mandatory dimensions required by 4.1: capabilities, risk categories, user populations, and regulatory requirements.
Test 8.2: Zero-Coverage Cell Flagging. Introduce a matrix cell with no mapped scenarios and verify that it is flagged as a critical gap per 4.3 (a pytest-style sketch follows this list).
Test 8.3: Quarterly Review Cadence. Verify that coverage density has been calculated and recorded for every cell at least once per quarter, per 4.2.
Test 8.4: Material Change Trigger. Simulate a material change to the agent's capabilities, deployment context, user population, or regulatory requirements and verify that the matrix is updated within 30 days, per 4.4.
Test 8.5: Gap Remediation Tracking. Verify that every open gap carries a target date and an assigned owner, and that closed gaps carry completion evidence, per 4.6.
Test 8.6: Risk Acceptance Documentation. Verify that every accepted gap records the residual risk, the acceptance authority, and a review date, per 4.5.
Test 8.7: Coverage Density Accuracy. Verify that recorded density values match an independent count of the active, reproducible scenarios mapped to each cell, per 4.2.
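A pytest-style sketch of how Tests 8.2 and 8.7 might be automated. It assumes the build_matrix and gap_report functions from the earlier sketch are in scope, and the tag values are illustrative:

```python
def test_zero_coverage_cell_flagging():
    """Test 8.2: a cell with no mapped scenarios must surface as critical."""
    scenarios = [{"id": "s1", "active": True, "reproducible": True,
                  "tags": {"capability": "disclosure", "risk": "financial_harm",
                           "population": "esl", "regulation": "uk_gdpr"}}]
    critical, _ = gap_report(build_matrix(scenarios))
    assert ("disclosure", "financial_harm", "esl", "fca_conduct") in critical

def test_coverage_density_accuracy():
    """Test 8.7: density must count only active, reproducible scenarios."""
    tags = {"capability": "disclosure", "risk": "financial_harm",
            "population": "esl", "regulation": "uk_gdpr"}
    scenarios = [
        {"id": "s1", "active": True, "reproducible": True, "tags": tags},
        {"id": "s2", "active": False, "reproducible": True, "tags": tags},
    ]
    matrix = build_matrix(scenarios)
    assert len(matrix[("disclosure", "financial_harm", "esl", "uk_gdpr")]) == 1
```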
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Supports compliance |
| NIST AI RMF | MAP 2.3, MEASURE 2.5, MANAGE 2.2 | Supports compliance |
| ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 9.1 (Monitoring) | Supports compliance |
| FCA SYSC | 6.1.1R (Systems and Controls) | Supports compliance |
| DORA | Article 24 (Digital Operational Resilience Testing), Article 26 (Threat-Led Penetration Testing) | Direct requirement |
| Equality Act 2010 | Public Sector Equality Duty | Supports compliance |
Article 9 requires identification of known and foreseeable risks and the adoption of appropriate risk management measures. Coverage gap tracking is the mechanism that identifies which risks have been evaluated and which have not. An organisation that cannot demonstrate systematic coverage gap tracking cannot demonstrate compliance with the Article 9 requirement to address known and foreseeable risks — because it cannot demonstrate that it has identified all foreseeable risks in the first place.
Article 24 requires a comprehensive digital operational resilience testing programme. Article 26 requires threat-led penetration testing for significant financial entities. Coverage gap tracking ensures that testing programmes are demonstrably comprehensive (Article 24) and that adversarial testing covers the full threat landscape (Article 26). DORA specifically requires that testing be proportionate to risks — the coverage matrix provides the evidence base for demonstrating proportionality.
The Public Sector Equality Duty requires public authorities to have due regard to eliminating discrimination and advancing equality of opportunity. For AI agents deployed by public sector organisations, this translates to a requirement that evaluation coverage includes all protected characteristics. Coverage gap tracking against demographic dimensions directly supports this duty by ensuring that no protected group is excluded from evaluation coverage.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Organisation-wide — undetected coverage gaps affect the reliability of all evaluation and compliance activities |
Consequence chain: Without coverage gap tracking, the organisation operates with unknown blind spots in its evaluation programme. The immediate consequence is false assurance — evaluation pass rates reflect the quality of coverage, not the quality of the agent. The operational consequence is that failures occur in precisely the areas that were not tested, and these failures are undetected until they cause harm. The regulatory consequence is inability to demonstrate comprehensive testing — when a regulator asks "how do you know your testing covers X?", the organisation has no evidence-based answer. The compounding consequence is that coverage gaps tend to cluster around the most difficult and highest-risk areas (adversarial inputs, minority populations, regulatory edge cases), meaning that the areas most likely to produce harmful failures are the areas least likely to be tested.
Cross-references: AG-349 (Scenario Library Governance) provides the scenario inventory that the coverage matrix maps. AG-078 (Benchmark Coverage) defines the benchmark-level coverage requirements that this dimension operationalises. AG-103 (Red-Team Coverage Management) extends coverage tracking to adversarial dimensions. AG-353 (Benchmark Drift Governance) detects when coverage becomes stale relative to real operating conditions. AG-357 (Challenge Set Localisation Governance) addresses coverage gaps in localised contexts.