Coverage Gap Tracking Governance requires that organisations systematically identify where their evaluations fail to cover important behaviours, user populations, deployment contexts, or risk categories. Coverage gaps are not merely the absence of tests — they are the absence of assurance. This dimension mandates a formal, continuous process for mapping evaluation coverage against the full space of agent capabilities and risks, detecting gaps, prioritising remediation, and tracking closure. The goal is not perfect coverage (which is infeasible) but transparent, risk-prioritised coverage with documented justification for any accepted gaps.
Scenario A — Demographic Coverage Gap Creates Bias Blind Spot: A customer-facing agent for a retail bank is evaluated against a library of 320 scenarios covering product recommendations, complaint handling, and regulatory disclosure. However, 94% of test inputs use standard British English. When the agent is deployed to branches serving communities where English is a second language, it produces significantly degraded responses: misinterpreting questions with non-standard grammar, failing to detect complaint intent when expressed indirectly, and providing disclosures at a reading level inappropriate for the audience. A coverage gap analysis would have revealed that only 6% of scenarios represented non-native English input, despite 31% of the customer base being non-native English speakers.
What went wrong: No systematic mapping existed between the evaluation scenario distribution and the actual user population demographics. The coverage gap was invisible because no one measured coverage against a defined coverage target. Consequence: 340 customer complaints over two months, regulatory inquiry into fair treatment of customers, remediation cost of £180,000 for scenario library expansion and agent retraining, and reputational damage in affected communities.
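A minimal sketch of the distribution check that would have surfaced this gap before deployment, assuming each scenario and each customer record carries a segment label. The labels, the counts, and the ten-point tolerance are all illustrative:

```python
from collections import Counter

def demographic_gap_report(scenario_tags, population_tags, tolerance=0.10):
    """Compare the demographic distribution of evaluation scenarios against
    the actual user population and flag under-represented segments."""
    scenario_counts = Counter(scenario_tags)
    population_counts = Counter(population_tags)
    n_scen, n_pop = len(scenario_tags), len(population_tags)

    gaps = []
    for segment, pop_count in population_counts.items():
        expected = pop_count / n_pop                       # share of real users
        actual = scenario_counts.get(segment, 0) / n_scen  # share of scenarios
        if expected - actual > tolerance:
            gaps.append((segment, actual, expected))
    return gaps

# Scenario A's numbers: roughly 6% of 320 scenarios, versus 31% of customers,
# involve non-native English speakers.
scenarios = ["esl"] * 19 + ["native_english"] * 301
population = ["esl"] * 31 + ["native_english"] * 69
for segment, actual, expected in demographic_gap_report(scenarios, population):
    print(f"GAP: '{segment}' is {actual:.0%} of scenarios but {expected:.0%} of users")
```

The check is deliberately crude: it measures coverage against a defined target (the population distribution), which is exactly what Scenario A's programme never did.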
Scenario B — Capability Gap Emerges After Model Update: An enterprise workflow agent is upgraded from one foundation model version to another. The scenario library, built for the previous version, covers 12 capability areas. The new model introduces two additional capabilities: multi-step reasoning chains and tool-use orchestration. No scenarios exist for these new capabilities. The organisation runs its existing evaluation and achieves a 96.3% pass rate. Three weeks after deployment, the agent begins producing incorrect outputs in multi-step reasoning tasks — a failure mode that was never tested because the coverage matrix had not been updated to include the new capabilities.
What went wrong: The coverage matrix was static and did not evolve with the agent's capabilities. The model update introduced new capabilities without triggering a coverage gap analysis. The 96.3% score created false assurance by measuring coverage against an outdated capability map. Consequence: 23 incorrect workflow decisions over three weeks before detection, £67,000 in rework costs, and loss of user trust requiring a six-month rebuilding programme.
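A sketch of the reassessment trigger (requirement 4.8) that Scenario B lacked: on every model or configuration change, diff the agent's declared capabilities against the capability dimension of the coverage matrix. The capability names are illustrative:

```python
def untested_capabilities(agent_capabilities, matrix_capabilities):
    """Return capabilities the agent now has but the coverage matrix does
    not include. Each one is an untested blind spot."""
    return sorted(set(agent_capabilities) - set(matrix_capabilities))

# Scenario B in miniature: the matrix covers 12 capability areas; the model
# update silently adds two more.
matrix_capabilities = {f"capability_{i}" for i in range(1, 13)}
agent_capabilities = matrix_capabilities | {
    "multi_step_reasoning",
    "tool_use_orchestration",
}

gaps = untested_capabilities(agent_capabilities, matrix_capabilities)
if gaps:
    print("Model change introduced untested capabilities; reassess coverage:")
    for capability in gaps:
        print(f"  - {capability}")
```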
Scenario C — Regulatory Coverage Gap Discovered During Audit: A public sector agent providing benefits eligibility guidance is audited by the Information Commissioner's Office (ICO). The auditor asks for evidence that the agent has been tested against each of the eight data protection principles. The organisation's scenario library contains 150 scenarios, but mapping them against the eight principles reveals that Principle 6 (rights of data subjects) has only 2 scenarios and Principle 8 (international transfers) has zero. The organisation cannot demonstrate that it has evaluated the agent's compliance with two fundamental data protection principles. The ICO issues an enforcement notice requiring remediation within 90 days.
What went wrong: Scenarios were developed based on functional requirements and known risks, but no systematic mapping against regulatory requirements was maintained. The coverage gap in regulatory compliance testing was invisible until an external audit exposed it. Consequence: ICO enforcement notice, 90-day remediation deadline, £45,000 in emergency consulting costs, and the agent is restricted to supervised operation until remediation is verified.
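A sketch of the per-requirement tally that would have surfaced the gap before the audit rather than during it. The principle labels and scenario counts are illustrative:

```python
from collections import Counter

def regulatory_coverage(scenarios, requirements):
    """Count scenarios tagged against each regulatory requirement; zero
    counts are exactly the gaps an auditor finds first."""
    counts = Counter(tag for s in scenarios for tag in s.get("reg_tags", []))
    return {req: counts.get(req, 0) for req in requirements}

principles = [f"principle_{i}" for i in range(1, 9)]
# Scenario C in miniature: principles 1-5 and 7 are covered, principle 6
# has only two scenarios, and principle 8 has none.
scenarios = [{"reg_tags": [f"principle_{i}"]}
             for i in (1, 2, 3, 4, 5, 7) for _ in range(20)]
scenarios += [{"reg_tags": ["principle_6"]}] * 2

for req, count in regulatory_coverage(scenarios, principles).items():
    if count == 0:
        print(f"CRITICAL GAP: no scenarios for {req}")
    elif count < 3:
        print(f"THIN COVERAGE: {req} has {count} scenario(s)")
```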
Scope: This dimension applies to all AI agent deployments that undergo evaluation or testing. The coverage gap tracking requirement applies regardless of whether the evaluation is manual, automated, or a combination. It extends to all dimensions of coverage: functional capabilities, risk categories, user populations, deployment contexts, regulatory requirements, and adversarial vectors. The scope includes pre-deployment evaluation, post-deployment monitoring, and ongoing compliance certification. It does not prescribe what level of coverage is sufficient — that is a risk-based decision for each organisation — but it requires that coverage levels be measured, gaps be identified, and decisions about acceptable gaps be documented and justified.
4.1. A conforming system MUST maintain a coverage matrix that maps evaluation scenarios against at least four dimensions: agent capabilities, risk categories, user populations or segments, and regulatory requirements.
4.2. A conforming system MUST calculate and record coverage density for each cell in the coverage matrix at least quarterly, where a cell's coverage density is the number of active, reproducible scenarios mapped to that cell (a minimal sketch follows this list).
4.3. A conforming system MUST flag any coverage matrix cell with zero scenarios as a critical gap requiring immediate remediation or documented risk acceptance.
4.4. A conforming system MUST update the coverage matrix within 30 days of any material change to the agent's capabilities, deployment context, user population, or applicable regulatory requirements.
4.5. A conforming system MUST document a risk-based justification for any coverage gap that is accepted rather than remediated, including the residual risk, the acceptance authority, and the review date.
4.6. A conforming system MUST track coverage gap remediation to closure with target dates, assigned owners, and completion evidence.
4.7. A conforming system SHOULD automate coverage gap detection by integrating the scenario library metadata with the coverage matrix, flagging cells that fall below a defined threshold (e.g., fewer than 3 scenarios).
4.8. A conforming system SHOULD trigger a coverage gap reassessment automatically when the agent's model, configuration, or deployment scope changes.
4.9. A conforming system SHOULD include adversarial coverage as a dimension in the matrix, mapping red-team scenarios against known attack categories (e.g., prompt injection, data extraction, privilege escalation).
4.10. A conforming system MAY implement predictive gap detection using production telemetry to identify capability areas where the agent encounters inputs that differ significantly from any scenario in the library.
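The following sketch illustrates 4.1 to 4.3 and 4.7 together, assuming each scenario carries metadata tags for every matrix dimension. The dimension values and tag layout are illustrative, and the threshold of 3 is simply the example given in 4.7:

```python
from collections import defaultdict
from itertools import product

# The four mandatory dimensions from 4.1, with illustrative values.
DIMENSIONS = {
    "capability": ["product_recommendation", "complaint_handling", "disclosure"],
    "risk": ["financial_harm", "unfair_treatment"],
    "population": ["native_english", "esl"],
    "regulation": ["fca_conduct", "uk_gdpr"],
}

def build_matrix(scenarios):
    """Map each active, reproducible scenario into the matrix cell named by
    its metadata tags. A cell is a tuple with one value per dimension."""
    matrix = defaultdict(list)
    for s in scenarios:
        if not (s.get("active") and s.get("reproducible")):
            continue  # 4.2 counts only active, reproducible scenarios
        cell = tuple(s["tags"][d] for d in DIMENSIONS)
        matrix[cell].append(s["id"])
    return matrix

def gap_report(matrix, threshold=3):
    """4.2: density is the scenario count per cell. 4.3: zero-density cells
    are critical gaps. 4.7: cells below the threshold are thin coverage."""
    critical, thin = [], []
    for cell in product(*DIMENSIONS.values()):
        density = len(matrix.get(cell, []))
        if density == 0:
            critical.append(cell)
        elif density < threshold:
            thin.append((cell, density))
    return critical, thin
```

Because the cell keys and the enumeration both iterate DIMENSIONS in the same insertion order, density is a simple count per tuple key; any scenario-library format that stores per-scenario metadata can feed the same structure.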
Evaluation coverage is not binary — it exists on a spectrum from zero coverage (no testing at all) to comprehensive coverage (every meaningful combination of capability, context, and risk has been evaluated). Most organisations operate somewhere in the middle, but without formal gap tracking, they do not know where on the spectrum they sit, which areas are well-covered, and which areas are dangerously exposed.
Coverage gaps are particularly insidious because they are invisible by default. A test suite that passes 100% of its scenarios provides maximum confidence in the scenarios it tests — and zero information about the scenarios it does not test. An evaluation programme that achieves a 98% pass rate might cover only 40% of the agent's actual operating conditions. The 98% is real but misleading; it measures quality within the evaluated space, not the size of the evaluated space relative to the total operating space.
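To make the arithmetic concrete (the 98% and 40% figures are the ones used above, and behaviour in the untested 60% of the operating space is simply unknown):

```python
pass_rate_in_tested_space = 0.98
coverage_fraction = 0.40

# The evaluation bounds overall reliability only loosely: everything in the
# untested region could fail, or everything could pass.
worst_case = pass_rate_in_tested_space * coverage_fraction   # 0.392
best_case = worst_case + (1 - coverage_fraction)             # 0.992
print(f"Overall reliability lies somewhere in [{worst_case:.1%}, {best_case:.1%}]")
```

An interval nearly sixty points wide is all that a 98% pass rate licenses at 40% coverage; narrowing it requires widening coverage, not improving the pass rate.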
The coverage matrix approach addresses this by making the total operating space visible. By mapping scenarios against capabilities, risks, populations, and regulations, the matrix reveals both what is tested and what is not. A cell with zero scenarios is an explicit acknowledgement of a blind spot. A cell with one scenario is a single-point-of-assurance dependency — if that scenario is poorly specified, the entire coverage for that cell collapses.
The requirement for documented risk acceptance of gaps (4.5) is crucial. Not every gap can or should be filled — resources are finite and some combinations may be genuinely low-risk. But the decision to accept a gap must be conscious, risk-informed, and attributable. "We didn't know this gap existed" is a governance failure. "We identified this gap, assessed the residual risk as low because of mitigating controls X and Y, and the risk owner accepted it on this date" is a governance success, even if the gap remains open.
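A sketch of the record that 4.5 requires, capturing residual risk, acceptance authority, and review date. The field names and example values are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class GapAcceptance:
    """Documented risk acceptance for a coverage gap (requirement 4.5)."""
    cell: tuple                 # the coverage matrix cell left uncovered
    residual_risk: str          # the assessed residual risk, with reasoning
    mitigating_controls: list   # controls that justify the assessment
    accepted_by: str            # the acceptance authority
    accepted_on: date
    review_date: date           # acceptance expires; the gap is re-examined

acceptance = GapAcceptance(
    cell=("tool_use_orchestration", "financial_harm", "esl", "uk_gdpr"),
    residual_risk="low: capability disabled in production configuration",
    mitigating_controls=["feature flag off", "human review of tool calls"],
    accepted_by="Head of Model Risk",
    accepted_on=date(2025, 3, 1),
    review_date=date(2025, 9, 1),
)
```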
The material change trigger (4.4) reflects the reality that coverage gaps are not static. Every model update, capability expansion, new user population, or regulatory change can create new gaps in previously adequate coverage. Without a trigger mechanism, coverage degrades silently between review cycles.
Effective coverage gap tracking requires three components: a defined coverage space (what should be covered), a mapping from scenarios to that space (what is covered), and a gap analysis process (what is not covered and what to do about it).
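Expressed as a sketch, the three components reduce to two sets and a difference; the function and parameter names are illustrative:

```python
from itertools import product

def analyse_gaps(coverage_space, scenario_map):
    """coverage_space: dict of dimension name -> values (what should be covered).
    scenario_map: dict of cell tuple -> scenario ids (what is covered).
    Returns the uncovered cells; deciding what to do about each one is the
    remediate-or-accept decision governed by 4.3 and 4.5."""
    required = set(product(*coverage_space.values()))
    covered = {cell for cell, ids in scenario_map.items() if ids}
    return required - covered
```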
Recommended patterns:
- Derive the matrix dimensions from authoritative sources: the agent's declared capability list, the organisation's risk register, actual user population demographics, and the applicable regulatory obligations, rather than from the scenarios that happen to exist.
- Tag every scenario with matrix metadata at creation time, so that coverage recalculation is an automated query rather than a periodic archaeology exercise.
- Treat model, configuration, and deployment-scope changes as coverage events that trigger reassessment (4.8), not merely as release events.
- Weight scenario distribution against real population and traffic distributions; uniform spread across cells is not the goal when the operating distribution is not uniform.
Anti-patterns to avoid:
- Reading a high pass rate as high coverage. Scenario B's 96.3% pass rate measured quality within an outdated capability map, not coverage of the deployed agent.
- Building scenarios from functional requirements alone while regulatory and demographic dimensions go unmapped (Scenarios A and C).
- Leaving the coverage matrix static between annual reviews while the agent's capabilities and user base change underneath it.
- Counting scenarios instead of measuring their distribution. Scenario A's 320 scenarios meant little when 94% of them tested a single linguistic population.
Financial Services. Coverage matrices must include regulatory-specific dimensions: FCA conduct rules, MiFID II suitability requirements, AML/KYC obligations, and DORA ICT testing requirements. Coverage gaps in regulatory dimensions should be treated as critical no matter how rarely the associated capability is exercised, because regulatory compliance is assessed per requirement, not weighted by usage frequency.
Healthcare. Coverage must include clinical safety dimensions: medication safety, diagnostic accuracy, triage appropriateness, and patient communication clarity. Coverage gaps in clinical safety dimensions must be escalated to clinical governance rather than technical governance, as the risk assessment requires clinical expertise.
Public Sector. Coverage must include human rights and equality dimensions: coverage across protected characteristics (age, disability, gender, race, religion, sexual orientation), coverage of vulnerable populations, and coverage of accessibility requirements. The Equality Act 2010 and the Public Sector Equality Duty create specific coverage obligations that must be reflected in the matrix.
Basic Implementation — The organisation maintains a coverage matrix mapping scenarios against capabilities, risk categories, user populations, and regulatory requirements. Coverage density is calculated quarterly. Zero-coverage cells are flagged. Gap remediation is tracked with owners and target dates. Accepted gaps have documented justifications. This level meets the minimum mandatory requirements but gap detection and remediation are largely manual processes.
Intermediate Implementation — Coverage gap detection is automated: scenario library changes trigger matrix recalculation, and threshold breaches generate alerts. Agent configuration changes trigger reassessment workflows. Coverage reports are generated automatically and distributed to governance stakeholders. Adversarial coverage is included as a matrix dimension. Gap remediation follows a defined workflow with SLA targets (e.g., critical gaps addressed within 14 days, low gaps within 60 days).
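A sketch of the SLA check, using the example targets quoted above (14 days for critical gaps, 60 for low). The register layout is illustrative:

```python
from datetime import date, timedelta

SLA_DAYS = {"critical": 14, "low": 60}  # example targets from the text

def overdue_gaps(gap_register, today=None):
    """Return open gaps whose remediation deadline has passed. Each entry:
    {"cell": ..., "severity": ..., "opened": date, "closed": bool}."""
    today = today or date.today()
    overdue = []
    for gap in gap_register:
        if gap["closed"]:
            continue
        deadline = gap["opened"] + timedelta(days=SLA_DAYS[gap["severity"]])
        if today > deadline:
            overdue.append((gap["cell"], gap["severity"], deadline))
    return overdue
```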
Advanced Implementation — All intermediate capabilities plus: predictive gap detection uses production telemetry to identify capability areas where observed inputs diverge from scenario coverage. The coverage matrix supports intersectional analysis across multiple dimensions simultaneously (for example, a specific capability, for a specific population, under a specific regulatory requirement). Coverage trends are tracked over time and correlated with incident rates to validate that coverage improvements reduce real-world failures. External benchmarking compares coverage levels against industry peers or standards bodies.
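A sketch of the predictive detection idea, assuming scenario inputs and production inputs can both be embedded as vectors. The embedding step, array shapes, and distance threshold are assumptions rather than prescriptions:

```python
import numpy as np

def novel_inputs(production_embs, scenario_embs, threshold=0.5):
    """Flag production inputs whose nearest scenario is farther than
    `threshold`: regions of operating space the library never reaches.

    production_embs: (n, d) array of embedded production inputs.
    scenario_embs: (m, d) array of embedded scenario inputs.
    """
    # Pairwise Euclidean distances, shape (n, m).
    dists = np.linalg.norm(
        production_embs[:, None, :] - scenario_embs[None, :, :], axis=-1
    )
    nearest = dists.min(axis=1)            # distance to closest scenario
    return np.nonzero(nearest > threshold)[0]  # indices of novel inputs
```

Clusters of flagged inputs indicate a candidate new row in the capability or population dimension, feeding the material change trigger in 4.4.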
Required artefacts: the coverage matrix covering all mandatory dimensions (4.1); quarterly coverage density records (4.2); the gap register, including flagged critical gaps (4.3); risk acceptance documentation for accepted gaps (4.5); and remediation tracking records with owners, target dates, and completion evidence (4.6).
Retention requirements:
Access requirements:
Test 8.1: Matrix Completeness. Verify that the coverage matrix includes all four mandatory dimensions required by 4.1: capabilities, risk categories, user populations, and regulatory requirements.
Test 8.2: Zero-Coverage Cell Flagging. Introduce a matrix cell with no mapped scenarios and verify that it is flagged as a critical gap per 4.3 (a pytest-style sketch follows this list).
Test 8.3: Quarterly Review Cadence. Verify that coverage density has been calculated and recorded for every cell at least once per quarter, per 4.2.
Test 8.4: Material Change Trigger. Simulate a material change to the agent's capabilities, deployment context, user population, or regulatory requirements and verify that the matrix is updated within 30 days, per 4.4.
Test 8.5: Gap Remediation Tracking. Verify that every open gap carries a target date and an assigned owner, and that closed gaps carry completion evidence, per 4.6.
Test 8.6: Risk Acceptance Documentation. Verify that every accepted gap records the residual risk, the acceptance authority, and a review date, per 4.5.
Test 8.7: Coverage Density Accuracy. Verify that recorded density values match an independent count of the active, reproducible scenarios mapped to each cell, per 4.2.
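A pytest-style sketch of how Tests 8.2 and 8.7 might be automated. It assumes the build_matrix and gap_report functions from the earlier sketch are in scope, and the tag values are illustrative:

```python
def test_zero_coverage_cell_flagging():
    """Test 8.2: a cell with no mapped scenarios must surface as critical."""
    scenarios = [{"id": "s1", "active": True, "reproducible": True,
                  "tags": {"capability": "disclosure", "risk": "financial_harm",
                           "population": "esl", "regulation": "uk_gdpr"}}]
    critical, _ = gap_report(build_matrix(scenarios))
    assert ("disclosure", "financial_harm", "esl", "fca_conduct") in critical

def test_coverage_density_accuracy():
    """Test 8.7: density must count only active, reproducible scenarios."""
    tags = {"capability": "disclosure", "risk": "financial_harm",
            "population": "esl", "regulation": "uk_gdpr"}
    scenarios = [
        {"id": "s1", "active": True, "reproducible": True, "tags": tags},
        {"id": "s2", "active": False, "reproducible": True, "tags": tags},
    ]
    matrix = build_matrix(scenarios)
    assert len(matrix[("disclosure", "financial_harm", "esl", "uk_gdpr")]) == 1
```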
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Supports compliance |
| NIST AI RMF | MAP 2.3, MEASURE 2.5, MANAGE 2.2 | Supports compliance |
| ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 9.1 (Monitoring) | Supports compliance |
| FCA SYSC | 6.1.1R (Systems and Controls) | Supports compliance |
| DORA | Article 24 (Digital Operational Resilience Testing), Article 26 (Threat-Led Penetration Testing) | Direct requirement |
| Equality Act 2010 | Public Sector Equality Duty | Supports compliance |
Article 9 requires identification of known and foreseeable risks and the adoption of appropriate risk management measures. Coverage gap tracking is the mechanism that identifies which risks have been evaluated and which have not. An organisation that cannot demonstrate systematic coverage gap tracking cannot demonstrate compliance with the Article 9 requirement to address known and foreseeable risks — because it cannot demonstrate that it has identified all foreseeable risks in the first place.
Article 24 requires a comprehensive digital operational resilience testing programme. Article 26 requires threat-led penetration testing for significant financial entities. Coverage gap tracking ensures that testing programmes are demonstrably comprehensive (Article 24) and that adversarial testing covers the full threat landscape (Article 26). DORA specifically requires that testing be proportionate to risks — the coverage matrix provides the evidence base for demonstrating proportionality.
The Public Sector Equality Duty requires public authorities to have due regard to eliminating discrimination and advancing equality of opportunity. For AI agents deployed by public sector organisations, this translates to a requirement that evaluation coverage includes all protected characteristics. Coverage gap tracking against demographic dimensions directly supports this duty by ensuring that no protected group is excluded from evaluation coverage.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Organisation-wide — undetected coverage gaps affect the reliability of all evaluation and compliance activities |
Consequence chain: Without coverage gap tracking, the organisation operates with unknown blind spots in its evaluation programme. The immediate consequence is false assurance — evaluation pass rates reflect the quality of coverage, not the quality of the agent. The operational consequence is that failures occur in precisely the areas that were not tested, and these failures are undetected until they cause harm. The regulatory consequence is inability to demonstrate comprehensive testing — when a regulator asks "how do you know your testing covers X?", the organisation has no evidence-based answer. The compounding consequence is that coverage gaps tend to cluster around the most difficult and highest-risk areas (adversarial inputs, minority populations, regulatory edge cases), meaning that the areas most likely to produce harmful failures are the areas least likely to be tested.
Cross-references: AG-349 (Scenario Library Governance) provides the scenario inventory that the coverage matrix maps. AG-078 (Benchmark Coverage) defines the benchmark-level coverage requirements that this dimension operationalises. AG-103 (Red-Team Coverage Management) extends coverage tracking to adversarial dimensions. AG-353 (Benchmark Drift Governance) detects when coverage becomes stale relative to real operating conditions. AG-357 (Challenge Set Localisation Governance) addresses coverage gaps in localised contexts.