AG-078

Benchmark Coverage Governance

Lifecycle, Release & Change Governance · AGS v2.1 · April 2026
EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Benchmark Coverage Governance requires that every AI agent is evaluated against a defined, versioned benchmark suite that covers the agent's operational scope — including task performance, safety boundaries, governance compliance, edge cases, and adversarial scenarios — and that the benchmark suite is maintained to remain representative of the agent's actual operating conditions. This dimension addresses the risk that agents are evaluated against benchmarks that are too narrow, outdated, or misaligned with their real-world operating environment. A benchmark suite that covers 60% of an agent's operational scenarios provides assurance for 60% of its behaviour and no assurance for the remaining 40%. AG-078 requires organisations to measure, maintain, and improve benchmark coverage systematically, treating coverage gaps as governance risks that must be tracked and remediated.

3. Example

Scenario A — Benchmark Suite Misses Adversarial Scenarios: An organisation deploys a customer-facing agent after it passes a benchmark suite of 2,000 test cases covering common customer queries, product information accuracy, and response tone. The benchmark suite was developed during initial deployment and has not been updated. In production, the agent encounters adversarial inputs — customers attempting to extract internal pricing algorithms, competitors probing for proprietary information, and social engineering attempts to obtain other customers' data. None of these scenarios were in the benchmark suite. The agent handles 340 adversarial interactions over 3 months, and in 47 cases (14%) it discloses information it should not, including internal margin calculations for 12 products. The organisation discovers the issue only when a competitor publishes an analysis using the leaked pricing data.

What went wrong: The benchmark suite was designed for expected operational scenarios but did not cover adversarial scenarios. No benchmark coverage analysis was performed to identify gaps between the benchmark suite and the agent's actual operational profile. The 2,000-case benchmark provided a false sense of security because it tested only the scenarios the designers anticipated. Consequence: Proprietary pricing data leaked to competitors, competitive disadvantage estimated at £1.2 million in lost margin, mandatory agent redesign, regulator notified of data control failure.

Scenario B — Benchmark Drift From Operational Reality: A financial advisory agent is benchmarked against a suite of 5,000 test cases that was representative of the market environment at the time of deployment — interest rates at 4.5%, inflation at 2.3%, equity market volatility at historical averages. Eighteen months later, market conditions have shifted significantly: interest rates have risen to 6.8%, inflation is at 4.1%, and several asset classes have experienced unusual correlation changes. The benchmark suite still tests against the original market conditions. The agent's behaviour under current conditions has never been evaluated. A client receives advice that was correct under the original conditions but inappropriate under current conditions, resulting in a £180,000 portfolio loss.

What went wrong: The benchmark suite was not updated to reflect changes in the agent's operating environment. No mechanism existed to detect drift between benchmark conditions and real-world conditions. The benchmark coverage was measured in number of test cases rather than representativeness of the operational environment. Consequence: £180,000 client loss, FCA complaint, requirement to review all advice given during the coverage gap period, potential redress obligations.

Scenario C — Coverage Gap in Multi-Language Support: An enterprise deploys a multilingual support agent benchmarked against 3,000 English-language test cases and 500 test cases each in French, German, and Spanish. The agent also supports Japanese, Korean, and Mandarin, but no benchmark cases exist for these languages — they were added after the initial benchmark suite was created. The agent's performance in Japanese is significantly worse than in English (42% accuracy versus 94% accuracy on equivalent queries), but this is never detected because no benchmark measures it. Japanese-speaking customers experience consistently poor service for 7 months before complaint volume triggers an investigation.

What went wrong: The benchmark suite was not updated when the agent's operational scope expanded. Benchmark coverage analysis would have revealed the gap immediately — 3 of 7 supported languages had zero benchmark coverage. The organisation measured benchmark pass rate (high, because the suite only covered languages where the agent performed well) rather than benchmark coverage (incomplete, because the suite did not cover all operational scenarios). Consequence: 7 months of degraded service to Japanese-speaking customers, customer attrition, discrimination risk under equality legislation, mandatory remediation and re-benchmarking.

4. Requirement Statement

Scope: This dimension applies to all AI agents deployed in production environments. Every agent that performs tasks, generates outputs, makes recommendations, or takes actions in production must be evaluated against benchmarks that cover its operational scope. The scope extends to all operational dimensions of the agent: task performance (does the agent produce correct outputs?), safety (does the agent refuse unsafe requests and avoid harmful outputs?), governance compliance (does the agent operate within its governance boundaries?), edge cases (how does the agent behave at the boundaries of its operational domain?), and adversarial resilience (how does the agent respond to deliberate manipulation?). The scope includes benchmark coverage of all operational contexts: languages, jurisdictions, customer segments, data types, and environmental conditions. Agents in sandbox or experimental environments without production access are excluded. The test is whether the organisation relies on the agent's outputs for operational purposes — if so, the agent's benchmark coverage must be governed.

4.1. A conforming system MUST maintain a versioned benchmark suite for each production agent, covering at minimum: task performance across the agent's full operational scope, safety boundary compliance, governance control compliance, edge case handling, and adversarial input resilience.

4.2. A conforming system MUST measure and report benchmark coverage as a percentage of the agent's operational scope, using a defined coverage model that maps benchmark test cases to operational scenarios, and identify gaps where operational scenarios lack benchmark coverage.
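A coverage model of this kind can be sketched minimally in Python. The `Scenario` class, the test-case mapping, and the scenario identifiers below are all hypothetical; the point is that coverage is computed from an explicit mapping of test cases to operational scenarios, not asserted informally.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One node in the agent's operational scope taxonomy."""
    scenario_id: str
    safety_critical: bool = False

def coverage(scenarios, case_map):
    """Coverage = scenarios with at least one mapped test case / all scenarios."""
    covered = [s for s in scenarios if case_map.get(s.scenario_id)]
    gaps = [s.scenario_id for s in scenarios if not case_map.get(s.scenario_id)]
    pct = 100.0 * len(covered) / len(scenarios) if scenarios else 0.0
    return pct, gaps

scope = [Scenario("en-billing"), Scenario("ja-billing"),
         Scenario("prompt-injection", safety_critical=True)]
mapping = {"en-billing": ["tc-001", "tc-002"], "prompt-injection": ["tc-900"]}
pct, gaps = coverage(scope, mapping)
# 2 of 3 scenarios covered (~66.7%); gaps == ["ja-billing"]
```

The gap list, not the percentage, is the actionable output: each uncovered scenario is a tracked governance risk under 4.2.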

4.3. A conforming system MUST define minimum benchmark coverage thresholds for each agent and block deployment or continued operation when coverage falls below the threshold. The threshold must be at least 80% of identified operational scenarios for task performance, and at least 90% for safety-critical scenarios.
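The dual thresholds in 4.3 can be enforced as a simple deployment gate. A minimal sketch; the function name and the example figures are illustrative, while the 80%/90% floors come from the requirement itself.

```python
def deployment_gate(task_cov, safety_cov, task_min=80.0, safety_min=90.0):
    """Block deployment when either coverage figure is below its floor (4.3)."""
    failures = []
    if task_cov < task_min:
        failures.append(f"task coverage {task_cov:.0f}% < {task_min:.0f}%")
    if safety_cov < safety_min:
        failures.append(f"safety coverage {safety_cov:.0f}% < {safety_min:.0f}%")
    return (len(failures) == 0), failures

ok, reasons = deployment_gate(task_cov=85.0, safety_cov=88.0)
# ok is False: safety coverage 88% is below the 90% floor
```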

4.4. A conforming system MUST re-evaluate benchmark coverage whenever the agent's operational scope changes — including new capabilities, new languages, new jurisdictions, new data types, or new integration points — and expand the benchmark suite to cover new scope within 60 calendar days.

4.5. A conforming system MUST re-evaluate benchmark representativeness at least quarterly by comparing benchmark conditions (data distributions, environmental parameters, scenario frequencies) against actual operational data, and update the benchmark suite when drift between benchmark conditions and operational conditions exceeds a defined threshold.
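One way to quantify the drift that 4.5 targets is to compare the scenario-frequency distribution embedded in the benchmark suite against the distribution observed in production, for example with the Population Stability Index. The bins and the 0.25 alert threshold below are illustrative assumptions (0.25 is a common rule-of-thumb for significant drift), not values the protocol mandates.

```python
import math

def psi(expected, actual):
    """Population Stability Index between two binned frequency distributions."""
    eps = 1e-6  # guard against empty bins
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

benchmark_bins = [0.50, 0.30, 0.15, 0.05]   # scenario frequencies in the suite
production_bins = [0.20, 0.25, 0.30, 0.25]  # frequencies observed in production
drift = psi(benchmark_bins, production_bins)
if drift > 0.25:  # illustrative drift threshold
    print("benchmark update required")
```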

4.6. A conforming system MUST execute the full benchmark suite against the agent at least quarterly and upon any model version change, configuration change, or governance control modification.

4.7. A conforming system SHOULD implement automated benchmark generation that creates new test cases from production data (with appropriate anonymisation) to continuously expand coverage of real-world scenarios.
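The production-to-benchmark pipeline in 4.7 can be sketched as below, assuming regex-based redaction for anonymisation. The patterns, helper name, and example transcript are hypothetical; a real pipeline would use a vetted PII-detection tool rather than hand-rolled expressions.

```python
import re

# Illustrative redaction patterns; a real pipeline would use a vetted PII tool.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def to_benchmark_case(transcript, expected_behaviour):
    """Turn a production interaction into an anonymised benchmark case (4.7)."""
    text = transcript
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return {"input": text, "expected": expected_behaviour, "source": "production"}

case = to_benchmark_case(
    "Refund to jane.doe@example.com please",
    expected_behaviour="request order number before processing refund",
)
# case["input"] == "Refund to <EMAIL> please"
```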

4.8. A conforming system SHOULD implement coverage gap prioritisation — ranking uncovered operational scenarios by risk impact and likelihood to focus benchmark expansion on the highest-priority gaps first.
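Gap prioritisation under 4.8 can be as simple as ranking by an impact-times-likelihood risk score. The scenario names and scores below are invented for illustration.

```python
def prioritise_gaps(gaps):
    """Rank uncovered scenarios by risk score = impact x likelihood (4.8)."""
    return sorted(gaps, key=lambda g: g["impact"] * g["likelihood"], reverse=True)

gaps = [
    {"scenario": "ja-billing",       "impact": 3, "likelihood": 4},  # score 12
    {"scenario": "prompt-injection", "impact": 5, "likelihood": 3},  # score 15
    {"scenario": "rare-edge-case",   "impact": 4, "likelihood": 1},  # score 4
]
ranked = prioritise_gaps(gaps)
# ranked[0]["scenario"] == "prompt-injection"
```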

4.9. A conforming system SHOULD track benchmark performance trends over time, detecting gradual degradation that may not trigger individual test failures but indicates systematic drift.
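The gradual degradation that 4.9 describes can be caught with a least-squares slope over recent benchmark runs: every individual run may clear the pass bar while the trajectory points steadily downward. The pass rates and the alert threshold below are illustrative.

```python
def trend_slope(scores):
    """Least-squares slope of benchmark scores over successive runs (4.9)."""
    n = len(scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

quarterly_pass_rates = [97.0, 96.2, 95.5, 94.9, 94.1]  # all above a 90% bar
slope = trend_slope(quarterly_pass_rates)
if slope < -0.5:  # illustrative alert: losing more than 0.5 points per run
    print("systematic degradation suspected")
```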

4.10. A conforming system MAY implement adversarial benchmark generation — automated creation of adversarial test cases using techniques such as red-team prompting, fuzzing, and metamorphic testing — to continuously challenge the agent's defences.

5. Rationale

Benchmark Coverage Governance addresses the assurance gap between what an organisation tests and what an agent actually does. Benchmarks are the primary mechanism for establishing confidence that an agent behaves correctly, safely, and within governance boundaries. But a benchmark suite provides assurance only for the scenarios it covers. If the benchmark suite covers 60% of the agent's operational scenarios, the organisation has evidence-based confidence in 60% of the agent's behaviour and no evidence-based confidence in the remaining 40%.

This gap is often invisible. Organisations report benchmark pass rates — "our agent passes 97% of benchmark tests" — without reporting benchmark coverage — "our benchmark suite covers 65% of the agent's operational scenarios." A 97% pass rate on a 65%-coverage benchmark means the organisation has demonstrated correct behaviour for roughly 63% of operational scenarios (0.97 × 0.65), demonstrated failures in about 2%, and has no data at all on the remaining 35%. This creates a false sense of assurance that AG-078 is designed to prevent.
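The arithmetic behind this decomposition is worth making explicit:

```python
coverage = 0.65    # share of operational scenarios the suite covers
pass_rate = 0.97   # share of covered scenarios that pass

demonstrated_ok   = coverage * pass_rate        # ~0.63: evidence of correctness
demonstrated_fail = coverage * (1 - pass_rate)  # ~0.02: evidence of failure
no_evidence       = 1 - coverage                # 0.35: no evidence either way
```

Only the first term supports an assurance claim; the third term is the blind spot that pass-rate reporting hides.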

Three categories of coverage failure are common. First, static benchmarks that do not evolve with the agent's operating environment. Market conditions change, customer behaviour shifts, new attack vectors emerge, and the benchmark suite remains frozen at deployment conditions. Second, scope expansion without benchmark expansion. The agent gains new capabilities, supports new languages, enters new jurisdictions, or integrates with new systems, but the benchmark suite is not updated to cover the new scope. Third, adversarial coverage gaps. Benchmark suites are typically designed around expected operational scenarios, not adversarial ones. Adversarial scenarios — prompt injection, information extraction, social engineering, boundary probing — are among the highest-risk scenarios but the least likely to appear in organically developed benchmark suites.

AG-078 intersects with AG-022 (Behavioural Drift Detection) because benchmark execution is a primary mechanism for detecting drift — but only if the benchmark covers the scenarios where drift occurs. It intersects with AG-076 (Assurance Case Maintenance Governance) because benchmark results are key evidence artefacts in the assurance case — but only if the benchmark is representative. It intersects with AG-007 (Governance Configuration Control) because the benchmark suite itself is a governance configuration that must be versioned and change-controlled.

6. Implementation Guidance

AG-078 requires organisations to treat benchmark governance as a continuous process, not a one-time setup. The benchmark suite is a living artefact that must evolve with the agent and its environment.

Recommended patterns:

- Define an operational scope taxonomy for each agent and map every benchmark test case to a taxonomy node, so coverage is computed as a percentage rather than asserted informally.
- Track coverage percentage, trends, and gaps on a dashboard, and prioritise gap remediation by risk impact and likelihood.
- Augment the suite regularly with anonymised test cases derived from production interactions.
- Version and change-control the benchmark suite like any other governance configuration (see AG-007).

Anti-patterns to avoid:

- Reporting benchmark pass rate without reporting benchmark coverage; a high pass rate on a narrow suite is false assurance.
- Freezing the suite at deployment-time conditions while the operating environment drifts.
- Expanding the agent's scope (new languages, jurisdictions, capabilities, integrations) without expanding the benchmark suite.
- Designing the suite only around expected operational scenarios and leaving adversarial scenarios uncovered.

Industry Considerations

Financial Services. Benchmark suites for financial agents should cover: all supported instrument types, relevant market conditions (including stress scenarios aligned with regulatory stress testing frameworks such as the Bank of England's annual stress test scenarios), all supported jurisdictions and their regulatory requirements, and adversarial scenarios targeting financial manipulation (e.g., attempts to influence trading recommendations). The FCA expects model validation to include testing under a range of market conditions, not just current conditions. Benchmark coverage should be reported as part of the model risk management framework.

Healthcare. Benchmark suites for healthcare agents should cover: all supported clinical domains, patient demographics (age groups, comorbidity profiles, medication histories), clinical guideline versions, and adversarial scenarios (attempts to obtain inappropriate medical advice, drug interaction queries, emergency scenarios). Benchmark coverage must include rare but critical scenarios (e.g., paediatric dosing, drug allergies, contraindicated combinations). FDA guidance on clinical evaluation of AI/ML-based software requires testing across the intended patient population.

Critical Infrastructure. Benchmark suites for agents in critical infrastructure should cover: all supported operational modes (normal, degraded, emergency), all supported equipment types and configurations, environmental conditions (temperature ranges, load conditions, concurrent operations), and failure scenarios. Safety-critical benchmarks must cover the full range of conditions specified in the safety case. IEC 61508 requires validation testing across the operational profile, which maps directly to AG-078 coverage requirements.

Maturity Model

Basic Implementation — The organisation maintains a documented benchmark suite for each production agent. The suite covers primary task performance scenarios. Coverage is measured informally — the team has a general understanding of what is covered and what is not. The benchmark is executed on model changes and at least quarterly. Minimum coverage thresholds are defined and enforced, though enforcement may be manual rather than automated. This level meets the minimum mandatory requirements but lacks systematic coverage measurement and automated drift detection.

Intermediate Implementation — An operational scope taxonomy is defined for each agent. Benchmark coverage is measured as a percentage of taxonomy nodes with mapped test cases. A coverage dashboard tracks coverage percentage, trends, and gaps. Gap remediation is prioritised by risk. Benchmark condition drift is monitored automatically. Production-derived benchmark augmentation adds test cases from real-world interactions quarterly. The benchmark suite is executed quarterly and on all model or configuration changes. Coverage thresholds are enforced — deployment is blocked when coverage falls below the defined threshold.

Advanced Implementation — All intermediate capabilities plus: automated adversarial benchmark generation continuously expands the adversarial coverage. Machine learning-assisted coverage analysis identifies operational scenarios that are underrepresented in the benchmark suite. Layered benchmark execution provides continuous fast-feedback and periodic deep evaluation. The organisation tracks the correlation between benchmark coverage gaps and production incidents, using this data to refine the coverage model. Independent third-party benchmark assessment is conducted annually. The organisation can demonstrate comprehensive, current benchmark coverage across the full operational scope of every production agent, with quantified coverage metrics and documented gap remediation.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Testing AG-078 compliance requires verifying that benchmarks are comprehensive, representative, and maintained — not merely that they exist and pass.

Test 8.1: Coverage Measurement Accuracy

Test 8.2: Minimum Coverage Threshold Enforcement

Test 8.3: Scope Change Triggers Benchmark Update

Test 8.4: Benchmark Condition Representativeness

Test 8.5: Quarterly Execution Compliance

Test 8.6: Adversarial Scenario Coverage

Test 8.7: Benchmark Version Control

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Direct requirement
EU AI Act | Article 61 (Post-Market Monitoring) | Supports compliance
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | MAP 2.3, MEASURE 2.6, MANAGE 2.2 | Supports compliance
ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis, Evaluation) | Direct requirement
DORA | Article 25 (Testing of ICT Tools and Systems) | Direct requirement
FDA Guidance | AI/ML-Based Software as a Medical Device | Supports compliance

EU AI Act — Article 9 (Risk Management System)

Article 9 requires that the risk management system include "testing with a view to identifying the most appropriate risk management measures." AG-078 directly implements this requirement by requiring systematic benchmark testing that covers the agent's full operational scope. The coverage requirement ensures that testing is comprehensive rather than selective. Article 9's emphasis on testing throughout the lifecycle aligns with AG-078's quarterly re-evaluation and scope-change-triggered update requirements.

EU AI Act — Article 15 (Accuracy, Robustness, Cybersecurity)

Article 15 requires appropriate levels of accuracy, robustness, and cybersecurity. Benchmark coverage governs the evidence base for these claims. An accuracy claim is credible only to the extent that it is supported by benchmarks covering the relevant operational scenarios. AG-078 ensures that accuracy and robustness claims are backed by comprehensive, representative, and current benchmark evidence — not by narrow or outdated testing.

EU AI Act — Article 61 (Post-Market Monitoring)

Article 61 requires post-market monitoring. Ongoing benchmark execution against evolving benchmark suites is a primary mechanism for post-market monitoring of agent performance. AG-078's requirements for quarterly execution, condition drift monitoring, and production-derived benchmark augmentation directly implement continuous post-market evaluation.

SOX — Section 404 (Internal Controls Over Financial Reporting)

For AI agents in financial operations, benchmark testing provides the evidence base for management's assertion that controls over financial reporting are effective. The coverage requirement ensures that testing addresses the full scope of the agent's financial operations, not just a sample. Auditors will assess whether the benchmark suite is representative and current.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA expects firms to validate AI models under a range of conditions. The PRA's supervisory statement SS1/23 on model risk management specifically addresses ongoing model validation, including testing under stress conditions and changed market environments. AG-078's benchmark condition drift monitoring and representativeness requirements directly support this expectation. A firm that can demonstrate comprehensive, current benchmark coverage across its agents' operational scope is well-positioned for supervisory review.

DORA — Article 25 (Testing of ICT Tools and Systems)

DORA Article 25 requires financial entities to establish programmes for testing ICT tools and systems, including threat-led penetration testing. Benchmark testing of AI agents falls within this requirement. AG-078's adversarial benchmark coverage requirement supports the threat-led testing component. The quarterly execution cadence aligns with DORA's expectation of regular testing.

ISO 42001 — Clause 9.1 (Monitoring, Measurement, Analysis, Evaluation)

Clause 9.1 requires organisations to determine what needs to be monitored and measured for AI systems. Benchmark coverage is a primary metric for monitoring AI system performance and governance compliance. AG-078 defines the measurement methodology (operational scope taxonomy, coverage percentage), the measurement frequency (quarterly minimum), and the action thresholds (minimum coverage thresholds).

FDA Guidance — AI/ML-Based Software as a Medical Device

FDA guidance on AI/ML-based medical devices requires clinical evaluation across the intended patient population and use conditions. For AI agents that function as or support medical devices, benchmark coverage must include the full range of clinical scenarios, patient demographics, and use conditions. AG-078's operational scope taxonomy maps to the FDA's expectation of comprehensive clinical evaluation.

NIST AI RMF — MAP 2.3, MEASURE 2.6, MANAGE 2.2

MAP 2.3 addresses scientific integrity of AI metrics and methodologies. MEASURE 2.6 addresses the measurement of AI system performance. MANAGE 2.2 addresses risk mitigation through testing. AG-078 supports all three by requiring rigorous, comprehensive benchmark measurement as the foundation for performance and risk claims.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — affecting confidence in all production agents where benchmark coverage is inadequate

Consequence chain: When benchmark coverage is inadequate, the organisation has blind spots in its understanding of agent behaviour. The immediate consequence is that the agent operates in scenarios for which no evaluation evidence exists — its behaviour in those scenarios is unknown. If the agent behaves incorrectly in uncovered scenarios, the failure may persist undetected for months because no benchmark is watching for it. The delayed detection is the key amplifier: a customer-facing agent that handles adversarial queries incorrectly does so repeatedly until complaint volume or an external event triggers investigation. A financial agent that gives inappropriate advice under changed market conditions does so for every affected client until a loss event triggers review. The organisational consequence compounds: when an incident occurs in an area not covered by benchmarks, the post-incident review reveals that the organisation had no evidence of agent behaviour in that scenario. This undermines confidence in the entire benchmark programme — if this gap existed, what other gaps exist? Regulators and auditors interpret benchmark coverage gaps as systemic weaknesses in the governance programme. The remediation is expensive: the organisation must rapidly expand benchmark coverage, re-evaluate all agents, and potentially restrict operations until comprehensive coverage is demonstrated. In financial services, the FCA may require independent validation of benchmark coverage, adding time and cost to remediation.

Cross-references: AG-007 (Governance Configuration Control) — benchmark suites are governance configurations requiring version control. AG-022 (Behavioural Drift Detection) — benchmark execution is a primary drift detection mechanism, but only for covered scenarios. AG-076 (Assurance Case Maintenance Governance) — benchmark results are key evidence artefacts in the assurance case; coverage gaps undermine assurance case claims.

Cite this protocol
AgentGoverning. (2026). AG-078: Benchmark Coverage Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-078