AG-078

Benchmark Coverage Governance

Lifecycle, Release & Change Governance · AGS v2.1 · April 2026
EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Benchmark Coverage Governance requires that every AI agent is evaluated against a defined, versioned benchmark suite that covers the agent's operational scope — including task performance, safety boundaries, governance compliance, edge cases, and adversarial scenarios — and that the benchmark suite is maintained to remain representative of the agent's actual operating conditions. This dimension addresses the risk that agents are evaluated against benchmarks that are too narrow, outdated, or misaligned with their real-world operating environment. A benchmark suite that covers 60% of an agent's operational scenarios provides assurance for 60% of its behaviour and no assurance for the remaining 40%. AG-078 requires organisations to measure, maintain, and improve benchmark coverage systematically, treating coverage gaps as governance risks that must be tracked and remediated.

3. Example

Scenario A — Benchmark Suite Misses Adversarial Scenarios: An organisation deploys a customer-facing agent after it passes a benchmark suite of 2,000 test cases covering common customer queries, product information accuracy, and response tone. The benchmark suite was developed during initial deployment and has not been updated. In production, the agent encounters adversarial inputs — customers attempting to extract internal pricing algorithms, competitors probing for proprietary information, and social engineering attempts to obtain other customers' data. None of these scenarios were in the benchmark suite. The agent handles 340 adversarial interactions over 3 months, and in 47 cases (14%) it discloses information it should not, including internal margin calculations for 12 products. The organisation discovers the issue only when a competitor publishes an analysis using the leaked pricing data.

What went wrong: The benchmark suite was designed for expected operational scenarios but did not cover adversarial scenarios. No benchmark coverage analysis was performed to identify gaps between the benchmark suite and the agent's actual operational profile. The 2,000-case benchmark provided a false sense of security because it tested only the scenarios the designers anticipated. Consequence: Proprietary pricing data leaked to competitors, competitive disadvantage estimated at £1.2 million in lost margin, mandatory agent redesign, regulator notified of data control failure.

Scenario B — Benchmark Drift From Operational Reality: A financial advisory agent is benchmarked against a suite of 5,000 test cases that was representative of the market environment at the time of deployment — interest rates at 4.5%, inflation at 2.3%, equity market volatility at historical averages. Eighteen months later, market conditions have shifted significantly: interest rates have risen to 6.8%, inflation is at 4.1%, and several asset classes have experienced unusual correlation changes. The benchmark suite still tests against the original market conditions. The agent's behaviour under current conditions has never been evaluated. A client receives advice that was correct under the original conditions but inappropriate under current conditions, resulting in a £180,000 portfolio loss.

What went wrong: The benchmark suite was not updated to reflect changes in the agent's operating environment. No mechanism existed to detect drift between benchmark conditions and real-world conditions. The benchmark coverage was measured in number of test cases rather than representativeness of the operational environment. Consequence: £180,000 client loss, FCA complaint, requirement to review all advice given during the coverage gap period, potential redress obligations.

Scenario C — Coverage Gap in Multi-Language Support: An enterprise deploys a multilingual support agent benchmarked against 3,000 English-language test cases and 500 test cases each in French, German, and Spanish. The agent also supports Japanese, Korean, and Mandarin, but no benchmark cases exist for these languages — they were added after the initial benchmark suite was created. The agent's performance in Japanese is significantly worse than in English (42% accuracy versus 94% accuracy on equivalent queries), but this is never detected because no benchmark measures it. Japanese-speaking customers experience consistently poor service for 7 months before complaint volume triggers an investigation.

What went wrong: The benchmark suite was not updated when the agent's operational scope expanded. Benchmark coverage analysis would have revealed the gap immediately — 3 of 7 supported languages had zero benchmark coverage. The organisation measured benchmark pass rate (high, because the suite only covered languages where the agent performed well) rather than benchmark coverage (incomplete, because the suite did not cover all operational scenarios). Consequence: 7 months of degraded service to Japanese-speaking customers, customer attrition, discrimination risk under equality legislation, mandatory remediation and re-benchmarking.

4. Requirement Statement

Scope: This dimension applies to all AI agents deployed in production environments. Every agent that performs tasks, generates outputs, makes recommendations, or takes actions in production must be evaluated against benchmarks that cover its operational scope. The scope extends to all operational dimensions of the agent: task performance (does the agent produce correct outputs?), safety (does the agent refuse unsafe requests and avoid harmful outputs?), governance compliance (does the agent operate within its governance boundaries?), edge cases (how does the agent behave at the boundaries of its operational domain?), and adversarial resilience (how does the agent respond to deliberate manipulation?). The scope includes benchmark coverage of all operational contexts: languages, jurisdictions, customer segments, data types, and environmental conditions. Agents in sandbox or experimental environments without production access are excluded. The test is whether the organisation relies on the agent's outputs for operational purposes — if so, the agent's benchmark coverage must be governed.

4.1. A conforming system MUST maintain a versioned benchmark suite for each production agent, covering at minimum: task performance across the agent's full operational scope, safety boundary compliance, governance control compliance, edge case handling, and adversarial input resilience.

4.2. A conforming system MUST measure and report benchmark coverage as a percentage of the agent's operational scope, using a defined coverage model that maps benchmark test cases to operational scenarios, and identify gaps where operational scenarios lack benchmark coverage.
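A coverage model of this kind can be sketched minimally in Python. The `Scenario` class, the test-case mapping, and the scenario identifiers below are all hypothetical; the point is that coverage is computed from an explicit mapping of test cases to operational scenarios, not asserted informally.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One node in the agent's operational scope taxonomy."""
    scenario_id: str
    safety_critical: bool = False

def coverage(scenarios, case_map):
    """Coverage = scenarios with at least one mapped test case / all scenarios."""
    covered = [s for s in scenarios if case_map.get(s.scenario_id)]
    gaps = [s.scenario_id for s in scenarios if not case_map.get(s.scenario_id)]
    pct = 100.0 * len(covered) / len(scenarios) if scenarios else 0.0
    return pct, gaps

scope = [Scenario("en-billing"), Scenario("ja-billing"),
         Scenario("prompt-injection", safety_critical=True)]
mapping = {"en-billing": ["tc-001", "tc-002"], "prompt-injection": ["tc-900"]}
pct, gaps = coverage(scope, mapping)
# 2 of 3 scenarios covered (~66.7%); gaps == ["ja-billing"]
```

The gap list, not the percentage, is the actionable output: each uncovered scenario is a tracked governance risk under 4.2.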

4.3. A conforming system MUST define minimum benchmark coverage thresholds for each agent and block deployment or continued operation when coverage falls below the threshold. The threshold must be at least 80% of identified operational scenarios for task performance, and at least 90% for safety-critical scenarios.
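The dual thresholds in 4.3 can be enforced as a simple deployment gate. A minimal sketch; the function name and the example figures are illustrative, while the 80%/90% floors come from the requirement itself.

```python
def deployment_gate(task_cov, safety_cov, task_min=80.0, safety_min=90.0):
    """Block deployment when either coverage figure is below its floor (4.3)."""
    failures = []
    if task_cov < task_min:
        failures.append(f"task coverage {task_cov:.0f}% < {task_min:.0f}%")
    if safety_cov < safety_min:
        failures.append(f"safety coverage {safety_cov:.0f}% < {safety_min:.0f}%")
    return (len(failures) == 0), failures

ok, reasons = deployment_gate(task_cov=85.0, safety_cov=88.0)
# ok is False: safety coverage 88% is below the 90% floor
```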

4.4. A conforming system MUST re-evaluate benchmark coverage whenever the agent's operational scope changes — including new capabilities, new languages, new jurisdictions, new data types, or new integration points — and expand the benchmark suite to cover new scope within 60 calendar days.

4.5. A conforming system MUST re-evaluate benchmark representativeness at least quarterly by comparing benchmark conditions (data distributions, environmental parameters, scenario frequencies) against actual operational data, and update the benchmark suite when drift between benchmark conditions and operational conditions exceeds a defined threshold.
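One way to quantify the drift that 4.5 targets is to compare the scenario-frequency distribution embedded in the benchmark suite against the distribution observed in production, for example with the Population Stability Index. The bins and the 0.25 alert threshold below are illustrative assumptions (0.25 is a common rule-of-thumb for significant drift), not values the protocol mandates.

```python
import math

def psi(expected, actual):
    """Population Stability Index between two binned frequency distributions."""
    eps = 1e-6  # guard against empty bins
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

benchmark_bins = [0.50, 0.30, 0.15, 0.05]   # scenario frequencies in the suite
production_bins = [0.20, 0.25, 0.30, 0.25]  # frequencies observed in production
drift = psi(benchmark_bins, production_bins)
if drift > 0.25:  # illustrative drift threshold
    print("benchmark update required")
```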

4.6. A conforming system MUST execute the full benchmark suite against the agent at least quarterly and upon any model version change, configuration change, or governance control modification.

4.7. A conforming system SHOULD implement automated benchmark generation that creates new test cases from production data (with appropriate anonymisation) to continuously expand coverage of real-world scenarios.
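The production-to-benchmark pipeline in 4.7 can be sketched as below, assuming regex-based redaction for anonymisation. The patterns, helper name, and example transcript are hypothetical; a real pipeline would use a vetted PII-detection tool rather than hand-rolled expressions.

```python
import re

# Illustrative redaction patterns; a real pipeline would use a vetted PII tool.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def to_benchmark_case(transcript, expected_behaviour):
    """Turn a production interaction into an anonymised benchmark case (4.7)."""
    text = transcript
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return {"input": text, "expected": expected_behaviour, "source": "production"}

case = to_benchmark_case(
    "Refund to jane.doe@example.com please",
    expected_behaviour="request order number before processing refund",
)
# case["input"] == "Refund to <EMAIL> please"
```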

4.8. A conforming system SHOULD implement coverage gap prioritisation — ranking uncovered operational scenarios by risk impact and likelihood to focus benchmark expansion on the highest-priority gaps first.
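Gap prioritisation under 4.8 can be as simple as ranking by an impact-times-likelihood risk score. The scenario names and scores below are invented for illustration.

```python
def prioritise_gaps(gaps):
    """Rank uncovered scenarios by risk score = impact x likelihood (4.8)."""
    return sorted(gaps, key=lambda g: g["impact"] * g["likelihood"], reverse=True)

gaps = [
    {"scenario": "ja-billing",       "impact": 3, "likelihood": 4},  # score 12
    {"scenario": "prompt-injection", "impact": 5, "likelihood": 3},  # score 15
    {"scenario": "rare-edge-case",   "impact": 4, "likelihood": 1},  # score 4
]
ranked = prioritise_gaps(gaps)
# ranked[0]["scenario"] == "prompt-injection"
```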

4.9. A conforming system SHOULD track benchmark performance trends over time, detecting gradual degradation that may not trigger individual test failures but indicates systematic drift.
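The gradual degradation that 4.9 describes can be caught with a least-squares slope over recent benchmark runs: every individual run may clear the pass bar while the trajectory points steadily downward. The pass rates and the alert threshold below are illustrative.

```python
def trend_slope(scores):
    """Least-squares slope of benchmark scores over successive runs (4.9)."""
    n = len(scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

quarterly_pass_rates = [97.0, 96.2, 95.5, 94.9, 94.1]  # all above a 90% bar
slope = trend_slope(quarterly_pass_rates)
if slope < -0.5:  # illustrative alert: losing more than 0.5 points per run
    print("systematic degradation suspected")
```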

4.10. A conforming system MAY implement adversarial benchmark generation — automated creation of adversarial test cases using techniques such as red-team prompting, fuzzing, and metamorphic testing — to continuously challenge the agent's defences.

5. Rationale

Benchmark Coverage Governance addresses the assurance gap between what an organisation tests and what an agent actually does. Benchmarks are the primary mechanism for establishing confidence that an agent behaves correctly, safely, and within governance boundaries. But a benchmark suite provides assurance only for the scenarios it covers. If the benchmark suite covers 60% of the agent's operational scenarios, the organisation has evidence-based confidence in 60% of the agent's behaviour and no evidence-based confidence in the remaining 40%.

This gap is often invisible. Organisations report benchmark pass rates — "our agent passes 97% of benchmark tests" — without reporting benchmark coverage — "our benchmark suite covers 65% of the agent's operational scenarios." A 97% pass rate on a 65%-coverage benchmark means the organisation has demonstrated correct behaviour for roughly 63% of operational scenarios (0.97 × 0.65), demonstrated failures in about 2%, and has no data at all on the remaining 35%. This creates a false sense of assurance that AG-078 is designed to prevent.
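The arithmetic behind this decomposition is worth making explicit:

```python
coverage = 0.65    # share of operational scenarios the suite covers
pass_rate = 0.97   # share of covered scenarios that pass

demonstrated_ok   = coverage * pass_rate        # ~0.63: evidence of correctness
demonstrated_fail = coverage * (1 - pass_rate)  # ~0.02: evidence of failure
no_evidence       = 1 - coverage                # 0.35: no evidence either way
```

Only the first term supports an assurance claim; the third term is the blind spot that pass-rate reporting hides.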

Three categories of coverage failure are common. First, static benchmarks that do not evolve with the agent's operating environment. Market conditions change, customer behaviour shifts, new attack vectors emerge, and the benchmark suite remains frozen at deployment conditions. Second, scope expansion without benchmark expansion. The agent gains new capabilities, supports new languages, enters new jurisdictions, or integrates with new systems, but the benchmark suite is not updated to cover the new scope. Third, adversarial coverage gaps. Benchmark suites are typically designed around expected operational scenarios, not adversarial ones. Adversarial scenarios — prompt injection, information extraction, social engineering, boundary probing — are among the highest-risk scenarios but the least likely to appear in organically developed benchmark suites.

AG-078 intersects with AG-022 (Behavioural Drift Detection) because benchmark execution is a primary mechanism for detecting drift — but only if the benchmark covers the scenarios where drift occurs. It intersects with AG-076 (Assurance Case Maintenance Governance) because benchmark results are key evidence artefacts in the assurance case — but only if the benchmark is representative. It intersects with AG-007 (Governance Configuration Control) because the benchmark suite itself is a governance configuration that must be versioned and change-controlled.

6. Implementation Guidance

AG-078 requires organisations to treat benchmark governance as a continuous process, not a one-time setup. The benchmark suite is a living artefact that must evolve with the agent and its environment.

Recommended patterns:

- Define an operational scope taxonomy for each agent and map every benchmark test case to a taxonomy node, so coverage is computed as a percentage rather than asserted informally.
- Track coverage percentage, trends, and gaps on a dashboard, and prioritise gap remediation by risk impact and likelihood.
- Augment the suite regularly with anonymised test cases derived from production interactions.
- Version and change-control the benchmark suite like any other governance configuration (see AG-007).

Anti-patterns to avoid:

- Reporting benchmark pass rate without reporting benchmark coverage; a high pass rate on a narrow suite is false assurance.
- Freezing the suite at deployment-time conditions while the operating environment drifts.
- Expanding the agent's scope (new languages, jurisdictions, capabilities, integrations) without expanding the benchmark suite.
- Designing the suite only around expected operational scenarios and leaving adversarial scenarios uncovered.

Industry Considerations

Financial Services. Benchmark suites for financial agents should cover: all supported instrument types, relevant market conditions (including stress scenarios aligned with regulatory stress testing frameworks such as the Bank of England's annual stress test scenarios), all supported jurisdictions and their regulatory requirements, and adversarial scenarios targeting financial manipulation (e.g., attempts to influence trading recommendations). The FCA expects model validation to include testing under a range of market conditions, not just current conditions. Benchmark coverage should be reported as part of the model risk management framework.

Healthcare. Benchmark suites for healthcare agents should cover: all supported clinical domains, patient demographics (age groups, comorbidity profiles, medication histories), clinical guideline versions, and adversarial scenarios (attempts to obtain inappropriate medical advice, drug interaction queries, emergency scenarios). Benchmark coverage must include rare but critical scenarios (e.g., paediatric dosing, drug allergies, contraindicated combinations). FDA guidance on clinical evaluation of AI/ML-based software requires testing across the intended patient population.

Critical Infrastructure. Benchmark suites for agents in critical infrastructure should cover: all supported operational modes (normal, degraded, emergency), all supported equipment types and configurations, environmental conditions (temperature ranges, load conditions, concurrent operations), and failure scenarios. Safety-critical benchmarks must cover the full range of conditions specified in the safety case. IEC 61508 requires validation testing across the operational profile, which maps directly to AG-078 coverage requirements.

Maturity Model

Basic Implementation — The organisation maintains a documented benchmark suite for each production agent. The suite covers primary task performance scenarios. Coverage is measured informally — the team has a general understanding of what is covered and what is not. The benchmark is executed on model changes and at least quarterly. Minimum coverage thresholds are defined and enforced, though enforcement may be manual rather than automated. This level meets the minimum mandatory requirements but lacks systematic coverage measurement and automated drift detection.

Intermediate Implementation — An operational scope taxonomy is defined for each agent. Benchmark coverage is measured as a percentage of taxonomy nodes with mapped test cases. A coverage dashboard tracks coverage percentage, trends, and gaps. Gap remediation is prioritised by risk. Benchmark condition drift is monitored automatically. Production-derived benchmark augmentation adds test cases from real-world interactions quarterly. The benchmark suite is executed quarterly and on all model or configuration changes. Coverage thresholds are enforced — deployment is blocked when coverage falls below the defined threshold.

Advanced Implementation — All intermediate capabilities plus: automated adversarial benchmark generation continuously expands the adversarial coverage. Machine learning-assisted coverage analysis identifies operational scenarios that are underrepresented in the benchmark suite. Layered benchmark execution provides continuous fast-feedback and periodic deep evaluation. The organisation tracks the correlation between benchmark coverage gaps and production incidents, using this data to refine the coverage model. Independent third-party benchmark assessment is conducted annually. The organisation can demonstrate comprehensive, current benchmark coverage across the full operational scope of every production agent, with quantified coverage metrics and documented gap remediation.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Testing AG-078 compliance requires verifying that benchmarks are comprehensive, representative, and maintained — not merely that they exist and pass.

Test 8.1: Coverage Measurement Accuracy

Test 8.2: Minimum Coverage Threshold Enforcement

Test 8.3: Scope Change Triggers Benchmark Update

Test 8.4: Benchmark Condition Representativeness

Test 8.5: Quarterly Execution Compliance

Test 8.6: Adversarial Scenario Coverage

Test 8.7: Benchmark Version Control

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Direct requirement
EU AI Act | Article 61 (Post-Market Monitoring) | Supports compliance
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement
NIST AI RMF | MAP 2.3, MEASURE 2.6, MANAGE 2.2 | Supports compliance
ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis, Evaluation) | Direct requirement
DORA | Article 25 (Testing of ICT Tools and Systems) | Direct requirement
FDA Guidance | AI/ML-Based Software as a Medical Device | Supports compliance

EU AI Act — Article 9 (Risk Management System)

Article 9 requires that the risk management system include "testing with a view to identifying the most appropriate risk management measures." AG-078 directly implements this requirement by requiring systematic benchmark testing that covers the agent's full operational scope. The coverage requirement ensures that testing is comprehensive rather than selective. Article 9's emphasis on testing throughout the lifecycle aligns with AG-078's quarterly re-evaluation and scope-change-triggered update requirements.

EU AI Act — Article 15 (Accuracy, Robustness, Cybersecurity)

Article 15 requires appropriate levels of accuracy, robustness, and cybersecurity. Benchmark coverage governs the evidence base for these claims. An accuracy claim is credible only to the extent that it is supported by benchmarks covering the relevant operational scenarios. AG-078 ensures that accuracy and robustness claims are backed by comprehensive, representative, and current benchmark evidence — not by narrow or outdated testing.

EU AI Act — Article 61 (Post-Market Monitoring)

Article 61 requires post-market monitoring. Ongoing benchmark execution against evolving benchmark suites is a primary mechanism for post-market monitoring of agent performance. AG-078's requirements for quarterly execution, condition drift monitoring, and production-derived benchmark augmentation directly implement continuous post-market evaluation.

SOX — Section 404 (Internal Controls Over Financial Reporting)

For AI agents in financial operations, benchmark testing provides the evidence base for management's assertion that controls over financial reporting are effective. The coverage requirement ensures that testing addresses the full scope of the agent's financial operations, not just a sample. Auditors will assess whether the benchmark suite is representative and current.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA expects firms to validate AI models under a range of conditions. The PRA's supervisory statement SS1/23 on model risk management specifically addresses ongoing model validation, including testing under stress conditions and changed market environments. AG-078's benchmark condition drift monitoring and representativeness requirements directly support this expectation. A firm that can demonstrate comprehensive, current benchmark coverage across its agents' operational scope is well-positioned for supervisory review.

DORA — Article 25 (Testing of ICT Tools and Systems)

DORA Article 25 requires financial entities to establish programmes for testing ICT tools and systems, including threat-led penetration testing. Benchmark testing of AI agents falls within this requirement. AG-078's adversarial benchmark coverage requirement supports the threat-led testing component. The quarterly execution cadence aligns with DORA's expectation of regular testing.

ISO 42001 — Clause 9.1 (Monitoring, Measurement, Analysis, Evaluation)

Clause 9.1 requires organisations to determine what needs to be monitored and measured for AI systems. Benchmark coverage is a primary metric for monitoring AI system performance and governance compliance. AG-078 defines the measurement methodology (operational scope taxonomy, coverage percentage), the measurement frequency (quarterly minimum), and the action thresholds (minimum coverage thresholds).

FDA Guidance — AI/ML-Based Software as a Medical Device

FDA guidance on AI/ML-based medical devices requires clinical evaluation across the intended patient population and use conditions. For AI agents that function as or support medical devices, benchmark coverage must include the full range of clinical scenarios, patient demographics, and use conditions. AG-078's operational scope taxonomy maps to the FDA's expectation of comprehensive clinical evaluation.

NIST AI RMF — MAP 2.3, MEASURE 2.6, MANAGE 2.2

MAP 2.3 addresses scientific integrity of AI metrics and methodologies. MEASURE 2.6 addresses the measurement of AI system performance. MANAGE 2.2 addresses risk mitigation through testing. AG-078 supports all three by requiring rigorous, comprehensive benchmark measurement as the foundation for performance and risk claims.

10. Failure Severity

Field | Value
Severity Rating | High
Blast Radius | Organisation-wide — affecting confidence in all production agents where benchmark coverage is inadequate

Consequence chain: When benchmark coverage is inadequate, the organisation has blind spots in its understanding of agent behaviour. The immediate consequence is that the agent operates in scenarios for which no evaluation evidence exists — its behaviour in those scenarios is unknown. If the agent behaves incorrectly in uncovered scenarios, the failure may persist undetected for months because no benchmark is watching for it. The delayed detection is the key amplifier: a customer-facing agent that handles adversarial queries incorrectly does so repeatedly until complaint volume or an external event triggers investigation. A financial agent that gives inappropriate advice under changed market conditions does so for every affected client until a loss event triggers review. The organisational consequence compounds: when an incident occurs in an area not covered by benchmarks, the post-incident review reveals that the organisation had no evidence of agent behaviour in that scenario. This undermines confidence in the entire benchmark programme — if this gap existed, what other gaps exist? Regulators and auditors interpret benchmark coverage gaps as systemic weaknesses in the governance programme. The remediation is expensive: the organisation must rapidly expand benchmark coverage, re-evaluate all agents, and potentially restrict operations until comprehensive coverage is demonstrated. In financial services, the FCA may require independent validation of benchmark coverage, adding time and cost to remediation.

Cross-references: AG-007 (Governance Configuration Control) — benchmark suites are governance configurations requiring version control. AG-022 (Behavioural Drift Detection) — benchmark execution is a primary drift detection mechanism, but only for covered scenarios. AG-076 (Assurance Case Maintenance Governance) — benchmark results are key evidence artefacts in the assurance case; coverage gaps undermine assurance case claims.

Cite this protocol
AgentGoverning. (2026). AG-078: Benchmark Coverage Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-078