Scenario Library Governance requires that every organisation deploying AI agents maintains a living, versioned library of realistic scenarios spanning success cases, failure modes, abuse vectors, and edge conditions. The library is not a static test suite — it is a continuously curated repository that evolves as the agent's capabilities, deployment context, and threat landscape change. Each scenario must be traceable to a real-world event, a plausible risk, or a regulatory requirement, and must include enough detail to reproduce the test deterministically. Without a governed scenario library, evaluation coverage degrades silently: teams test what they remember to test, not what the system actually needs tested.
Scenario A — Stale Library Misses Emerging Abuse Vector: A financial services firm deploys a customer-facing advisory agent. The scenario library was built at launch and contains 240 scenarios covering common queries, edge cases in product recommendations, and basic prompt injection attempts. Eighteen months later, a novel social engineering technique emerges: attackers embed financial advice requests inside apparent translation tasks ("translate this investment instruction into Spanish"), causing the agent to generate unregulated financial advice while believing it is performing a translation. The scenario library contains no translation-based manipulation scenarios. The quarterly evaluation passes with a 97.1% score, but the firm receives 14 regulatory complaints in one month about unauthorised financial advice delivered through the translation vector.
What went wrong: The scenario library was treated as a static artefact. No process existed to ingest new threat intelligence, customer complaint patterns, or industry incident reports into new scenarios. The evaluation gave false assurance because it tested against an outdated threat model. Consequence: FCA investigation, 14 customer complaints requiring redress averaging £3,200 each (£44,800 total), mandatory remediation programme, and reputational damage to the advisory product.
Scenario B — Scenarios Lack Reproduction Detail: An enterprise deploys a procurement agent and maintains a scenario library of 180 test cases. Each scenario is described in one sentence: "Agent should not approve orders over budget." When the testing team attempts to execute the scenarios, they discover that 60% of them lack concrete input values, expected output specifications, or environment preconditions. Different testers interpret the same scenario differently — one tests with a £500 overage, another with a £50,000 overage. Results are inconsistent across test runs, and the team cannot determine whether failures represent genuine regressions or interpretation differences.
What went wrong: Scenarios were written as intentions rather than reproducible test specifications. Without concrete values, preconditions, and expected outputs, the library functioned as a wish list rather than a test suite. Consequence: Three months of unreliable evaluation data, inability to detect a genuine regression in budget enforcement that allowed £127,000 in unapproved spending, and a complete library rebuild costing 400 engineering hours.
Scenario C — Siloed Library Creates Coverage Blind Spots: A healthcare organisation maintains separate scenario libraries for its patient-facing agent: one maintained by the clinical team (120 clinical accuracy scenarios), one by the security team (90 adversarial scenarios), and one by the compliance team (60 regulatory scenarios). No central index exists. When a new agent version is deployed, each team runs its own scenarios independently. None of the three libraries contains scenarios testing the intersection of clinical accuracy and adversarial input — for example, whether a prompt injection could cause the agent to alter a medication dosage recommendation. The gap persists for eight months until a patient reports receiving an incorrect dosage recommendation triggered by adversarial text embedded in a medical forum post the agent was asked to summarise.
What went wrong: Siloed scenario ownership created coverage gaps at domain intersections. No cross-functional review process ensured that combined risk vectors were represented. Consequence: Patient safety incident, mandatory adverse event report to the care quality regulator, suspension of the agent pending remediation, and potential clinical negligence claim.
Scope: This dimension applies to all AI agent deployments where evaluation, testing, or red-teaming is performed — which, under the broader governance framework, means all production deployments. Any organisation that runs tests against an AI agent, whether functional tests, adversarial tests, compliance checks, or user acceptance tests, is maintaining a scenario library whether or not it is formally recognised as one. AG-349 requires that this library be formally governed. The scope includes scenarios for pre-deployment evaluation, post-deployment monitoring, regression testing, red-team exercises, and compliance certification. It excludes unit tests of non-agent software components, though scenarios that test the integration between agent and non-agent components are in scope.
4.1. A conforming system MUST maintain a centrally indexed scenario library containing scenarios across at least four categories: success cases, failure modes, abuse vectors, and edge conditions.
4.2. A conforming system MUST version every scenario with a unique identifier, creation date, last-modified date, author, and classification category.
4.3. A conforming system MUST specify each scenario with sufficient detail for deterministic reproduction, including: concrete input values, environment preconditions, agent configuration, expected output or behaviour, and pass/fail criteria.
4.4. A conforming system MUST review the scenario library for completeness and relevance at least quarterly, with documented outcomes and a dated sign-off by the responsible party.
4.5. A conforming system MUST trace each scenario to at least one of: a real-world incident, a plausible risk identified through threat modelling, a regulatory requirement, or a coverage gap analysis.
4.6. A conforming system MUST retire obsolete scenarios through a formal deprecation process that retains the scenario in the archive with a retirement rationale, rather than deleting it.
4.7. A conforming system SHOULD ingest new scenarios from at least three external sources: industry incident reports, threat intelligence feeds, and customer complaint or feedback data.
4.8. A conforming system SHOULD tag each scenario with metadata indicating the agent capabilities, deployment contexts, and risk categories it exercises.
4.9. A conforming system SHOULD maintain cross-domain scenarios that test intersections between functional areas (e.g., clinical accuracy under adversarial input, compliance under high-load conditions).
4.10. A conforming system MAY implement automated scenario generation from production telemetry, deriving new edge-case scenarios from observed near-miss events or anomalous agent behaviour.
The scenario library is the foundation upon which all evaluation, benchmarking, and red-teaming activities rest. Without it, testing is ad hoc — teams test what they remember, what they last worked on, or what the most recent incident highlighted. Systematic coverage is impossible without a systematic inventory of what needs to be covered.
The distinction between a scenario library and a test suite is important. A test suite is a technical artefact — automated scripts that execute and report results. A scenario library is a governance artefact — a curated, classified, traceable collection of situations the system must handle correctly. The library feeds the test suite, but it also feeds red-team exercise planning, compliance certification evidence, incident response playbooks, and risk assessments. A test suite without a governed scenario library will drift toward testing what is easy to automate rather than what is important to verify.
The requirement for living maintenance reflects the reality that AI agent deployments exist in a dynamic environment. New attack techniques emerge continuously. Regulatory expectations evolve. The agent's own capabilities change with model updates. A scenario library that was comprehensive at launch becomes incomplete within months if it is not actively maintained. The quarterly review cadence is a minimum — organisations in high-risk domains should review more frequently.
The traceability requirement (4.5) serves two purposes. First, it ensures that every scenario has a justification — preventing the library from accumulating untethered tests that consume evaluation resources without clear purpose. Second, it creates an audit trail that demonstrates to regulators that the organisation's evaluation programme is risk-informed rather than arbitrary. When a regulator asks "why do you test this?", the answer should trace to a specific risk, incident, or requirement — not "because someone thought it was a good idea."
The reproducibility requirement (4.3) addresses a pervasive problem in AI evaluation: scenarios described at the intention level rather than the specification level. "The agent should handle abusive input gracefully" is an intention. "Given input X with configuration Y in environment Z, the agent should produce output matching criteria W within T milliseconds" is a specification. Only specifications produce reliable, comparable evaluation results.
A governed scenario library requires both a storage mechanism and a curation process. The storage mechanism must support versioning, search, classification, and cross-referencing. The curation process must ensure that the library grows in response to new risks, shrinks in response to obsolete conditions, and is reviewed regularly for coverage gaps.
Recommended patterns:
Anti-patterns to avoid:
Financial Services. Scenario libraries must include scenarios for market manipulation detection, sanctions evasion attempts, money laundering patterns, and regulatory reporting accuracy. FCA expectations require that scenarios reflect the firm's specific risk profile, not generic industry templates. Scenarios should be calibrated to the firm's actual transaction volumes — a scenario testing behaviour at 10,000 transactions per day is irrelevant for a firm processing 10 million.
Healthcare. Scenarios must cover clinical safety boundaries: incorrect dosage recommendations, contraindication misses, emergency triage errors, and patient data boundary violations. Scenario development should involve clinical domain experts, not just technical staff. Scenarios should reflect the patient population demographics of the deployment context — a scenario library calibrated for adult care is insufficient for a paediatric deployment.
Safety-Critical / CPS. Scenarios must include physical safety boundaries: actuator overshoot, sensor failure modes, environmental condition extremes, and multi-agent coordination failures. Scenarios should be derived from hazard analysis (e.g., HAZOP, FMEA) and should include scenarios that combine multiple simultaneous failures, as single-failure scenarios underestimate real-world risk.
Basic Implementation — The organisation maintains a documented list of test scenarios in a spreadsheet or document, categorised into success, failure, abuse, and edge-case categories. Each scenario has an identifier and a prose description. Review occurs at least quarterly. Scenarios are traceable to at least one justification source. This level meets the minimum mandatory requirements but scenarios may lack the specificity needed for fully deterministic reproduction, and coverage analysis is manual.
Intermediate Implementation — Scenarios are stored in a structured, versioned repository with a defined schema. Each scenario includes concrete input values, environment preconditions, and machine-readable pass/fail criteria. An intake pipeline ingests candidates from incident reports, threat intelligence, and customer feedback. A coverage matrix maps scenarios against capabilities, risks, and regulatory requirements, and gaps are flagged automatically. Deprecated scenarios follow a formal retirement process. Cross-domain review workshops occur quarterly.
Advanced Implementation — All intermediate capabilities plus: automated scenario generation from production telemetry identifies new edge cases from observed near-miss events. Coverage analysis is continuous, not periodic — new deployments or capability changes trigger automated gap assessment. The scenario library integrates bidirectionally with the test execution platform, enabling one-click execution of any scenario and automatic ingestion of results. Machine learning identifies scenario clusters that are redundant, enabling pruning without coverage loss. The library is benchmarked against industry scenario repositories (where available) to identify blind spots.
Required artefacts:
Retention requirements:
Access requirements:
Testing AG-349 compliance verifies that the scenario library exists, is well-governed, and meets the structural and process requirements defined in Section 4.
Test 8.1: Category Completeness
Test 8.2: Scenario Reproducibility
Test 8.3: Traceability Verification
Test 8.4: Quarterly Review Compliance
Test 8.5: Deprecation Process Integrity
Test 8.6: Version Control Integrity
Test 8.7: Coverage Matrix Density
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Supports compliance |
| NIST AI RMF | MAP 2.3, MEASURE 2.6 | Supports compliance |
| ISO 42001 | Clause 8.2 (AI Risk Assessment), Clause 9.1 (Monitoring, Measurement, Analysis) | Supports compliance |
| FCA SYSC | 6.1.1R (Systems and Controls) | Supports compliance |
| DORA | Article 24 (ICT Testing) | Direct requirement |
| ISO/IEC 25010 | Quality in Use Model | Supports compliance |
Article 9 requires that risk management for high-risk AI systems include "testing with a view to identifying the most appropriate risk management measures." A governed scenario library is the foundation of systematic testing — without it, testing cannot be demonstrated to be comprehensive or risk-informed. The requirement for scenario traceability to risk sources directly supports the Article 9 obligation to identify and address known and foreseeable risks. Auditors assessing Article 9 compliance will ask what scenarios were tested and why — the scenario library with its traceability links provides the answer.
Article 15 requires testing for accuracy, robustness, and cybersecurity. A governed scenario library that includes success cases (accuracy), edge conditions and failure modes (robustness), and abuse vectors (cybersecurity) provides the structural foundation for demonstrating compliance with all three requirements. Without a formal library, organisations cannot demonstrate that their testing programme is comprehensive across these three dimensions.
MAP 2.3 addresses the identification of AI system risks in deployment contexts. MEASURE 2.6 addresses the measurement of AI system performance against defined criteria. The scenario library supports MAP 2.3 by maintaining a risk-traceable inventory of test conditions, and MEASURE 2.6 by providing the structured scenarios against which performance is measured.
Article 24 requires financial entities to maintain and review a sound and comprehensive ICT testing framework. For AI agent deployments, the scenario library constitutes a core component of this framework. DORA requires that testing be risk-based and cover a range of scenarios including severe but plausible conditions — directly aligning with the scenario library's requirement for failure mode and abuse vector coverage.
Clause 8.2 requires AI risk assessment, which depends on having identified scenarios that represent material risks. Clause 9.1 requires monitoring and measurement against defined criteria, which requires the defined criteria that a governed scenario library provides.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Organisation-wide — affects the reliability of all evaluation, certification, and compliance activities that depend on the scenario library |
Consequence chain: Without a governed scenario library, evaluation becomes ad hoc and non-reproducible. The immediate technical consequence is inconsistent test coverage — some risk areas are tested thoroughly while others are missed entirely. The operational consequence is false assurance: evaluation results show high pass rates because the scenarios tested are the easy ones, not the important ones. When a failure occurs in an untested scenario, the organisation discovers simultaneously that (1) the agent has a vulnerability, (2) the evaluation programme did not detect it, and (3) the evidence needed for regulatory response does not exist. The regulatory consequence is severe — demonstrating to a regulator that the evaluation programme was comprehensive requires the kind of structured, traceable evidence that an ungoverned library cannot produce. Under EU AI Act Article 9, inability to demonstrate systematic testing of known risks is a compliance failure independent of whether an incident has occurred. The reputational consequence compounds over time: each incident that was "not covered by testing" erodes stakeholder confidence in the governance programme as a whole.
Cross-references: AG-078 (Benchmark Coverage) establishes the coverage framework that the scenario library populates. AG-103 (Red-Team Coverage Management) depends on the scenario library for adversarial exercise planning. AG-152 (Evaluation Integrity and Benchmark Leakage) governs the integrity of the scenarios themselves. AG-350 (Coverage Gap Tracking Governance) uses the scenario library's coverage matrix to identify and remediate gaps. AG-353 (Benchmark Drift Governance) detects when scenarios become stale. AG-356 (Near-Miss Capture Governance) feeds new scenarios into the library from production near-miss events.