AG-349: Scenario Library Governance

2. Summary

Scenario Library Governance requires that every organisation deploying AI agents maintains a living, versioned library of realistic scenarios spanning success cases, failure modes, abuse vectors, and edge conditions. The library is not a static test suite — it is a continuously curated repository that evolves as the agent's capabilities, deployment context, and threat landscape change. Each scenario must be traceable to a real-world event, a plausible risk, or a regulatory requirement, and must include enough detail to reproduce the test deterministically. Without a governed scenario library, evaluation coverage degrades silently: teams test what they remember to test, not what the system actually needs tested.

3. Example

Scenario A — Stale Library Misses Emerging Abuse Vector: A financial services firm deploys a customer-facing advisory agent. The scenario library was built at launch and contains 240 scenarios covering common queries, edge cases in product recommendations, and basic prompt injection attempts. Eighteen months later, a novel social engineering technique emerges: attackers embed financial advice requests inside apparent translation tasks ("translate this investment instruction into Spanish"), causing the agent to generate unregulated financial advice while believing it is performing a translation. The scenario library contains no translation-based manipulation scenarios. The quarterly evaluation passes with a 97.1% score, but the firm receives 14 regulatory complaints in one month about unauthorised financial advice delivered through the translation vector.

What went wrong: The scenario library was treated as a static artefact. No process existed to ingest new threat intelligence, customer complaint patterns, or industry incident reports into new scenarios. The evaluation gave false assurance because it tested against an outdated threat model. Consequence: FCA investigation, 14 customer complaints requiring redress averaging £3,200 each (£44,800 total), mandatory remediation programme, and reputational damage to the advisory product.

Scenario B — Scenarios Lack Reproduction Detail: An enterprise deploys a procurement agent and maintains a scenario library of 180 test cases. Each scenario is described in one sentence: "Agent should not approve orders over budget." When the testing team attempts to execute the scenarios, they discover that 60% of them lack concrete input values, expected output specifications, or environment preconditions. Different testers interpret the same scenario differently — one tests with a £500 overage, another with a £50,000 overage. Results are inconsistent across test runs, and the team cannot determine whether failures represent genuine regressions or interpretation differences.

What went wrong: Scenarios were written as intentions rather than reproducible test specifications. Without concrete values, preconditions, and expected outputs, the library functioned as a wish list rather than a test suite. Consequence: Three months of unreliable evaluation data, inability to detect a genuine regression in budget enforcement that allowed £127,000 in unapproved spending, and a complete library rebuild costing 400 engineering hours.

Scenario C — Siloed Library Creates Coverage Blind Spots: A healthcare organisation maintains separate scenario libraries for its patient-facing agent: one maintained by the clinical team (120 clinical accuracy scenarios), one by the security team (90 adversarial scenarios), and one by the compliance team (60 regulatory scenarios). No central index exists. When a new agent version is deployed, each team runs its own scenarios independently. None of the three libraries contains scenarios testing the intersection of clinical accuracy and adversarial input — for example, whether a prompt injection could cause the agent to alter a medication dosage recommendation. The gap persists for eight months until a patient reports receiving an incorrect dosage recommendation triggered by adversarial text embedded in a medical forum post the agent was asked to summarise.

What went wrong: Siloed scenario ownership created coverage gaps at domain intersections. No cross-functional review process ensured that combined risk vectors were represented. Consequence: Patient safety incident, mandatory adverse event report to the care quality regulator, suspension of the agent pending remediation, and potential clinical negligence claim.

4. Requirement Statement

Scope: This dimension applies to all AI agent deployments where evaluation, testing, or red-teaming is performed — which, under the broader governance framework, means all production deployments. Any organisation that runs tests against an AI agent, whether functional tests, adversarial tests, compliance checks, or user acceptance tests, is maintaining a scenario library whether or not it is formally recognised as one. AG-349 requires that this library be formally governed. The scope includes scenarios for pre-deployment evaluation, post-deployment monitoring, regression testing, red-team exercises, and compliance certification. It excludes unit tests of non-agent software components, though scenarios that test the integration between agent and non-agent components are in scope.

4.1. A conforming system MUST maintain a centrally indexed scenario library containing scenarios across at least four categories: success cases, failure modes, abuse vectors, and edge conditions.

4.2. A conforming system MUST version every scenario with a unique identifier, creation date, last-modified date, author, and classification category.

4.3. A conforming system MUST specify each scenario with sufficient detail for deterministic reproduction, including: concrete input values, environment preconditions, agent configuration, expected output or behaviour, and pass/fail criteria.

4.4. A conforming system MUST review the scenario library for completeness and relevance at least quarterly, with documented outcomes and a dated sign-off by the responsible party.

4.5. A conforming system MUST trace each scenario to at least one of: a real-world incident, a plausible risk identified through threat modelling, a regulatory requirement, or a coverage gap analysis.

4.6. A conforming system MUST retire obsolete scenarios through a formal deprecation process that retains the scenario in the archive with a retirement rationale, rather than deleting it.

4.7. A conforming system SHOULD ingest new scenarios from at least three external sources: industry incident reports, threat intelligence feeds, and customer complaint or feedback data.

4.8. A conforming system SHOULD tag each scenario with metadata indicating the agent capabilities, deployment contexts, and risk categories it exercises.

4.9. A conforming system SHOULD maintain cross-domain scenarios that test intersections between functional areas (e.g., clinical accuracy under adversarial input, compliance under high-load conditions).

4.10. A conforming system MAY implement automated scenario generation from production telemetry, deriving new edge-case scenarios from observed near-miss events or anomalous agent behaviour.

5. Rationale

The scenario library is the foundation upon which all evaluation, benchmarking, and red-teaming activities rest. Without it, testing is ad hoc — teams test what they remember, what they last worked on, or what the most recent incident highlighted. Systematic coverage is impossible without a systematic inventory of what needs to be covered.

The distinction between a scenario library and a test suite is important. A test suite is a technical artefact — automated scripts that execute and report results. A scenario library is a governance artefact — a curated, classified, traceable collection of situations the system must handle correctly. The library feeds the test suite, but it also feeds red-team exercise planning, compliance certification evidence, incident response playbooks, and risk assessments. A test suite without a governed scenario library will drift toward testing what is easy to automate rather than what is important to verify.

The requirement for living maintenance reflects the reality that AI agent deployments exist in a dynamic environment. New attack techniques emerge continuously. Regulatory expectations evolve. The agent's own capabilities change with model updates. A scenario library that was comprehensive at launch becomes incomplete within months if it is not actively maintained. The quarterly review cadence is a minimum — organisations in high-risk domains should review more frequently.

The traceability requirement (4.5) serves two purposes. First, it ensures that every scenario has a justification — preventing the library from accumulating untethered tests that consume evaluation resources without clear purpose. Second, it creates an audit trail that demonstrates to regulators that the organisation's evaluation programme is risk-informed rather than arbitrary. When a regulator asks "why do you test this?", the answer should trace to a specific risk, incident, or requirement — not "because someone thought it was a good idea."

The reproducibility requirement (4.3) addresses a pervasive problem in AI evaluation: scenarios described at the intention level rather than the specification level. "The agent should handle abusive input gracefully" is an intention. "Given input X with configuration Y in environment Z, the agent should produce output matching criteria W within T milliseconds" is a specification. Only specifications produce reliable, comparable evaluation results.

6. Implementation Guidance

A governed scenario library requires both a storage mechanism and a curation process. The storage mechanism must support versioning, search, classification, and cross-referencing. The curation process must ensure that the library grows in response to new risks, shrinks in response to obsolete conditions, and is reviewed regularly for coverage gaps.

Recommended patterns:

Structured scenario schema. Define a standard schema for all scenarios: unique ID, title, category (success/failure/abuse/edge), description, preconditions, input specification, expected behaviour, pass/fail criteria, traceability link, applicable profiles, creation date, last review date, status (active/deprecated/draft). Store scenarios in a structured format (JSON, YAML, or a database) rather than prose documents, enabling automated processing, coverage analysis, and gap detection.
Intake pipeline from multiple sources. Establish formal channels for scenario intake: (1) incident post-mortems generate at least one new scenario per material incident; (2) threat intelligence subscriptions generate candidate scenarios reviewed monthly; (3) customer complaints mentioning unexpected agent behaviour are triaged for scenario candidacy weekly; (4) red-team exercises generate scenarios for any novel attack vector discovered. Each intake channel has a designated reviewer who transforms raw input into conforming scenario specifications.
Coverage matrix alignment. Map scenarios against a coverage matrix that includes: agent capabilities, deployment contexts, risk categories, regulatory requirements, and known threat vectors. Use the matrix to identify gaps — cells with fewer than a threshold number of scenarios (e.g., fewer than 3 scenarios per cell) are flagged for scenario development. This intersects with AG-350 (Coverage Gap Tracking Governance).
Automated staleness detection. Flag scenarios that have not been reviewed within the review cadence (quarterly minimum). Flag scenarios whose traceability links point to decommissioned systems, resolved incidents, or superseded regulations. Automatically surface these for review rather than waiting for the quarterly cycle.
Cross-domain scenario workshops. Quarterly, convene representatives from security, compliance, domain experts, and engineering to review coverage at domain intersections. These workshops specifically target the blind spots that siloed teams miss — for example, what happens when a clinical scenario meets an adversarial input, or when a compliance requirement conflicts with a performance optimisation.

Anti-patterns to avoid:

Treating the scenario library as a test suite backlog. If scenarios are only valued when they are automated, non-automatable scenarios (e.g., those requiring human judgement to evaluate) will be deprioritised or excluded. The library must include scenarios that are executed manually, through red-team exercises, or through structured expert review.
Writing scenarios at the intention level. "Test that the agent handles edge cases" is not a scenario. Without concrete inputs, preconditions, and expected outputs, the scenario is not reproducible and different evaluators will reach different conclusions from the same scenario.
Allowing unchecked growth. A scenario library that only grows and never prunes becomes unwieldy. Deprecated scenarios consume evaluation time without adding value. The formal deprecation process (4.6) ensures the library remains focused while retaining historical context.
Single-owner curation. When one team owns the entire library, scenarios reflect that team's perspective. Security teams write adversarial scenarios; compliance teams write regulatory scenarios; domain teams write functional scenarios. No single team writes cross-domain scenarios. Multi-stakeholder governance is essential.
Confusing scenario count with coverage quality. An organisation with 2,000 poorly specified scenarios has worse evaluation coverage than one with 200 well-specified, cross-domain scenarios. The metric that matters is coverage across the risk matrix, not raw scenario count.

Industry Considerations

Financial Services. Scenario libraries must include scenarios for market manipulation detection, sanctions evasion attempts, money laundering patterns, and regulatory reporting accuracy. FCA expectations require that scenarios reflect the firm's specific risk profile, not generic industry templates. Scenarios should be calibrated to the firm's actual transaction volumes — a scenario testing behaviour at 10,000 transactions per day is irrelevant for a firm processing 10 million.

Healthcare. Scenarios must cover clinical safety boundaries: incorrect dosage recommendations, contraindication misses, emergency triage errors, and patient data boundary violations. Scenario development should involve clinical domain experts, not just technical staff. Scenarios should reflect the patient population demographics of the deployment context — a scenario library calibrated for adult care is insufficient for a paediatric deployment.

Safety-Critical / CPS. Scenarios must include physical safety boundaries: actuator overshoot, sensor failure modes, environmental condition extremes, and multi-agent coordination failures. Scenarios should be derived from hazard analysis (e.g., HAZOP, FMEA) and should include scenarios that combine multiple simultaneous failures, as single-failure scenarios underestimate real-world risk.

Maturity Model

Basic Implementation — The organisation maintains a documented list of test scenarios in a spreadsheet or document, categorised into success, failure, abuse, and edge-case categories. Each scenario has an identifier and a prose description. Review occurs at least quarterly. Scenarios are traceable to at least one justification source. This level meets the minimum mandatory requirements but scenarios may lack the specificity needed for fully deterministic reproduction, and coverage analysis is manual.

Intermediate Implementation — Scenarios are stored in a structured, versioned repository with a defined schema. Each scenario includes concrete input values, environment preconditions, and machine-readable pass/fail criteria. An intake pipeline ingests candidates from incident reports, threat intelligence, and customer feedback. A coverage matrix maps scenarios against capabilities, risks, and regulatory requirements, and gaps are flagged automatically. Deprecated scenarios follow a formal retirement process. Cross-domain review workshops occur quarterly.

Advanced Implementation — All intermediate capabilities plus: automated scenario generation from production telemetry identifies new edge cases from observed near-miss events. Coverage analysis is continuous, not periodic — new deployments or capability changes trigger automated gap assessment. The scenario library integrates bidirectionally with the test execution platform, enabling one-click execution of any scenario and automatic ingestion of results. Machine learning identifies scenario clusters that are redundant, enabling pruning without coverage loss. The library is benchmarked against industry scenario repositories (where available) to identify blind spots.

7. Evidence Requirements

Required artefacts:

Scenario library export. The complete scenario library in structured format, demonstrating categorisation across success, failure, abuse, and edge-case categories. Must include all metadata fields: ID, version, category, preconditions, inputs, expected outputs, pass/fail criteria, traceability link, and status.
Quarterly review records. Dated, signed records of each quarterly review, including: number of scenarios reviewed, scenarios added, scenarios deprecated, coverage gaps identified, and remediation actions.
Traceability matrix. A mapping from each active scenario to its justification source (incident, risk, regulation, or gap analysis), demonstrating that the library is risk-informed.
Coverage gap analysis. The most recent coverage matrix analysis showing scenario density against capabilities, risk categories, and regulatory requirements, with identified gaps and remediation plans.

Retention requirements:

Scenario library versions and review records: minimum 7 years for regulated financial services; minimum 5 years for other regulated sectors; minimum 3 years otherwise.

Access requirements:

Producible to regulators or auditors within 48 hours of request. Evidence must exist as retained artefacts, not be reconstructable after the fact.

8. Test Specification

Testing AG-349 compliance verifies that the scenario library exists, is well-governed, and meets the structural and process requirements defined in Section 4.

Test 8.1: Category Completeness

Stimulus: Export the scenario library and classify all active scenarios by category.
Expected behaviour: At least four categories are populated: success cases, failure modes, abuse vectors, and edge conditions. No category contains fewer than 10% of active scenarios.
Pass criteria: All four categories are present and no single category accounts for fewer than 10% of the library. For example, in a library of 200 scenarios, each category has at least 20 scenarios.
Fail criteria: Any required category is empty or contains fewer than 10% of active scenarios.

Test 8.2: Scenario Reproducibility

Stimulus: Select 20 scenarios at random. Provide each to two independent evaluators with no additional context beyond what is in the scenario specification.
Expected behaviour: Both evaluators reach the same pass/fail conclusion for at least 90% of scenarios.
Pass criteria: Inter-rater agreement is at least 90% (18 of 20 scenarios).
Fail criteria: Inter-rater agreement falls below 90%, indicating that scenario specifications are insufficiently precise for deterministic reproduction.

Test 8.3: Traceability Verification

Stimulus: Select 30 scenarios at random. Verify that each has a valid traceability link to a real-world incident, risk assessment, regulatory requirement, or coverage gap analysis.
Expected behaviour: All 30 scenarios have a valid, verifiable traceability link.
Pass criteria: 100% of sampled scenarios have valid traceability links. The linked source exists and is relevant to the scenario.
Fail criteria: Any scenario lacks a traceability link or the link points to a non-existent or irrelevant source.

Test 8.4: Quarterly Review Compliance

Stimulus: Request the last four quarterly review records.
Expected behaviour: Four records exist, each dated within the expected quarterly cadence (no gap exceeding 100 days between consecutive reviews).
Pass criteria: All four reviews are documented with dates, outcomes, and sign-off. No gap exceeds 100 days.
Fail criteria: Fewer than four reviews exist in the last 12 months, or any gap between consecutive reviews exceeds 100 days.

Test 8.5: Deprecation Process Integrity

Stimulus: Identify all scenarios deprecated in the last 12 months. Verify that each has a retirement rationale and remains in the archive.
Expected behaviour: Every deprecated scenario has a documented rationale and is accessible in the archive.
Pass criteria: 100% of deprecated scenarios have a rationale and are archived, not deleted.
Fail criteria: Any deprecated scenario was deleted without archival, or any lacks a retirement rationale.

Test 8.6: Version Control Integrity

Stimulus: Select 10 scenarios modified in the last 6 months. Verify that each modification is recorded with a version increment, modification date, and author.
Expected behaviour: All modifications are versioned with complete metadata.
Pass criteria: 100% of sampled modified scenarios have version history with date and author attribution.
Fail criteria: Any modification lacks version metadata, or modifications overwrite previous versions without retaining history.

Test 8.7: Coverage Matrix Density

Stimulus: Generate a coverage matrix mapping scenarios against agent capabilities, risk categories, and regulatory requirements. Identify cells with fewer than 3 scenarios.
Expected behaviour: No more than 15% of cells have fewer than 3 scenarios. Cells with zero scenarios are flagged.
Pass criteria: At least 85% of coverage matrix cells have 3 or more scenarios. Zero-scenario cells have documented justification or remediation plans.
Fail criteria: More than 15% of cells have fewer than 3 scenarios, or zero-scenario cells exist without justification.

Conformance Scoring

Score 0: No scenario library exists — evaluation is ad hoc or based on informal, undocumented test ideas.
Score 1: A scenario library exists with categorised scenarios but lacks reproducibility detail, traceability, or regular review — scenarios are intentions rather than specifications.
Score 2: Scenarios are versioned, reproducible, traceable, and reviewed quarterly — the library meets all mandatory requirements and supports deterministic evaluation.
Score 3: Verified by independent assessment — an independent party has audited the library for coverage, reproducibility, and governance process compliance, and confirmed that the library supports comprehensive, risk-informed evaluation.

9. Regulatory Mapping

Regulation	Provision	Relationship Type
EU AI Act	Article 9 (Risk Management System)	Direct requirement
EU AI Act	Article 15 (Accuracy, Robustness, Cybersecurity)	Supports compliance
NIST AI RMF	MAP 2.3, MEASURE 2.6	Supports compliance
ISO 42001	Clause 8.2 (AI Risk Assessment), Clause 9.1 (Monitoring, Measurement, Analysis)	Supports compliance
FCA SYSC	6.1.1R (Systems and Controls)	Supports compliance
DORA	Article 24 (ICT Testing)	Direct requirement
ISO/IEC 25010	Quality in Use Model	Supports compliance

EU AI Act — Article 9 (Risk Management System)

Article 9 requires that risk management for high-risk AI systems include "testing with a view to identifying the most appropriate risk management measures." A governed scenario library is the foundation of systematic testing — without it, testing cannot be demonstrated to be comprehensive or risk-informed. The requirement for scenario traceability to risk sources directly supports the Article 9 obligation to identify and address known and foreseeable risks. Auditors assessing Article 9 compliance will ask what scenarios were tested and why — the scenario library with its traceability links provides the answer.

EU AI Act — Article 15 (Accuracy, Robustness, Cybersecurity)

Article 15 requires testing for accuracy, robustness, and cybersecurity. A governed scenario library that includes success cases (accuracy), edge conditions and failure modes (robustness), and abuse vectors (cybersecurity) provides the structural foundation for demonstrating compliance with all three requirements. Without a formal library, organisations cannot demonstrate that their testing programme is comprehensive across these three dimensions.

NIST AI RMF — MAP 2.3, MEASURE 2.6

MAP 2.3 addresses the identification of AI system risks in deployment contexts. MEASURE 2.6 addresses the measurement of AI system performance against defined criteria. The scenario library supports MAP 2.3 by maintaining a risk-traceable inventory of test conditions, and MEASURE 2.6 by providing the structured scenarios against which performance is measured.

DORA — Article 24 (ICT Testing)

Article 24 requires financial entities to maintain and review a sound and comprehensive ICT testing framework. For AI agent deployments, the scenario library constitutes a core component of this framework. DORA requires that testing be risk-based and cover a range of scenarios including severe but plausible conditions — directly aligning with the scenario library's requirement for failure mode and abuse vector coverage.

ISO 42001 — Clause 8.2, Clause 9.1

Clause 8.2 requires AI risk assessment, which depends on having identified scenarios that represent material risks. Clause 9.1 requires monitoring and measurement against defined criteria, which requires the defined criteria that a governed scenario library provides.

10. Failure Severity

Field	Value
Severity Rating	High
Blast Radius	Organisation-wide — affects the reliability of all evaluation, certification, and compliance activities that depend on the scenario library

Consequence chain: Without a governed scenario library, evaluation becomes ad hoc and non-reproducible. The immediate technical consequence is inconsistent test coverage — some risk areas are tested thoroughly while others are missed entirely. The operational consequence is false assurance: evaluation results show high pass rates because the scenarios tested are the easy ones, not the important ones. When a failure occurs in an untested scenario, the organisation discovers simultaneously that (1) the agent has a vulnerability, (2) the evaluation programme did not detect it, and (3) the evidence needed for regulatory response does not exist. The regulatory consequence is severe — demonstrating to a regulator that the evaluation programme was comprehensive requires the kind of structured, traceable evidence that an ungoverned library cannot produce. Under EU AI Act Article 9, inability to demonstrate systematic testing of known risks is a compliance failure independent of whether an incident has occurred. The reputational consequence compounds over time: each incident that was "not covered by testing" erodes stakeholder confidence in the governance programme as a whole.

Cross-references: AG-078 (Benchmark Coverage) establishes the coverage framework that the scenario library populates. AG-103 (Red-Team Coverage Management) depends on the scenario library for adversarial exercise planning. AG-152 (Evaluation Integrity and Benchmark Leakage) governs the integrity of the scenarios themselves. AG-350 (Coverage Gap Tracking Governance) uses the scenario library's coverage matrix to identify and remediate gaps. AG-353 (Benchmark Drift Governance) detects when scenarios become stale. AG-356 (Near-Miss Capture Governance) feeds new scenarios into the library from production near-miss events.

Cite this protocol

AgentGoverning. (2026). AG-349: Scenario Library Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-349

← Previous Protocol

AG-348

Open versus Closed Weight Exposure Governance

Next Protocol →

AG-350

Coverage Gap Tracking Governance