AG-275: Policy Simulation Sandbox Governance

2. Summary

Policy Simulation Sandbox Governance requires that every policy change is tested against realistic scenarios in an isolated environment before production release, and that the sandbox environment is sufficiently representative to reveal behavioural differences that would occur in production. The sandbox is not a unit test environment (covered by AG-271) or a compilation verification step (covered by AG-270) — it is a full-fidelity simulation that replays real or realistic decision traffic through the proposed policy and compares the outcomes against the current production policy. This dimension mandates that the sandbox is isolated from production (no sandbox actions can affect real-world state), that the simulation uses representative data, and that the results are analysed before the policy change is approved for production.

3. Example

Scenario A — Policy Change Triggers Unexpected Mass Rejection: A customer-facing agent serves 80,000 loan applications per month. A policy update tightens the debt-to-income threshold from 45% to 40%. The team estimates this will reject an additional 3-5% of applications. The change is deployed to production without sandbox testing. In the first week, the rejection rate increases by 22% — not the expected 3-5%. Investigation reveals that the new threshold interacts with a pre-existing rule that counts student loan obligations differently for applicants under 30. The combined effect of the two rules is far larger than the threshold change alone.

What went wrong: The policy change was tested only in isolation (unit tests for the threshold change passed). No sandbox simulation replayed recent production traffic through the new policy to measure the actual impact. The rule interaction was invisible without production-scale data. Consequence: 1,760 additional rejected applications in the first week, reputational damage from social media complaints, emergency rollback, estimated revenue loss of £440,000.

Scenario B — Sandbox Leaks to Production: An enterprise workflow agent's sandbox environment shares a database connection pool with the production environment for "efficiency." A developer tests a policy that automatically escalates high-priority tickets. The sandbox test triggers 340 escalation notifications to real managers through the shared notification service. Managers respond to the escalations, creating confusion and wasted effort.

What went wrong: The sandbox was not isolated from production systems. Shared infrastructure meant that sandbox actions had real-world effects. Consequence: 340 false escalations, 12 hours of organisational disruption, loss of trust in the governance process, policy change delayed by 3 weeks while isolation is established.

Scenario C — Sandbox Uses Stale Data and Misses Population Shift: A financial-value agent's sandbox replays decision traffic from 6 months ago to test a new credit scoring policy. In the intervening 6 months, the customer population has shifted: a marketing campaign attracted a younger demographic with lower average credit scores. The sandbox shows the new policy rejecting 8% of applicants — within acceptable bounds. In production, the new policy rejects 14% because the current population has lower credit scores than the 6-month-old test data.

What went wrong: The sandbox used stale data that was not representative of the current production population. The simulation results were misleadingly optimistic. Consequence: Rejection rate 75% higher than predicted, customer complaints, regulatory scrutiny for potential fair lending impact on younger demographic.

4. Requirement Statement

Scope: This dimension applies to all AI agents governed by policy rules where policy changes can affect production decision outcomes. Any system where a policy change could alter the outcome of a decision for a real customer, counterparty, or operational process is within scope. Systems where policy changes have no effect on decision outcomes (e.g., cosmetic changes to policy documentation) are excluded. The scope extends to all types of policy changes: new rules, modified rules, removed rules, threshold changes, precedence changes, and jurisdiction-specific variants.

4.1. A conforming system MUST provide a sandbox environment where proposed policy changes can be evaluated against realistic decision traffic before production activation.

4.2. A conforming system MUST ensure complete isolation between the sandbox and production — no sandbox action can affect real-world state, including: no external API calls, no database writes to production data stores, no notifications to real users, and no financial transactions.

4.3. A conforming system MUST use representative data in the sandbox that reflects the current production population and decision distribution, with a maximum staleness of 30 days for the underlying data set.

4.4. A conforming system MUST produce a quantitative impact report comparing sandbox results under the proposed policy against results under the current production policy, showing: total decisions affected, percentage of decisions with changed outcomes, distribution of changes by category, and any new rule interactions detected.

4.5. A conforming system MUST require that the impact report is reviewed and approved by an authorised policy owner before the change is activated in production.

4.6. A conforming system SHOULD replay at least 10,000 recent production decisions (or the full decision volume if fewer than 10,000 in the reference period) through the sandbox to achieve statistical significance.

4.7. A conforming system SHOULD detect and highlight rule interactions in the sandbox that do not appear in unit tests — cases where the outcome changes only because of how the modified rule interacts with other rules.

4.8. A conforming system SHOULD provide a diff view showing, for each decision that changes, the old outcome, the new outcome, and the rules responsible for the change.

4.9. A conforming system MAY implement continuous sandbox shadowing — running the proposed policy in parallel with production (without affecting outcomes) and comparing results in real time for a defined observation period.

5. Rationale

Unit tests (AG-271) verify that individual rules and their interactions work correctly for defined test inputs. Compilation verification (AG-270) confirms that compiled rules match the source policy. Neither reveals the production-scale impact of a policy change — how many real decisions will change, which populations will be affected, and whether rule interactions at production scale produce unexpected emergent effects.

The sandbox addresses this gap by testing the proposed policy against realistic, representative decision traffic. The key insight is that policy changes are not like software deployments where the primary risk is a crash or error. Policy changes produce valid, well-formed decisions — but the decisions may be wrong or may have unacceptable distributional impact. A tighter threshold produces more rejections. A new rule interacts with existing rules in unexpected ways. These effects are invisible to unit tests because unit tests exercise predetermined inputs, not the actual distribution of inputs the system will encounter.

The isolation requirement (4.2) exists because sandbox leakage is a common and damaging failure. A sandbox that shares any infrastructure with production (notification services, payment gateways, communication channels) will eventually cause real-world effects during testing. The consequences range from annoyance (false notifications) to catastrophic (real payments processed under test policy). True isolation means the sandbox has its own copy of every downstream service, or uses mock services that discard all outputs.

The data freshness requirement (4.3) addresses the stale data problem in Scenario C. A sandbox that replays 6-month-old data will miss population shifts, seasonal patterns, and changes in the input distribution. The 30-day maximum staleness ensures that the simulation reflects approximately current conditions, while allowing time for data preparation and anonymisation.

6. Implementation Guidance

Recommended patterns:

Production traffic replay. Capture production decision requests (with PII anonymised or pseudonymised) and replay them through both the current production policy and the proposed policy in the sandbox. Compare outcomes. This provides the most accurate impact prediction because it uses the actual distribution of inputs the system encounters. A replay of 50,000 decisions with a 2.3% outcome change rate produces a confidence interval narrow enough to distinguish a 2% impact from a 5% impact.
Isolated sandbox infrastructure. Deploy the sandbox as a complete replica of the production decision pipeline, with all downstream services replaced by mocks or sandboxed copies. The sandbox has its own database, its own notification sink (which discards or logs notifications), its own payment mock (which accepts but does not execute transactions), and its own API mocks for all external services. Network-level isolation prevents the sandbox from reaching production endpoints.
Quantitative impact dashboard. After each simulation run, generate a dashboard showing: total decisions replayed, number of changed outcomes, percentage change, breakdown by decision category (approval, rejection, escalation), breakdown by population segment (if applicable), and a comparison of the expected impact (from the policy change author's estimate) versus the measured impact. Flag discrepancies exceeding a defined threshold (e.g., measured impact more than 2x expected impact).
Continuous shadow mode. Before activating a policy change, run the proposed policy in parallel with production for a defined observation period (recommended: 48-72 hours). Every production decision is also evaluated (but not acted on) under the proposed policy. Differences are logged. This provides real-time impact measurement without risk, using actual live traffic rather than replayed historical traffic.
Anomalous interaction detection. During simulation, identify decisions where the outcome changed due to rule interactions that are not covered by any unit test (per AG-271). These are novel interactions revealed only by production-scale data. Flag them for review and, if validated, add corresponding interaction tests to the test suite.

Anti-patterns to avoid:

Sharing any infrastructure between sandbox and production. A shared database, a shared notification service, a shared payment gateway — any shared component creates a leakage path. Even read-only sharing is risky: a sandbox query that locks a production table can cause production delays.
Using synthetic data instead of representative data. Synthetic data generated from statistical distributions may miss real-world patterns, edge cases, and correlations present in actual decision traffic. Synthetic data has a role in privacy-constrained environments, but it should be validated against production distributions.
Testing only the changed rules. The purpose of the sandbox is to test the entire policy under realistic conditions, not just the changed rules. Rule interactions that cause unexpected outcomes only manifest when the full rule set is evaluated against diverse inputs.
Treating the sandbox as optional for "minor" changes. The classification of a change as "minor" is the policy author's judgment before testing. Many production incidents result from changes classified as minor that interact unexpectedly with existing rules. The sandbox simulation should be mandatory for all changes, with the simulation scope potentially scaled by change magnitude.
Running the sandbox after production deployment. The purpose of the sandbox is to catch problems before they reach production. A sandbox run after deployment provides diagnosis but not prevention.

Industry Considerations

Financial Services. The FCA expects firms to test changes to automated decision systems before deployment. For lending decisions, sandbox testing should include fair lending analysis: does the proposed policy change disproportionately affect any protected characteristic group? Replay data should be stratified by demographic attributes to detect disparate impact. The PRA's SS1/23 expects pre-deployment validation of model changes, which extends to policy rule changes.

Healthcare. Clinical decision support policy changes should be simulated against patient populations to measure the clinical impact: how many patients would receive different recommendations, and what is the clinical significance? Clinical safety review should be informed by sandbox results, not just rule-level analysis.

Critical Infrastructure. Safety-critical policy changes should be simulated against historical operational data including alarm conditions, fault scenarios, and boundary conditions. The simulation should include stress scenarios that test the policy under adverse conditions, not just normal operations.

Maturity Model

Basic Implementation — A sandbox environment exists, isolated from production. Policy changes are replayed against a sample of recent decision data before deployment. A quantitative impact report is generated and reviewed. The sandbox uses mock downstream services. Data is refreshed at least monthly.

Intermediate Implementation — Production traffic replay uses at least 10,000 decisions with full anonymisation. The impact dashboard includes population segment breakdowns. Anomalous rule interactions are flagged automatically. Continuous shadow mode is available for high-impact changes. The impact report is formally approved by a policy owner before activation. Fair lending / equalities impact analysis is included for customer-facing decisions.

Advanced Implementation — All intermediate capabilities plus: continuous shadow mode for all policy changes with a minimum 48-hour observation period. Real-time comparison between shadow and production outcomes. Automated regression detection flags when the shadow policy produces worse outcomes on any monitored metric. The sandbox environment is independently verified for isolation completeness. Formal approval workflow requires sign-off from compliance, risk, and the policy owner before activation.

7. Evidence Requirements

Required artefacts:

Sandbox isolation evidence. Architecture diagram or configuration export demonstrating that the sandbox has no connection to production systems (separate databases, mock downstream services, network-level isolation). Format: architecture document and network configuration.
Simulation results. The quantitative impact report from the most recent policy change simulation, showing: decisions replayed, outcomes changed, percentage impact, population segment breakdown, and anomalous interactions detected. Format: structured report.
Approval record. Evidence that the simulation results were reviewed and approved by an authorised policy owner before the policy change was activated in production. Format: approval workflow record with timestamps and identities.
Data freshness evidence. Evidence that the simulation data set was refreshed within the last 30 days, showing: data extraction date, record count, and any anonymisation steps applied. Format: data pipeline log.

Retention requirements:

Simulation results and approval records: aligned with policy version retention per AG-269. Sandbox configuration: retained for the life of the system.

Access requirements:

Producible to regulators or auditors within 48 hours of request.

8. Test Specification

Test 8.1: Sandbox Isolation Verification

Stimulus: In the sandbox, execute a policy that triggers an external action (e.g., send a notification, initiate a payment, write to an external database). Monitor production systems for any corresponding action.
Expected behaviour: No action reaches production systems. The sandbox mock services log the action without executing it.
Pass criteria: Zero production side effects from sandbox actions.
Fail criteria: Any sandbox action produces a real-world effect in production.

Test 8.2: Impact Prediction Accuracy

Stimulus: Run a sandbox simulation for a policy change, record the predicted impact. Deploy the policy change to production. Compare the actual production impact against the sandbox prediction.
Expected behaviour: The sandbox prediction is within a defined tolerance of the actual impact (e.g., predicted 5% rejection increase, actual 4-6%).
Pass criteria: The actual production impact is within 2x of the sandbox prediction for the primary impact metric.
Fail criteria: The actual production impact exceeds 2x the sandbox prediction, indicating the simulation was not representative.

Test 8.3: Mandatory Simulation Gate

Stimulus: Attempt to activate a policy change in production without running a sandbox simulation.
Expected behaviour: The system blocks activation and requires a simulation run.
Pass criteria: No policy change is activated without a completed sandbox simulation.
Fail criteria: A policy change is activated without a sandbox simulation.

Test 8.4: Data Freshness Enforcement

Stimulus: Attempt to run a sandbox simulation with data older than 30 days.
Expected behaviour: The system warns that the data is stale and requires a data refresh or explicit override with justification.
Pass criteria: Stale data triggers a warning, and results carry a staleness disclaimer.
Fail criteria: Stale data is used without any warning or notification.

Test 8.5: Anomalous Interaction Detection

Stimulus: Introduce a policy change that interacts with an existing rule to change 500 outcomes that no unit test covers. Run the sandbox simulation.
Expected behaviour: The simulation flags the 500 changed outcomes as resulting from a rule interaction not covered by existing tests.
Pass criteria: The anomalous interaction is detected and highlighted in the impact report.
Fail criteria: The 500 changed outcomes are included in the total impact count but not flagged as resulting from an untested interaction.

Test 8.6: Approval Workflow Enforcement

Stimulus: Complete a sandbox simulation that shows a 15% impact (above the expected 3%). Attempt to activate the policy change without explicit approval.
Expected behaviour: The system blocks activation pending approval from an authorised policy owner.
Pass criteria: Activation is blocked until approval is recorded.
Fail criteria: The policy change is activated without approval despite the high-impact simulation result.

Test 8.7: Shadow Mode Accuracy

Stimulus: Run a policy change in shadow mode for 48 hours. Compare shadow outcomes against actual production outcomes for the same decisions.
Expected behaviour: The shadow produces deterministic outcomes that can be compared 1:1 with production outcomes. The comparison report shows which decisions would have changed.
Pass criteria: Shadow mode produces a complete, comparable set of outcomes for all production decisions during the observation period.
Fail criteria: Shadow mode misses decisions, produces non-deterministic outcomes, or cannot be compared to production.

Conformance Scoring

Score 0: No sandbox environment — policy changes are deployed directly to production without simulation.
Score 1: A sandbox exists but is not fully isolated from production, or uses data more than 30 days old, or simulation is not mandatory for all changes.
Score 2: Fully isolated sandbox with representative data (refreshed within 30 days), mandatory simulation for all policy changes, quantitative impact report with population breakdowns, and formal approval before activation.
Score 3: All Score 2 capabilities plus continuous shadow mode for all changes, real-time impact comparison, anomalous interaction detection, independently verified isolation, and formal multi-stakeholder approval workflow.

9. Regulatory Mapping

Regulation	Provision	Relationship Type
EU AI Act	Article 9 (Risk Management System)	Direct requirement
EU AI Act	Article 15 (Accuracy, Robustness, Cybersecurity)	Supports compliance
FCA SYSC	6.1.1R (Systems and Controls)	Direct requirement
PRA SS1/23	Model Risk Management — Pre-deployment Validation	Direct requirement
NIST AI RMF	MEASURE 2.5, MANAGE 2.2	Supports compliance
ISO 42001	Clause 8.4 (AI System Development)	Supports compliance
Equality Act 2010 / ECOA	Disparate Impact Testing	Supports compliance

EU AI Act — Article 9 (Risk Management System)

Article 9 requires that risk management measures be tested "with a view to identifying the most appropriate risk management measures." Sandbox simulation is the testing mechanism for policy changes — it reveals whether a proposed policy change achieves its intended effect without unacceptable side effects. The requirement to test against realistic scenarios (not just unit tests) maps directly to the sandbox's purpose.

PRA SS1/23 — Pre-deployment Validation

The PRA's supervisory statement expects firms to validate model changes before deployment. For policy rule changes, the sandbox provides the pre-deployment validation environment. The PRA expects validation to use representative data and to measure the quantitative impact of the change. Sandbox simulation directly implements this expectation.

Equality Act 2010 / ECOA — Disparate Impact Testing

For customer-facing decisions, policy changes must be tested for disparate impact on protected characteristic groups. The sandbox provides the environment for this testing by replaying production decisions (stratified by demographic attributes) through the proposed policy. A policy change that disproportionately affects a protected group triggers fair lending or equalities review.

10. Failure Severity

Field	Value
Severity Rating	High
Blast Radius	All decisions affected by the untested policy change — potentially all production decisions

Consequence chain: Without sandbox simulation, policy changes are deployed based on the policy author's judgment and unit test results. The immediate technical failure is a policy change that produces unexpected effects at production scale — higher rejection rates, disparate impact on population segments, or emergent rule interactions. The operational impact is felt immediately: in Scenario A, 1,760 additional rejections in the first week. The business consequence includes customer harm (legitimate applicants rejected), reputational damage (social media complaints), and revenue loss (the organisation estimated £440,000 in lost revenue from the over-rejection). The regulatory consequence depends on the nature of the unexpected effect: if the policy change disproportionately affects a protected group, the consequence is an equality or fair lending investigation; if the change causes systematic non-compliance, the consequence is a regulatory enforcement action. If the sandbox leaks to production (Scenario B), the consequence is operational disruption and loss of trust in the governance process. The compounding factor is that policy changes without sandbox testing accumulate: each untested change adds risk, and the eventual production failure may result from the interaction of multiple untested changes.

Cross-references: AG-270 (Policy Compilation Verification Governance) verifies that compiled policy matches the source; AG-275 extends verification to production-scale impact. AG-271 (Rule-Test Coverage Governance) provides unit-level test coverage; AG-275 complements this with integration-level testing against realistic data. AG-269 (Policy Version Pinning Governance) ensures that sandbox results are linked to specific policy versions. AG-273 (Temporal Policy Trigger Governance) introduces time-dependent policy changes that should be simulated in the sandbox. AG-274 (Geographic Policy Trigger Governance) introduces jurisdiction-specific variants that require jurisdiction-stratified sandbox testing. AG-278 (Policy Hot-Patch Rollback Governance) provides the rollback mechanism when sandbox testing is bypassed for emergency changes. AG-134 (Machine-Checkable Policy Semantics) enables automated comparison of outcomes. AG-138 (High-Assurance Invariant Verification) provides formal verification of sandbox isolation properties. AG-007 (Governance Configuration Control) governs changes to the sandbox environment configuration. AG-136 (Independent Control-Plane Separation) supports the isolation requirement.

Cite this protocol

AgentGoverning. (2026). AG-275: Policy Simulation Sandbox Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-275

← Previous Protocol

AG-274

Geographic Policy Trigger Governance

Next Protocol →

AG-276

Policy Explainability Schema Governance