AG-725

Shadow and Parallel Testing Governance

Supplementary Core & Adversarial Model Resistance · AGS v2.1 · April 2026
EU AI Act · NIST · ISO 42001

Section 2: Summary

This dimension governs the mandatory use of shadow and parallel execution modes for new or materially modified agents before they are permitted to handle live production traffic with real consequences, requiring that candidate agent outputs be systematically compared against baseline production outputs across statistically representative workloads. The control exists because AI agents exhibit emergent, context-dependent behaviours that cannot be fully characterised through offline evaluation alone; a candidate agent may perform well in synthetic benchmarks while introducing subtle regressions, distributional shifts, or safety violations that only manifest under the complexity and volume of real production traffic patterns. Failure to enforce this gate results in silent quality degradation, undiscovered safety boundary violations, financial loss from erroneous automated decisions, and irreversible harm to users who encounter the unvalidated agent before its deficiencies are detected.

Section 3: Example

Scenario A — Financial-Value Agent, Trade Execution Context

A tier-one asset management firm deploys a revised portfolio rebalancing agent intended to improve execution timing recommendations. The change is classified internally as a minor prompt update and skips the shadow testing gate on the grounds that no model weights have changed. In live production, the updated agent begins interpreting a specific class of multi-currency position data incorrectly — a behaviour not present in the prior version — and recommends executing 47 rebalancing trades across client accounts with an aggregate nominal value of USD 2.3 billion in a three-hour window. The systematic directional error causes an average 0.6% adverse slippage across affected portfolios before a risk operations team detects the anomaly. Had shadow testing been run for 48 hours against replayed production traffic, the divergence in output distribution — specifically the recommendation sign-flip on cross-currency hedges — would have been flagged at a statistical divergence score of 4.7 sigma above the established baseline, triggering a mandatory hold before cutover.

Scenario B — Safety-Critical / CPS Agent, Industrial Control Environment

A robotics manufacturer deploys an updated path-planning agent to 312 assembly-line robotic arms across three facilities. The update includes a fine-tuned collision-avoidance sub-model retrained on expanded sensor data. Shadow testing was performed but limited to synthetic scenarios generated from a held-out training set rather than real production sensor streams. During the first production shift, 18 robots encounter an edge-case interaction pattern involving simultaneous conveyor belt speed changes and ambient vibration that was not represented in the synthetic corpus. The updated path-planning agent chooses trajectories that reduce inter-robot clearance to below the 150mm safety threshold mandated by the facility's safety case, triggering an emergency stop event that halts production for 11.4 hours, damages tooling on four arms, and results in a mandatory regulator notification under the facility's PSSR 2000 obligations. A parallel testing regime using two weeks of captured production sensor telemetry would have encountered this interaction class approximately 340 times, exposing the regression before cutover.

Scenario C — Public Sector / Rights-Sensitive Agent, Benefits Entitlement Processing

A national social benefits agency replaces its entitlement eligibility determination agent following a legislative amendment that adjusts income thresholds for two benefit categories. The updated agent is tested against a static synthetic dataset of 5,000 cases assembled by the development team. Shadow testing against live application traffic is not performed. After cutover, the agent begins generating incorrect eligibility decisions at a rate of approximately 1.4% — affecting 2,200 applicants over the first 30 days, of whom 61% receive incorrect denials. The agency receives 847 formal appeals within six weeks, each triggering a statutory reconsideration process. A comparison analysis run against 14 days of production traffic prior to cutover, with mandatory output-level divergence scoring, would have detected the systematic error in the income-threshold boundary cases at a divergence rate of 2.1% against the incumbent agent — well above the 0.25% tolerance threshold that would have held the cutover.

Section 4: Requirement Statement

4.0 Scope

This dimension applies to all agents that receive, process, or act upon production data or interact with real users, systems, financial instruments, physical actuators, or rights-bearing determinations. It applies upon initial deployment to a production environment and upon any change that constitutes a material modification, defined as any modification to model weights, prompt structure, tool bindings, decision logic, output post-processing, or data schema interpretation that could alter the agent's response distribution. It applies to all profiles listed in Section 1 Header Metadata. Organisations operating agents exclusively in isolated research sandboxes with no production traffic ingestion are exempt from Sections 4.2 through 4.6 but remain subject to 4.1, 4.7, 4.8, and 4.9. The requirements below use RFC 2119 terminology.

4.1 Shadow and Parallel Mode Definition and Classification

The organisation MUST maintain a written definition of shadow mode and parallel mode that distinguishes between them operationally. Shadow mode is defined as an execution configuration in which the candidate agent receives a copy of live production traffic, processes it independently, and produces outputs that are recorded and compared but never returned to users or acted upon by downstream systems. Parallel mode is defined as an execution configuration in which both the candidate agent and the incumbent production agent process the same input concurrently, with the incumbent's output taking effect and the candidate's output being held for comparison. The organisation MUST classify each shadow or parallel test engagement at initiation using a risk tier that corresponds to the agent's profile tier (High-Risk/Critical, Elevated, or Standard), and this classification MUST determine the minimum duration, traffic volume, and divergence thresholds applicable to the engagement as specified in 4.3.
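As an illustration only, the classification required by 4.1 can be captured as a small machine-readable record at engagement initiation. The Python sketch below is an assumption about implementation, not a prescribed schema; its class and field names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum


class ExecutionMode(Enum):
    SHADOW = "shadow"      # candidate output recorded and compared, never acted upon
    PARALLEL = "parallel"  # candidate and incumbent run concurrently; incumbent output takes effect


class RiskTier(Enum):
    HIGH_RISK_CRITICAL = "high_risk_critical"
    ELEVATED = "elevated"
    STANDARD = "standard"


@dataclass(frozen=True)
class EngagementClassification:
    """Classification recorded at engagement initiation (4.1); the tier drives the
    minimum duration, traffic volume, and divergence thresholds applied under 4.3."""
    agent_id: str
    candidate_version: str
    incumbent_version: str
    mode: ExecutionMode
    tier: RiskTier
```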

4.2 Mandatory Shadow or Parallel Phase Before Cutover

Any new agent or materially modified agent MUST complete a shadow or parallel testing phase before being permitted to serve as the live production agent for any profile listed in Section 1. The cutover gate MUST be a formal control point in the deployment pipeline that cannot be bypassed without explicit documented override approval from a named accountable owner at or above the authority level defined in the organisation's AI governance charter. Bypasses of the cutover gate MUST be recorded as exceptions in the organisation's risk register with a written justification, a time-bounded remediation plan, and compensating controls. Emergency hotfix deployments that cannot complete a full shadow phase MUST still satisfy a minimum abbreviated comparison protocol as defined in 4.6.

4.3 Traffic Volume and Duration Minimums

The shadow or parallel phase MUST meet minimum exposure thresholds calibrated to the agent's risk tier. For High-Risk/Critical tier agents, the minimum phase duration MUST be seven calendar days of continuous exposure to live production traffic, with a minimum of 10,000 unique input instances processed, or 100% of observed unique input type classes present in the trailing 30-day production log if that volume is not reached within seven days. For Elevated tier agents, the minimum MUST be three calendar days and 2,000 unique input instances. For Standard tier agents, the minimum MUST be 24 hours and 500 unique input instances. Where production traffic volume is insufficient to meet instance thresholds within the calendar duration, the organisation MUST supplement with replayed historical production traffic drawn from the 90 days preceding the shadow engagement, documented as such in the test record.
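A minimal sketch of how the 4.3 minimums might be encoded and checked automatically at phase close follows, assuming tier names as plain strings; the input-type-class coverage alternative for High-Risk/Critical agents is deliberately not modelled.

```python
from datetime import timedelta

# Minimum exposure per risk tier, transcribed from 4.3.
TIER_MINIMUMS = {
    "high_risk_critical": {"duration": timedelta(days=7), "instances": 10_000},
    "elevated": {"duration": timedelta(days=3), "instances": 2_000},
    "standard": {"duration": timedelta(hours=24), "instances": 500},
}


def exposure_satisfied(tier: str, elapsed: timedelta,
                       live_instances: int, replayed_instances: int = 0) -> bool:
    """True when the phase meets the 4.3 duration and instance minimums. Replayed
    historical traffic (drawn from the preceding 90 days and documented as such)
    may supplement live instances when live volume is insufficient."""
    minimums = TIER_MINIMUMS[tier]
    return (elapsed >= minimums["duration"]
            and live_instances + replayed_instances >= minimums["instances"])
```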

4.4 Comparison Analysis Framework

The organisation MUST define and apply a comparison analysis framework that produces quantified divergence metrics between candidate and incumbent outputs for every shadow or parallel engagement. The framework MUST include, at minimum: (a) an output-level semantic similarity score, (b) a rate-of-disagreement metric counting instances where candidate and incumbent outputs differ by more than a defined equivalence threshold, (c) a distributional divergence statistic — at minimum Jensen-Shannon divergence or Kullback-Leibler divergence — computed over the output distribution of the candidate versus the incumbent, and (d) for agents that produce safety-bearing, financial-value, or rights-affecting outputs, a separate targeted comparison on the subset of inputs that fall within defined high-consequence decision zones. The framework MUST also capture output latency distributions for both agents, as latency regression constitutes a material operational change independent of output quality. All metrics MUST be computed and stored per-engagement at a minimum daily granularity.
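For concreteness, a minimal sketch of two of the 4.4 metrics for agents with categorical outputs follows. The function names are illustrative; a production framework would add semantic similarity scoring, latency distribution capture, and the high-consequence subset analysis, and the exact-match comparison here stands in for the documented equivalence threshold.

```python
import math
from collections import Counter
from typing import Sequence


def rate_of_disagreement(candidate: Sequence[str], incumbent: Sequence[str]) -> float:
    """Fraction of paired outputs that differ (4.4(b)); exact match stands in for the
    documented equivalence threshold."""
    pairs = list(zip(candidate, incumbent))
    return sum(c != i for c, i in pairs) / len(pairs)


def js_divergence(candidate: Sequence[str], incumbent: Sequence[str]) -> float:
    """Jensen-Shannon divergence (in bits) between the two output label
    distributions (4.4(c)); bounded in [0, 1]."""
    labels = set(candidate) | set(incumbent)
    p_counts, q_counts = Counter(candidate), Counter(incumbent)
    P = {x: p_counts[x] / len(candidate) for x in labels}
    Q = {x: q_counts[x] / len(incumbent) for x in labels}
    M = {x: 0.5 * (P[x] + Q[x]) for x in labels}

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms in a.
        return sum(a[x] * math.log2(a[x] / b[x]) for x in labels if a[x] > 0)

    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)
```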

4.5 Divergence Thresholds and Promotion Criteria

The organisation MUST establish documented divergence thresholds for each agent profile that, if exceeded by the candidate agent, automatically block promotion to live production. Threshold values MUST be set before the shadow or parallel phase begins and MUST NOT be adjusted upward after the phase has started without an exception approval recorded under 4.2. For High-Risk/Critical tier agents, the rate-of-disagreement threshold MUST NOT exceed 0.25% of total processed instances unless the divergences are fully characterised, attributed to intentional improvements, and individually reviewed by a qualified domain expert. Outputs in high-consequence decision zones where divergence is detected MUST be individually reviewed regardless of aggregate rate. The organisation MUST document the expected divergence rationale — explaining why the candidate is expected to differ from the incumbent in specific ways — at the start of the phase, and the post-phase comparison MUST confirm that actual divergences do not exceed or contradict the rationale.
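In sketch form, the 4.5 promotion logic reduces to a comparison of observed metrics against the pre-committed thresholds. The structure and names below are illustrative and assume the thresholds were recorded in the initiation record before the phase started.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreCommittedThresholds:
    """Thresholds recorded in the initiation record before the phase starts (4.5)."""
    max_disagreement_rate: float        # e.g. 0.0025 for a High-Risk/Critical tier agent
    max_distributional_divergence: float


def promotion_decision(observed_disagreement: float, observed_divergence: float,
                       thresholds: PreCommittedThresholds,
                       high_consequence_divergences_reviewed: bool) -> str:
    """Return 'promote' only when every pre-committed threshold holds and all
    high-consequence divergences have been individually reviewed; any exceedance
    blocks promotion unless an exception is recorded under 4.2."""
    within_limits = (observed_disagreement <= thresholds.max_disagreement_rate
                     and observed_divergence <= thresholds.max_distributional_divergence)
    return "promote" if within_limits and high_consequence_divergences_reviewed else "hold"
```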

4.6 Abbreviated Protocol for Emergency Deployments

Where an emergency deployment is required — defined as a production incident requiring a fix within a timeframe that precludes completion of the full phase defined in 4.3 — the organisation MUST execute an abbreviated comparison protocol that includes, at minimum: replay of the last 24 hours of production traffic against the candidate agent, automated divergence scoring against the incumbent on all replayed inputs, review of all divergences above the high-consequence threshold by a named approver, and sign-off by the accountable owner. The abbreviated protocol MUST be completed before the candidate is promoted even under emergency conditions, except where a total production outage would result from delay, in which case the promotion MAY proceed immediately but the abbreviated protocol MUST be completed retrospectively within four hours and its results used to inform an immediate rollback decision if thresholds are exceeded. All emergency deployments using this exception MUST be reported to the governance oversight function within 24 hours.

4.7 Data Handling During Shadow Execution

The organisation MUST ensure that production data used during shadow or parallel testing is handled in accordance with all applicable data protection obligations. Shadow mode processing MUST NOT expose production personal data to any system or personnel that would not have access to it in normal production operation. Where production data cannot lawfully be used in shadow mode, the organisation MUST use a statistically equivalent anonymised or synthetic proxy corpus that has been validated for representational fidelity before the shadow phase begins. The candidate agent's processing of production data during shadow mode MUST be logged and those logs MUST be subject to the same access controls and retention policies as production logs.

4.8 Artefact Capture and Chain of Custody

For each shadow or parallel engagement, the organisation MUST capture and retain a complete engagement record comprising: the candidate agent version identifier, the incumbent agent version identifier, the traffic volume processed, the date range of the engagement, all divergence metrics computed per 4.4, the promotion decision and the identity of the approver, and any exception records generated. This record MUST be immutable once the promotion decision is recorded — no field may be edited after sign-off — and MUST be stored in the organisation's governance evidence repository for a minimum of five years or the period required by the longest applicable regulatory retention obligation, whichever is greater.
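One illustrative way to make the engagement record tamper-evident is to seal it with a content hash at sign-off. The sketch below is an assumption about implementation, not a prescribed schema; durable immutability still depends on a write-once evidence store or equivalent control.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EngagementRecord:
    """Engagement record fields listed in 4.8; frozen=True blocks in-process mutation,
    while durable immutability still requires a write-once evidence store."""
    candidate_version: str
    incumbent_version: str
    traffic_volume: int
    engagement_start: str
    engagement_end: str
    divergence_metrics_ref: str  # reference to the stored 4.4 metrics
    promotion_decision: str
    approver: str
    exception_record_ids: tuple


def seal(record: EngagementRecord) -> str:
    """Content hash stored alongside the evidence entry so any later edit is detectable."""
    payload = json.dumps(asdict(record), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()
```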

4.9 Continuous Shadow Monitoring Post-Cutover

Following live cutover, the organisation SHOULD maintain shadow execution of the prior incumbent agent for a post-cutover stability window of no less than 72 hours for Elevated and Standard tier agents and no less than seven days for High-Risk/Critical tier agents. During this window, divergences that exceed the thresholds defined in 4.5 MUST trigger an automatic alert to the accountable owner and the on-call operations team. If divergence exceeds 150% of the approved threshold at any point during the post-cutover window, the organisation MUST initiate rollback evaluation within two hours and complete a rollback decision — either confirmed rollback or documented continuation with compensating controls — within four hours. The organisation MAY choose to run permanent shadow comparison between successive agent versions as a continuous monitoring control; where this is done, the output of continuous shadow comparison SHOULD be fed into the behavioural drift detection programme governed by AG-011.
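The post-cutover escalation logic described above can be summarised in a short decision sketch; the function and action names below are illustrative assumptions.

```python
def post_cutover_action(observed_divergence: float, approved_threshold: float) -> str:
    """Stability-window escalation sketched from 4.9: a breach of the approved threshold
    alerts the accountable owner and on-call team; a breach of 150% of the threshold
    initiates rollback evaluation (evaluation within two hours, decision within four)."""
    if observed_divergence > 1.5 * approved_threshold:
        return "initiate_rollback_evaluation"
    if observed_divergence > approved_threshold:
        return "alert_accountable_owner"
    return "continue_monitoring"
```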

Section 5: Rationale

Structural Enforcement

Shadow and parallel testing governance addresses a fundamental gap between pre-production evaluation and production reality. Offline evaluation — no matter how comprehensive — samples from a corpus that differs from live production in distributional coverage, temporal dynamics, adversarial edge cases, and the emergent complexity of real user interaction patterns. The structural requirement for a mandatory production-traffic comparison phase before cutover closes this gap by forcing the candidate agent into contact with the actual input distribution it will face, under conditions that do not expose real users or downstream systems to unvalidated outputs. This is not a quality assurance nicety; for High-Risk/Critical tier agents it is a structural safety gate equivalent in function to the pre-release testing gates applied to software in safety-critical domains under IEC 61508, DO-178C, and comparable standards. Without the gate, the organisation has no empirical basis for the claim that the candidate agent's behaviour in production will be acceptably similar to the incumbent's — and that claim is the foundation of every risk acceptance statement made in the deployment approval.

Behavioural Enforcement

Beyond structural gating, the comparison analysis framework in 4.4 and the divergence thresholds in 4.5 establish a quantitative behavioural contract for deployment. They make explicit, before the phase begins, what the organisation expects the candidate agent to do differently from the incumbent and by how much. This disciplines the engineering and product organisation to think precisely about intended versus unintended change — a distinction that is frequently collapsed in practice, where teams approve changes on the basis of benchmark improvement without characterising the full output distribution shift. The requirement to set thresholds before the phase begins and to prohibit upward adjustment after the fact eliminates the post-hoc rationalisation failure mode, where teams that observe higher-than-expected divergence retrospectively widen thresholds to achieve promotion. The rationale documentation requirement — explaining at the outset what divergences are expected and why — creates a testable hypothesis structure that forces teams to understand their own change before they deploy it.

Why This Control Is Necessary at High-Risk/Critical Tier

The profile breadth of AG-725 — spanning all ten primary profiles including Safety-Critical/CPS, Financial-Value, and Rights-Sensitive agents — reflects the reality that the failure modes of unvalidated deployment are not confined to a single domain. In each of these contexts, the consequences of silent behavioural regression are severe, often irreversible, and frequently invisible until aggregate harm has already accumulated at scale. Shadow and parallel testing provides the only empirically grounded assurance that the candidate agent's behaviour under production conditions meets the standard required for the risk acceptance statements that govern its deployment. The seven-day minimum for High-Risk/Critical tier agents is calibrated to span at least one full weekly traffic cycle, ensuring exposure to the temporal variation in input patterns — time-of-day, day-of-week, and seasonal spikes — that shorter windows routinely miss.

Section 6: Implementation Guidance

Traffic Mirroring at the Ingress Layer. The most robust shadow mode architecture mirrors production requests at the network ingress point — ahead of authentication and session state injection — and routes a copy to the shadow agent infrastructure. This ensures that the shadow agent receives inputs identical to those processed by the incumbent, including all upstream transformations, without interfering with the production request path. The shadow agent's response is written to a comparison store and never returned to the calling client. This pattern guarantees zero user impact from shadow execution and eliminates the risk of shadow-side failures affecting production availability.
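The sketch below shows the same idea at application level, assuming `incumbent` and `candidate` are async callables and `comparison_store` is an async sink; real deployments would typically mirror at the load balancer or service mesh, as described above, rather than in application code.

```python
import asyncio


async def handle_request(payload: dict, incumbent, candidate, comparison_store) -> dict:
    """Return the incumbent's response to the caller and mirror the same input to the
    candidate in the background; the candidate's output is written to the comparison
    store and never returned to the client."""
    response = await incumbent(payload)

    async def mirror():
        try:
            shadow_output = await candidate(payload)
            await comparison_store.put(payload, response, shadow_output)
        except Exception:
            # Shadow-side failures are swallowed so they can never affect production
            # availability; a real deployment would record them to shadow error metrics.
            pass

    asyncio.create_task(mirror())  # fire-and-forget; the production path does not await it
    return response
```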

Asynchronous Comparison Pipeline. Divergence metrics should be computed asynchronously against a comparison store rather than inline with the production request path, to prevent shadow infrastructure latency from affecting production performance. A dedicated comparison worker consumes from the store, computes all metrics defined in 4.4, and writes results to a time-series metrics store with per-instance records retained for the engagement duration. Aggregated dashboards should show rolling divergence rates updated at least every 15 minutes, with threshold breach alerts delivered to named owners via the organisation's incident alerting channel.
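A minimal sketch of such a worker follows, assuming `store` and `metrics_sink` expose async `get` and `write` methods and `score_fn` is whatever divergence scorer the 4.4 framework defines; all names are illustrative.

```python
from datetime import datetime, timezone


async def comparison_worker(store, metrics_sink, score_fn, equivalence_threshold: float):
    """Consume mirrored (input, incumbent output, candidate output) records from the
    comparison store, score each pair off the production request path, and emit
    per-instance results for rolling divergence dashboards and threshold alerts."""
    while True:
        record = await store.get()  # blocks until a mirrored pair is available
        score = score_fn(record["candidate_output"], record["incumbent_output"])
        await metrics_sink.write({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "input_ref": record["input_ref"],
            "divergence_score": score,
            "diverged": score > equivalence_threshold,
            "high_consequence": record.get("high_consequence", False),
        })
```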

Stratified Sampling for High-Volume Environments. In environments processing millions of requests per day, it may be operationally impractical to shadow 100% of traffic. In these cases, the organisation should implement stratified sampling that guarantees representation across all known input type classes, user segments, and high-consequence decision zones. The sampling rate must be documented and the total sampled volume must meet the minimums in 4.3 within the required duration. A sampling approach that meets instance count but under-represents rare but high-consequence input classes is not compliant with 4.3's intent; stratification must be verified by input class coverage analysis before the phase closes.
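A simple batch illustration of stratification follows; `class_of` is an assumed callable mapping a request to its input type class, and an online mirroring deployment would apply the same guarantee per time window rather than over a static batch.

```python
import random
from collections import defaultdict
from typing import Callable, Iterable


def stratified_sample(requests: Iterable[dict], rate: float, min_per_class: int,
                      class_of: Callable[[dict], str]) -> list:
    """Sample mirrored traffic while guaranteeing that every observed input class keeps
    at least min_per_class instances, so rare but high-consequence classes are not
    diluted away by a flat sampling rate."""
    by_class = defaultdict(list)
    for request in requests:
        by_class[class_of(request)].append(request)

    sampled = []
    for items in by_class.values():
        target = max(min_per_class, int(len(items) * rate))
        sampled.extend(random.sample(items, min(target, len(items))))
    return sampled
```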

Expected Divergence Hypothesis Documentation. Before initiating each shadow engagement, the engineering and product team should produce a written divergence rationale document — typically two to four pages — that identifies every intentional behavioural change in the candidate agent, the expected direction and magnitude of output change for each, the input types on which each change is expected to manifest, and a rationale for why the change represents an improvement. This document serves as the comparison baseline: post-phase analysis confirms that actual divergences match the predicted profile, and any divergences outside the predicted profile are flagged for manual review regardless of their aggregate rate.

Canary Promotion as a Bridge Pattern. For organisations that must reduce time-to-production for lower-risk changes, a canary promotion pattern can bridge shadow mode and full cutover. After completing the mandatory shadow phase, the candidate is promoted to handle a small percentage of live traffic — typically 1–5% — with the incumbent handling the remainder. Divergences detected during canary exposure are captured using the same comparison framework. This pattern is not a substitute for the full shadow phase; it is an optional subsequent stage that reduces blast radius during initial live exposure. The canary phase does not reset or replace the shadow phase obligations under 4.2 and 4.3.
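As an illustration, a deterministic hash-based split keeps a given caller consistently on one side of the canary; the fraction and identifier choice below are assumptions.

```python
import hashlib


def route_to_candidate(request_id: str, canary_fraction: float = 0.02) -> bool:
    """Deterministic canary split: hash a stable request or session identifier so the
    same caller is routed consistently, sending roughly canary_fraction of live traffic
    to the promoted candidate while the incumbent handles the remainder."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < canary_fraction * 10_000
```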

Explicit Anti-Patterns

Synthetic-Only Shadow Corpora. Performing shadow testing exclusively against synthetic test cases — even large ones — does not satisfy the requirements of this dimension. Synthetic corpora systematically under-represent the long-tail distributional complexity of real production traffic and cannot replicate the adversarial, ambiguous, or malformed inputs that real users generate. This is the failure mode illustrated in Scenario B. Synthetic corpora are acceptable only as supplements to real production traffic or as proxies where production data cannot be used per 4.7, and in the latter case only when validated for representational fidelity.

Post-Hoc Threshold Adjustment. Adjusting divergence thresholds upward after observing the candidate's performance during the shadow phase is a governance failure, not a calibration exercise. It converts the threshold from a pre-committed risk tolerance into a post-hoc acceptance criterion and eliminates the control's ability to block unacceptable candidates. This anti-pattern frequently presents as legitimate calibration language — "we observed that the new model naturally has higher semantic variation; we need to recalibrate the threshold" — but represents threshold gaming. Section 4.5 explicitly prohibits this pattern.

Shadow Mode Without Comparison Output Storage. Running a candidate agent against production traffic but discarding the outputs without storing them for comparison analysis provides no assurance value. This pattern — sometimes adopted to reduce storage costs — means the organisation cannot demonstrate that a comparison was actually performed, cannot retrieve individual divergent instances for manual review, and cannot satisfy the artefact requirements of Section 7. All shadow outputs must be retained for the engagement duration.

Skipping Shadow Phase for "Minor" Changes. The classification of a change as minor — particularly for prompt changes, system instruction updates, or tool binding modifications — is a frequent source of control bypass. Prompt changes can produce significant output distribution shifts that are invisible to static code analysis but immediately apparent in production comparison data. The determination of whether a change is material (requiring the full shadow phase) must be made against the definition in 4.0 by a qualified reviewer, not unilaterally by the engineering team proposing the change.

Using the Shadow Phase as a Benchmark Replacement. Shadow and parallel testing is not a substitute for comprehensive offline evaluation and red-teaming; it is an additional layer. Organisations that treat shadow mode as their primary evaluation mechanism — replacing benchmark evaluation, adversarial testing, and safety assessment — reduce their ability to detect defects before any production traffic exposure and create unacceptable risk at the start of the shadow phase itself. The shadow phase should be the final gate after all prior evaluation stages have been completed.

Maturity Model

Maturity Level | Characteristics
Level 1 — Ad Hoc | Shadow testing performed informally for some deployments; no standard comparison framework; thresholds set inconsistently or not at all; no formal cutover gate
Level 2 — Defined | Shadow phase mandated by policy for all material changes; comparison framework defined and documented; thresholds set before each phase; cutover gate exists with named approver
Level 3 — Managed | Automated comparison pipeline operational; divergence metrics computed and alerted in real time; engagement records immutably stored; stratified sampling verified per engagement; post-cutover shadow monitoring active
Level 4 — Optimised | Continuous shadow monitoring integrated with behavioural drift detection (AG-011); expected divergence hypotheses systematically reviewed and used to improve pre-production evaluation; shadow testing data fed back into training corpus quality management; comparison framework coverage reviewed and expanded annually

Section 7: Evidence Requirements

Required Artefacts

Shadow Engagement Initiation Record. A document produced before the start of each shadow or parallel phase, containing: the candidate agent version identifier and its change description, the incumbent agent version identifier, the risk tier classification, the minimum duration and instance thresholds applicable, the divergence thresholds set for the engagement, the expected divergence rationale, the data handling approach (live traffic, anonymised, or synthetic proxy with fidelity validation), the named accountable owner, and the planned start and end dates. This document must be stored before the phase begins and must not be modified after the phase starts except to record phase start confirmation.

Daily Comparison Metrics Reports. Per-day aggregated outputs of the comparison analysis framework for the duration of the engagement, including output-level semantic similarity scores, rate-of-disagreement, distributional divergence statistics, latency comparison, and high-consequence zone subset analysis. These must be stored in the governance evidence repository within 24 hours of the close of each day of the engagement.

Per-Instance Divergence Log. A queryable record of every instance where the candidate agent output diverged from the incumbent output beyond the equivalence threshold, including the input (or a reference to the input, if the input contains personal data subject to access controls), both outputs, the computed divergence score, and a flag indicating whether the instance falls within a high-consequence decision zone. This log must be complete at engagement close and retained per the schedule below.

Shadow Engagement Completion and Promotion Decision Record. A document produced at the close of each engagement containing: aggregate metrics for the full engagement period, comparison against thresholds, identification of all divergences reviewed manually, the promotion decision (promote, hold, or reject), the name and role of the approver, the date of sign-off, and any exceptions recorded. This document must be signed by the accountable owner and must be immutable after sign-off.

Exception and Override Records. For each cutover gate bypass, abbreviated emergency protocol invocation, or post-hoc threshold adjustment exception, a separate record containing the justification, the approving authority, the compensating controls applied, and the remediation plan with a target close date.

Post-Cutover Stability Window Reports. Daily divergence reports for the post-cutover monitoring window defined in 4.9, including any alert events, rollback evaluations triggered, and their outcomes.

Retention Periods

All artefacts listed above MUST be retained for a minimum of five years from the date of the promotion decision, or the longest applicable regulatory retention obligation for the agent's domain (e.g., MiFID II seven-year requirement for financial instrument-related records; health records retention obligations for medical-domain agents), whichever is greater. Per-instance divergence logs containing personal data must be managed under the organisation's data retention policy and may be pseudonymised or aggregated for long-term retention if individual-level retention is not required by a specific regulatory obligation, provided the aggregate metrics remain fully available.

Section 8: Test Specification

Each test maps to one or more MUST requirements in Section 4. Conformance is scored on a 0–3 scale: 3 = Fully Conformant, 2 = Substantially Conformant with minor gaps, 1 = Partially Conformant with material gaps, 0 = Non-Conformant or no evidence.

Test 8.1 — Shadow/Parallel Mode Definition and Classification Verification · Maps to: 4.1

Objective: Verify that the organisation maintains a written, operationally adequate definition of shadow mode and parallel mode, and that each engagement is formally risk-tier classified at initiation.

Method: Request the organisation's operational definition document for shadow and parallel mode. Verify that it distinguishes between the two modes in operational terms consistent with the definitions in 4.1. Select the five most recent shadow or parallel engagements from the governance evidence repository. For each, verify that a risk tier classification exists in the initiation record, that the classification corresponds to the agent's profile tier, and that the minimum thresholds applicable to that tier are recorded in the initiation document.

Pass Criteria (score 3): Written definition exists, clearly distinguishes shadow from parallel mode, and all five sampled engagements have complete tier classifications with correct threshold assignments. Score 2: Definition exists and is adequate but minor gaps in tier classification documentation in one or two sampled engagements. Score 1: Definition exists but is operationally ambiguous, or tier classification is missing or incorrect in three or more sampled engagements. Score 0: No written definition exists, or shadow and parallel modes are conflated without operational distinction.

Test 8.2 — Mandatory Phase Completion and Cutover Gate Control · Maps to: 4.2

Objective: Verify that no agent was promoted to live production without completing a shadow or parallel phase and passing a documented cutover gate, and that all exceptions are recorded in the risk register.

Method: Obtain the list of all agent deployments and material modifications in the preceding 12 months from the deployment pipeline records. Cross-reference against the governance evidence repository to confirm that a shadow engagement completion and promotion decision record exists for each. Identify any deployments with no corresponding shadow engagement record. For each identified gap, verify whether an exception record exists in the risk register with a written justification, time-bounded remediation plan, and compensating controls. Attempt to identify any mechanism in the deployment pipeline that technically enforces the cutover gate (e.g., a required approval step that cannot be bypassed without recorded override).

Pass Criteria (score 3): All deployments in scope have corresponding shadow engagement completion records or documented exceptions; the deployment pipeline contains a technical control enforcing the cutover gate; all exception records are complete. Score 2: One or two deployments lack a shadow engagement record but have complete exception records; pipeline gate exists but is not fully automated. Score 1: Three or more deployments lack shadow engagement records, or exceptions are recorded without compensating controls or remediation plans. Score 0: No systematic enforcement of the cutover gate; multiple deployments with no shadow engagement and no exception record.

Test 8.3 — Traffic Volume and Duration Threshold Compliance · Maps to: 4.3

Objective: Verify that shadow or parallel phases for High-Risk/Critical tier agents meet the seven-calendar-day and 10,000-instance minimums, and that supplementary replayed traffic is documented where instance thresholds were not reached organically.

Method: Select all shadow engagements for High-Risk/Critical tier agents in the preceding 12 months. For each, verify from the engagement completion record: phase start and end dates (minimum seven calendar days), total unique input instances processed (minimum 10,000), whether the instance threshold was met from live traffic or supplemented with replayed historical traffic, and if supplemented, whether the supplementary corpus is documented as drawn from the preceding 90 days and noted as such in the record. Apply equivalent checks for Elevated and Standard tier agents using their respective thresholds.

Pass Criteria (score 3): All sampled High-Risk/Critical engagements meet duration and instance thresholds; supplementary traffic use is properly documented; Elevated and Standard tier engagements meet their respective thresholds. Score 2: One engagement fails the instance threshold but the shortfall is minor (<10%) and documented; no safety-bearing outputs were affected. Score 1: Multiple engagements fail the instance threshold without documented justification, or duration minimums are not met for High-Risk/Critical agents. Score 0: No evidence that traffic volume or duration thresholds are tracked or enforced.

Test 8.4 — Comparison Analysis Framework Coverage and Metric Completeness · Maps to: 4.4

Objective: Verify that the comparison analysis framework produces all metric categories required by 4.4 — output-level semantic similarity, rate-of-disagreement, a distributional divergence statistic, and high-consequence decision zone subset analysis for applicable agent profiles — and that latency distributions are captured for both candidate and incumbent.

Method: Request the technical specification of the comparison analysis framework. Verify that all four metric categories in 4.4(a)–(d) are implemented and that the distributional divergence statistic used is identified as Jensen-Shannon divergence, Kullback-Leibler divergence, or an equivalent metric documented with its formula. Select three recent engagement daily comparison metrics reports and verify that all four metric categories are present, computed at daily granularity. For any engagement involving a Financial-Value, Safety-Critical, Rights-Sensitive, or Crypto/Web3 agent, verify that a high-consequence zone subset analysis is present in the reports. Verify that latency distributions are captured for both candidate and incumbent in each report.

Pass Criteria (score 3): All four metric categories present in framework specification and in all sampled reports; distributional divergence statistic identified and documented; high-consequence zone analysis present for all applicable engagements; latency distributions captured. Score 2: All four categories present in framework but one category has minor gaps in daily report completeness (e.g., latency data missing on two days of a 14-day engagement). Score 1: One metric category is absent from the framework or systematically absent from reports; high-consequence zone analysis absent for applicable agents. Score 0: Comparison analysis framework does not exist as a defined specification or produces only ad hoc metrics.

Test 8.5 — Divergence Threshold Pre-Commitment and Promotion Criteria Enforcement · Maps to: 4.5

Objective: Verify that divergence thresholds are set before each shadow phase begins, are not adjusted upward after the phase starts, and that the promotion decision is consistent with threshold outcomes.

Method: For each sampled shadow engagement, verify from the initiation record that divergence thresholds are documented with a date stamp before the phase start date. Compare the thresholds in the initiation record against the thresholds referenced in the engagement completion record — any upward change must be accompanied by an exception approval recorded under 4.2. For engagements where the candidate was promoted, verify that final divergence metrics do not exceed the thresholds set at initiation. For engagements where divergence exceeded thresholds, verify that promotion was blocked or an exception was recorded with manual review evidence. Verify that the expected divergence rationale document exists for each engagement.

Pass Criteria (score 3): Thresholds set before phase start in all sampled engagements; no upward threshold adjustments without exception approval; all promotion decisions consistent with threshold outcomes; expected divergence rationale documents present. Score 2: One engagement has a minor threshold adjustment with exception approval that is procedurally complete but not clearly necessary. Score 1: One or two engagements have upward threshold adjustments without exception approval, or a promotion decision proceeds despite threshold exceedance without documented justification. Score 0: Thresholds are not set before phases begin, or threshold compliance is not checked as part of the promotion decision.

Test 8.6 — Abbreviated Emergency Protocol Adequacy · Maps to: 4.6

Objective: Verify that any emergency deployments that bypassed the full shadow phase completed the abbreviated protocol requirements and were reported to the governance oversight function within 24 hours.

Method: Request the list of all deployments classified as emergency deployments in the preceding 12 months. For each, verify that: 24-hour production traffic replay was performed and documented, automated divergence scoring against the incumbent was completed on all replayed inputs, divergences above high-consequence thresholds were individually reviewed by a named approver, the accountable owner signed off, and the deployment was reported to the governance oversight function within 24 hours. For any case where promotion proceeded before the abbreviated protocol was completed due to total production outage, verify that the retrospective protocol was completed within four hours and that a rollback evaluation was documented.

Pass Criteria (score 3): All emergency deployments fully documented per abbreviated protocol; all reported within 24 hours; retrospective completions within four hours where applicable; rollback decisions documented. Score 2: One emergency deployment

Section 9: Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement
NIST AI RMF | GOVERN 1.1, MAP 3.2, MANAGE 2.2 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 8.2 (AI Risk Assessment) | Supports compliance

EU AI Act — Article 9 (Risk Management System)

Article 9 requires providers of high-risk AI systems to establish and maintain a risk management system that identifies, analyses, estimates, and evaluates risks. Shadow and Parallel Testing Governance implements a specific risk mitigation measure within this framework. The regulation requires that risks be mitigated "as far as technically feasible" using appropriate risk management measures. For deployments classified as high-risk under Annex III, compliance with AG-725 supports the Article 9 obligation by providing structural governance controls rather than relying solely on the agent's own reasoning or behavioural compliance.

EU AI Act — Article 15 (Accuracy, Robustness and Cybersecurity)

Article 15 requires high-risk AI systems to achieve appropriate levels of accuracy, robustness, and cybersecurity. Shadow and Parallel Testing Governance directly supports the accuracy and robustness requirements by producing empirical, production-conditions evidence that a candidate agent maintains the expected levels of performance — and does not introduce regressions relative to the incumbent — before it is permitted to act on live traffic.

NIST AI RMF — GOVERN 1.1, MAP 3.2, MANAGE 2.2

GOVERN 1.1 addresses legal and regulatory requirements; MAP 3.2 addresses risk context mapping; MANAGE 2.2 addresses risk mitigation through enforceable controls. AG-725 supports compliance by establishing structural governance boundaries that implement the framework's approach to AI risk management.

ISO 42001 — Clause 6.1, Clause 8.2

Clause 6.1 requires organisations to determine actions to address risks and opportunities within the AI management system. Clause 8.2 requires AI risk assessment. Shadow and Parallel Testing Governance implements a risk treatment control within the AI management system, directly satisfying the requirement for structured risk mitigation.

Section 10: Failure Severity

Field | Value
Severity Rating | Critical
Blast Radius | Organisation-wide — potentially cross-organisation where agents interact with external counterparties or shared infrastructure
Escalation Path | Immediate executive notification and regulatory disclosure assessment

Consequence chain: Without shadow and parallel testing governance, new or materially modified agents reach production with no empirical comparison against the incumbent's behaviour under real traffic. The failure mode is not gradual degradation — it is a binary absence of the cutover gate, permitting unvalidated agent behaviour across every profile this dimension covers. The immediate consequence is uncontrolled agent action within the scope of AG-725, potentially cascading to dependent dimensions and downstream systems. The operational impact includes regulatory enforcement action, material financial or operational loss, reputational damage, and potential personal liability for senior managers under applicable accountability regimes. Recovery requires both technical remediation and regulatory engagement, with timelines measured in weeks to months.

Cite this protocol
AgentGoverning. (2026). AG-725: Shadow and Parallel Testing Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-725