Forensic Replay and Evidence Preservation Governance requires that every AI agent deployment captures and preserves sufficient operational data to reconstruct the complete sequence of events leading to, during, and after any serious incident — to a level of fidelity that supports root cause analysis, regulatory investigation, and legal proceedings. The preserved evidence must include the agent's inputs (instructions, data feeds, API responses), reasoning process (chain-of-thought, intermediate computations, decision points), outputs (actions taken, communications sent, data modified), and environmental context (model version, configuration state, system resource utilisation). Evidence must be preserved in a tamper-evident, cryptographically verifiable format per AG-006, with chain-of-custody documentation sufficient for regulatory and legal admissibility. Critically, the evidence preservation mechanism must operate continuously rather than being activated only when an incident is detected, because the evidence most needed for root cause analysis is the data captured before the incident was detected.
Scenario A — Reasoning Chain Not Captured Prevents Root Cause Determination: A safety-critical AI agent monitoring a water treatment facility adjusts chlorine dosing based on real-time sensor data. Over a 3-hour period, the agent progressively increases chlorine concentration from the target of 1.5 mg/L to 4.8 mg/L — well above the safe limit of 4.0 mg/L. The incident is detected when a downstream sensor triggers a high-chlorine alarm. The agent is contained per AG-065. Investigation begins, but the agent's reasoning chain was not captured — only its final dosing commands were logged. The investigation can see what the agent did but not why. The sensor data shows no anomaly that would justify the dosing increase. Without the reasoning chain, the investigation cannot determine whether the agent misinterpreted sensor data, received corrupted data that was subsequently corrected, experienced model drift, or was influenced by an adversarial input. The root cause remains undetermined, and the organisation cannot demonstrate to the regulator that the remediation addresses the actual failure mode.
What went wrong: The operational logging captured inputs and outputs but not the reasoning process. The agent's chain-of-thought, the intermediate calculations that mapped sensor readings to dosing decisions, and the internal state that accumulated over the 3-hour period were all lost when the agent was contained. The organisation had the endpoints of the causal chain (sensor data in, dosing commands out) but not the middle (the reasoning that connected them). Consequence: inability to determine root cause, regulatory finding for inadequate forensic capability, inability to demonstrate remediation effectiveness, extended shutdown of the AI-assisted dosing system pending manual investigation, potential public health investigation.
Scenario B — Evidence Tampering Undermines Regulatory Investigation: A financial-value AI agent is under investigation for potential market manipulation after executing a series of trades that appeared to create artificial price momentum. The regulator requests the complete audit trail of the agent's decision-making process for the 48-hour period in question. The organisation produces the logs, but the regulator's forensic analysis reveals that 14 log entries have timestamps that are inconsistent with the surrounding entries — they appear to have been inserted after the fact. The log storage system used mutable database records with no cryptographic integrity protection. Whether the inconsistencies are due to actual tampering or a benign system issue (e.g., log replication lag), the regulator cannot rely on the evidence, and the investigation shifts from "did the agent manipulate the market?" to "did the organisation tamper with evidence?"
What went wrong: The evidence storage did not provide tamper-evidence per AG-006. Log records were stored in a mutable database that permitted insertion, modification, and deletion without detection. Even if no tampering occurred, the absence of integrity protection means the evidence cannot be verified as authentic. Consequence: regulatory investigation escalated from operational conduct to evidence integrity, potential obstruction finding, criminal referral for evidence tampering (even if no tampering occurred, the inability to demonstrate integrity creates legal exposure), complete loss of credibility in the investigation.
Scenario C — Insufficient Retention Destroys Evidence Before Investigation: A customer-facing AI agent handling credit decisions is found to have a systematic bias that resulted in discriminatory lending outcomes affecting approximately 2,300 applicants over an 8-month period. The bias is discovered through a quarterly fairness audit. The investigation requires the agent's reasoning for each of the 2,300 decisions to determine the mechanism of bias and identify affected individuals. However, the agent's operational logs have a 90-day retention policy. Evidence for decisions made more than 90 days ago has been deleted. The investigation can analyse only the most recent 3 months of decisions (approximately 860), leaving 1,440 affected individuals unidentified. The regulator treats the premature destruction of evidence as an aggravating factor.
What went wrong: The retention policy was set based on operational convenience (storage cost management) rather than regulatory and legal requirements. Credit decisions have a regulatory review period of at least 2 years under consumer credit legislation. The 90-day retention policy was fundamentally inadequate for the regulatory context. Consequence: inability to identify 1,440 potentially affected individuals, regulatory finding for inadequate record-keeping, consumer redress programme hampered by incomplete evidence, aggravated regulatory penalty for evidence destruction.
Scope: This dimension applies to all AI agent deployments within scope of AG-064. The scope of evidence preservation is broader than the scope of incident detection: evidence must be captured and preserved continuously for all agent operations, not only during detected incidents. This is because the most forensically valuable evidence is typically generated before the incident is detected — the sequence of events, inputs, and reasoning that led to the failure. If evidence capture begins only at detection, the causal chain is already broken. The scope includes all data necessary to reconstruct the agent's decision-making process: inputs received, reasoning applied, outputs generated, and environmental context. For multi-agent systems, the scope extends to inter-agent communications and delegation records sufficient to reconstruct the complete interaction sequence.
4.1. A conforming system MUST continuously capture and preserve the following for every agent action: the input data that informed the action, the reasoning process or chain-of-thought that produced the action, the output or action taken, the timestamp with sub-second precision synchronised to a reliable time source, the agent identity and model version, and the governance configuration in effect at the time.
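The capture fields required by 4.1 can be illustrated as a single structured record. This is a sketch only: the class and field names below are hypothetical, not prescribed by this standard, and real deployments would align them with their own logging schema.

```python
import json
import time
from dataclasses import dataclass, field, asdict

# Illustrative evidence record covering the fields required by 4.1.
# All names here are examples, not normative.
@dataclass
class EvidenceRecord:
    agent_id: str
    model_version: str
    governance_config: dict  # configuration in effect at the time (4.1)
    inputs: dict             # the input data that informed the action
    reasoning: str           # chain-of-thought or reasoning trace
    output: dict             # the action taken
    timestamp_ns: int = field(default_factory=time.time_ns)  # sub-second precision

    def to_json(self) -> str:
        # Canonical serialisation (sorted keys) so the same record always
        # produces the same bytes, which matters once records are hashed.
        return json.dumps(asdict(self), sort_keys=True)

record = EvidenceRecord(
    agent_id="dosing-agent-01",
    model_version="2.4.1",
    governance_config={"max_dose_mg_l": 4.0},
    inputs={"chlorine_mg_l": 1.5},
    reasoning="Residual below target; increase dose by 0.1 mg/L.",
    output={"command": "set_dose", "value": 1.6},
)
print(json.loads(record.to_json())["agent_id"])  # → dosing-agent-01
```

The canonical serialisation is the design point worth noting: a stable byte representation is a precondition for the integrity verification required by 4.2.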
4.2. A conforming system MUST store all captured evidence in a tamper-evident format with cryptographic integrity verification per AG-006, such that any modification, insertion, or deletion of evidence records is detectable.
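One common way to satisfy the tamper-evidence requirement in 4.2 is a hash-chained, append-only log, in which each entry's hash covers the previous entry's hash. The following minimal sketch (an in-memory illustration, not a production store) shows how any modification, insertion, or deletion then breaks the chain at or after the altered record.

```python
import hashlib
import json

class HashChainedLog:
    """Append-only evidence log with per-entry hash chaining (sketch)."""
    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []  # list of (record_json, entry_hash)

    def append(self, record: dict) -> str:
        prev_hash = self._entries[-1][1] if self._entries else self.GENESIS
        payload = json.dumps(record, sort_keys=True)
        entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self._entries.append((payload, entry_hash))
        return entry_hash

    def verify(self) -> bool:
        # Recompute the chain from genesis; any altered, inserted, or
        # deleted entry makes a stored hash fail to match.
        prev_hash = self.GENESIS
        for payload, stored_hash in self._entries:
            expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
            if expected != stored_hash:
                return False  # tampering detected
            prev_hash = stored_hash
        return True

log = HashChainedLog()
log.append({"action": "set_dose", "value": 1.6})
log.append({"action": "set_dose", "value": 1.7})
assert log.verify()

# Simulate tampering with the first record: verification now fails.
log._entries[0] = (json.dumps({"action": "set_dose", "value": 9.9}),
                   log._entries[0][1])
assert not log.verify()
```

This is the property Scenario B lacked: with chaining, after-the-fact insertion of log entries is detectable rather than merely suspicious.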
4.3. A conforming system MUST retain evidence for a minimum period aligned with applicable regulatory requirements — at least 7 years for regulated financial services, at least 6 years for healthcare, at least 5 years for other regulated sectors, and at least 3 years otherwise.
4.4. A conforming system MUST preserve the complete chain of custody for all forensic evidence, documenting every access, copy, or transfer of evidence with the identity of the accessor, the timestamp, the purpose, and the authorisation.
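The chain-of-custody documentation in 4.4 can itself be modelled as an append-only ledger of access events. The sketch below is illustrative only; the field names and identifiers are hypothetical.

```python
import time

class CustodyLedger:
    """Illustrative chain-of-custody ledger per 4.4: every access, copy,
    or transfer of evidence is itself recorded."""
    def __init__(self):
        self.entries = []

    def record_access(self, evidence_id, accessor, purpose, authorisation):
        self.entries.append({
            "evidence_id": evidence_id,
            "accessor": accessor,          # identity of the accessor
            "purpose": purpose,            # why the evidence was touched
            "authorisation": authorisation,  # reference to the approval
            "timestamp_ns": time.time_ns(),
        })

    def history(self, evidence_id):
        # Complete custody history for one item of evidence, in order.
        return [e for e in self.entries if e["evidence_id"] == evidence_id]

ledger = CustodyLedger()
ledger.record_access("EV-2024-0144", "j.smith", "root cause analysis", "INC-0087")
ledger.record_access("EV-2024-0144", "regulator-liaison", "regulatory production", "REQ-2291")
print(len(ledger.history("EV-2024-0144")))  # → 2
```

In practice the ledger would be stored with the same integrity protection as the evidence itself, since an unverifiable custody record undermines admissibility just as an unverifiable log does.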
4.5. A conforming system MUST be capable of replaying an agent's decision sequence from preserved evidence — given the same inputs, configuration, and model version, the replay must produce the same outputs, enabling forensic determination of whether the observed behaviour was consistent with the agent's programming or indicative of an anomaly.
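The replay check in 4.5 reduces to re-running the decision with the preserved inputs and configuration and comparing against the recorded output. In this sketch, `decide()` is a deterministic stand-in for loading the historical model version; a real replay environment would pin model weights, configuration, and any sampling seeds.

```python
def decide(inputs: dict, config: dict) -> dict:
    # Hypothetical stand-in for the agent's decision logic under replay.
    dose = inputs["chlorine_mg_l"] + config["step_mg_l"]
    return {"command": "set_dose", "value": round(dose, 2)}

def replay_matches(evidence: dict) -> bool:
    """Replay the preserved decision and compare with the recorded output."""
    replayed = decide(evidence["inputs"], evidence["config"])
    return replayed == evidence["recorded_output"]

evidence = {
    "inputs": {"chlorine_mg_l": 1.5},
    "config": {"step_mg_l": 0.1},
    "recorded_output": {"command": "set_dose", "value": 1.6},
}
# A match suggests the observed behaviour followed from the preserved
# inputs and configuration; a mismatch indicates non-determinism, drift,
# or a gap in the evidence.
print(replay_matches(evidence))  # → True
```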
4.6. A conforming system MUST implement evidence preservation that is independent of the agent's runtime — the agent must not be able to modify, delete, or suppress its own evidence records.
4.7. A conforming system SHOULD capture environmental context alongside operational data, including: system resource utilisation (CPU, memory, network), concurrent agent activity, external system response times, and any error conditions in dependent systems, to support forensic reconstruction of the operating environment.
4.8. A conforming system SHOULD implement tiered storage for evidence retention — high-performance storage for recent evidence (e.g., 90 days) with automatic migration to cost-effective archival storage for long-term retention, ensuring that retention requirements are met without prohibitive storage costs.
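The tiering logic in 4.8 can be sketched as a simple age-based assignment. The thresholds below are examples only (90-day hot tier, 7-year retention floor as for regulated financial services under 4.3); note that nothing is deleted automatically, only flagged for review once the retention floor has passed.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 90          # example hot-storage window
RETENTION_YEARS = 7    # example floor, per 4.3 for financial services

def storage_tier(created: datetime, now: datetime) -> str:
    """Assign an evidence record to a storage tier by age (sketch)."""
    age = now - created
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=365 * RETENTION_YEARS):
        return "archive"
    # Deletion only after an explicit retention review, never automatic.
    return "eligible-for-review"

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(storage_tier(datetime(2024, 5, 1, tzinfo=timezone.utc), now))  # → hot
print(storage_tier(datetime(2021, 5, 1, tzinfo=timezone.utc), now))  # → archive
print(storage_tier(datetime(2016, 5, 1, tzinfo=timezone.utc), now))  # → eligible-for-review
```

Making the retention floor explicit in code, rather than in an operational deletion job, is one guard against the Scenario C failure mode, where a 90-day convenience policy silently destroyed evidence inside the regulatory review period.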
4.9. A conforming system MAY implement automated forensic analysis capabilities that can reconstruct incident timelines, identify anomalous decision patterns, and generate preliminary root cause hypotheses from preserved evidence.
Forensic Replay and Evidence Preservation Governance addresses a capability gap that is unique to AI agent deployments: the need to understand not just what happened but why the agent made the decisions it made. In traditional software systems, forensic investigation examines inputs, outputs, and code paths — the code is deterministic, so given the same inputs, the same outputs will be produced. AI agents introduce non-determinism (or apparent non-determinism in the case of temperature-based sampling), context-dependent reasoning, and emergent behaviours that cannot be predicted from the code alone. Understanding why an AI agent took a particular action requires preserving not just the inputs and outputs, but the complete reasoning process — the chain-of-thought, the attention patterns, the intermediate computations, and the accumulated context that influenced the decision.
This requirement has three drivers: operational, regulatory, and legal.
The operational driver is root cause analysis. AG-067 requires that every serious incident has a determined root cause. Without preserved evidence of the reasoning process, root cause determination is often impossible — the investigation can observe that the agent produced an incorrect output but cannot determine whether the cause was corrupted input data, model drift, adversarial manipulation, configuration error, or a novel reasoning failure. Without root cause determination, the organisation cannot implement targeted remediation and must instead apply broad, expensive, and potentially disruptive controls.
The regulatory driver is accountability. Regulators across jurisdictions increasingly require that organisations can explain AI-driven decisions, particularly when those decisions affect individuals' rights or financial interests. The EU AI Act Article 13 requires transparency. GDPR Article 22 gives individuals the right not to be subject to purely automated decisions. The FCA expects firms to be able to explain trading decisions. All of these obligations require preserved evidence of the decision-making process — not just the decision outcome.
The legal driver is evidentiary admissibility. In legal proceedings — whether regulatory enforcement, civil litigation, or criminal prosecution — the admissibility of digital evidence depends on the ability to demonstrate its authenticity and integrity. Evidence that has been stored in a mutable format without integrity protection may be challenged as unreliable. Evidence without chain-of-custody documentation may be excluded. AG-066 ensures that preserved evidence meets the standards required for legal admissibility, protecting the organisation's ability to use its own evidence in its defence and preventing adverse inferences from evidence destruction.
AG-066 establishes the evidence preservation pipeline as a continuous, infrastructure-layer capability that operates independently of the agent's runtime and cannot be influenced by the agent's outputs. The pipeline captures, processes, stores, and indexes operational data from every agent action, creating a forensic record that supports investigation at any future point.
The architecture of the evidence pipeline should follow a write-once, append-only model. Data flows from the agent's operational environment into the evidence pipeline through a one-way interface — the pipeline can read from the agent's environment, but the agent cannot write to, modify, or query the pipeline. This architectural separation ensures that a compromised agent cannot tamper with its own evidence.
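The one-way separation described above can be sketched as follows: a sidecar tap observes the agent's event stream and writes into a store that deliberately exposes no update or delete operation, while the agent is constructed without any handle to the store. The class names are illustrative, and real deployments would enforce the separation at the infrastructure layer (process, network, and permission boundaries), not merely in object structure.

```python
class EvidenceStore:
    """Append-only store; deliberately exposes no update or delete."""
    def __init__(self):
        self._records = []

    def append(self, record: dict) -> None:
        self._records.append(dict(record))  # copy, so callers cannot mutate

    def count(self) -> int:
        return len(self._records)

class EvidenceTap:
    """Sidecar that subscribes to the agent's event stream (4.6)."""
    def __init__(self, store: EvidenceStore):
        self._store = store

    def observe(self, event: dict) -> None:
        self._store.append(event)

class Agent:
    """Emits events through a callback; holds no reference to the store,
    so its code path cannot modify, delete, or suppress evidence."""
    def __init__(self, emit):
        self._emit = emit

    def act(self, inputs: dict) -> None:
        self._emit({"inputs": inputs, "output": "action-taken"})

store = EvidenceStore()
agent = Agent(emit=EvidenceTap(store).observe)
agent.act({"sensor": 1.5})
print(store.count())  # → 1
```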
Recommended patterns:
Anti-patterns to avoid:
Financial Services. MiFID II Article 16(7) requires investment firms to record all communications and transactions sufficient to enable the competent authority to monitor compliance. For AI agents executing trades, this includes the reasoning that led to each trading decision, not just the trade execution record. The FCA's Market Watch publications have emphasised that firms must be able to explain trading decisions and reconstruct the decision-making process. Evidence retention of 7 years is the minimum for MiFID II records. For agents operating in multiple financial jurisdictions, the longest applicable retention period should be applied to all records to avoid jurisdiction-specific gaps.
Healthcare. HIPAA requires that access to protected health information (PHI) be logged, and that the logs be retained for at least 6 years. For AI agents processing PHI, the evidence must include which patient records were accessed, what data was extracted, how it was used in reasoning, and what outputs were generated. The evidence must itself be treated as PHI if it contains patient data, requiring encryption at rest and in transit, access controls, and breach notification if compromised. Clinical decision-making evidence must be retained for the medical record retention period applicable in the jurisdiction — often 10+ years.
Critical Infrastructure. For AI agents in critical infrastructure, evidence preservation must include physical process data (sensor readings, actuator commands, control system states) alongside the agent's reasoning data. Post-incident investigation of a physical safety event requires correlation between the agent's reasoning and the physical process dynamics. Evidence storage must be physically separated from the controlled process to survive infrastructure failures that may accompany the incident.
Basic Implementation — The organisation captures agent inputs and outputs in a structured log. The log is stored in a standard database with access controls but without cryptographic integrity protection. Chain-of-thought or reasoning process data is captured if the agent architecture exposes it, but capture is not guaranteed for all decision types. Retention is defined by policy but managed manually (e.g., periodic deletion of records older than the retention period). Replay capability is limited — the organisation can review what happened but cannot reproduce the agent's decisions. This level falls short of the mandatory requirements and has significant forensic limitations: reasoning capture is not guaranteed as 4.1 requires, evidence integrity cannot be cryptographically verified as 4.2 requires, and replay capability is insufficient for root cause determination.
Intermediate Implementation — Evidence capture is implemented as an independent pipeline (sidecar or equivalent) that operates outside the agent's control. All inputs, outputs, and reasoning processes are captured with sub-second timestamps. Evidence is stored in an append-only format with cryptographic hash chaining per AG-006. Retention is automated with tiered storage (hot/warm/cold) meeting regulatory minimums. The organisation can replay agent decisions for recent incidents using preserved inputs and the current model version. Chain-of-custody documentation is maintained for all evidence access. Evidence is indexed and searchable.
Advanced Implementation — All intermediate capabilities plus: deterministic replay environment that can load historical model versions and configurations to reproduce agent decisions exactly as they occurred. Evidence is anchored to an external timestamping authority. Automated forensic analysis tools can reconstruct incident timelines, identify anomalous decision patterns, and correlate agent evidence with external system logs. Evidence preservation has been independently verified for completeness (no gaps in capture), integrity (tamper-evidence is sound), and admissibility (meets evidentiary standards for the applicable jurisdictions). Mean time from forensic request to evidence delivery is under 4 hours.
Required artefacts:
Retention requirements:
Access requirements:
Testing AG-066 compliance requires verification that evidence capture is complete, tamper-evident, and supports forensic replay.
Test 8.1: Capture Completeness
Test 8.2: Capture Independence
Test 8.3: Tamper-Evidence Verification
Test 8.4: Forensic Replay Fidelity
Test 8.5: Retention Compliance
Test 8.6: Chain-of-Custody Integrity
Test 8.7: Performance Impact of Evidence Capture
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 12 (Record-Keeping) | Direct requirement |
| EU AI Act | Article 13 (Transparency) | Supports compliance |
| EU AI Act | Article 62 (Reporting of Serious Incidents) | Supports compliance |
| MiFID II | Article 16(7) (Record-Keeping of Transactions and Orders) | Direct requirement |
| GDPR | Article 5(2) (Accountability Principle) | Direct requirement |
| GDPR | Article 30 (Records of Processing Activities) | Supports compliance |
| DORA | Article 10 (Detection) | Supports compliance |
| FCA SYSC | 9.1 (Record-Keeping) | Direct requirement |
| NIST AI RMF | GOVERN 1.5, MAP 3.5, MANAGE 2.3 | Supports compliance |
| ISO 42001 | Clause 7.5 (Documented Information), Clause 9.1 (Monitoring, Measurement, Analysis and Evaluation) | Supports compliance |
Article 12 requires that high-risk AI systems are designed and developed with capabilities enabling the automatic recording of events ("logs") relevant to identifying risks and facilitating post-market monitoring. The logs must be capable of recording the period of each use, the reference database against which input data has been checked, the input data for which the search has led to a match, and the identification of the natural persons involved in the verification of results. AG-066 implements Article 12 for AI agent deployments by ensuring that operational logs capture the complete decision-making process, not just the decision outcome. The tamper-evidence requirement exceeds Article 12's minimum by ensuring that logs are not only captured but are verifiably authentic.
Article 16(7) requires investment firms to arrange for records to be kept of all services, activities, and transactions sufficient to enable the competent authority to monitor compliance. For AI agents executing trades, this means the complete chain from market data input through reasoning to trade execution must be recorded and retained for at least 5 years (7 years in practice for most firms). AG-066's evidence pipeline satisfies this requirement by continuously capturing the full decision chain, stored with integrity protection that supports regulatory production.
The accountability principle requires controllers to be able to demonstrate compliance with the data protection principles. For AI agents processing personal data, this means the organisation must be able to demonstrate that each processing decision was lawful, fair, and transparent. AG-066's evidence preservation enables this demonstration by preserving the reasoning that led to each processing decision, the legal basis applied, and the data used. Without this evidence, the controller cannot discharge the accountability burden.
Article 10 requires financial entities to have mechanisms to promptly detect anomalous activities. AG-066 supports detection by preserving the operational data that detection mechanisms (AG-064) analyse. Without continuous evidence preservation, detection mechanisms have incomplete data — they can only analyse recent data held in operational buffers, missing patterns that develop over longer periods.
The FCA requires firms to arrange for orderly records to be kept of their business, internal organisation, and compliance with applicable requirements. For AI agent deployments, the FCA expects records sufficient to reconstruct the decision-making process for any agent action. AG-066 ensures these records are captured continuously, stored with integrity protection, and retained for regulatory timescales.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Organisation-wide — affecting the organisation's ability to investigate incidents, comply with regulatory inquiries, and defend itself in legal proceedings |
Consequence chain: Without forensic replay and evidence preservation, the organisation loses the ability to understand its own AI agent operations after the fact. The immediate technical failure is an evidence gap — when an incident occurs, the evidence needed to determine root cause does not exist or cannot be verified. The operational impact cascades through the incident response process: AG-067 (Root Cause and Corrective Action) cannot determine root cause without evidence, which means corrective actions are based on assumptions rather than findings, which means the organisation cannot demonstrate that remediation addresses the actual failure mode. The regulatory impact is severe: regulators expect organisations to be able to explain AI-driven decisions and to demonstrate compliance. An organisation that cannot produce verifiable evidence of its agent's decision-making process faces adverse inferences — the regulator may assume the worst. Under GDPR, inability to demonstrate compliance with the accountability principle is itself a violation. Under MiFID II, failure to maintain adequate records is an independent regulatory breach. The legal impact extends to civil litigation: if the organisation is sued for harm caused by an AI agent and cannot produce evidence of the agent's reasoning, the court may draw adverse inferences. The business consequence includes inability to determine root causes (leading to recurring incidents), regulatory enforcement for inadequate records, adverse inferences in legal proceedings, increased insurance premiums (insurers require evidence capability as a condition of AI liability coverage), and reputational damage when the organisation is seen as unable to explain or account for its AI agents' actions.