Video and Screen Evidence Governance requires that organisations operating AI agents with computer-use capabilities — agents that interact with graphical user interfaces, navigate web pages, fill in forms, click buttons, and read on-screen content — capture, classify, redact, retain, and produce visual evidence of those interactions in a manner that is forensically sound, privacy-compliant, and auditable. Computer-use agents create a unique evidentiary challenge: unlike text-only agents whose actions are fully captured by structured logs, computer-use agents interact with visual interfaces where the full context of a decision — what was visible on screen, what was clicked, what changed — is only available as visual evidence. Without governed visual capture, the organisation cannot reconstruct what the agent saw, cannot verify that the agent acted on accurate information, and cannot demonstrate to regulators or courts that the agent's actions were appropriate given the visual context it operated within.
Scenario A — Missing Visual Evidence Renders Incident Investigation Impossible: A financial-value agent operating through a computer-use interface processes 340 trade confirmations per day by navigating a broker portal, reading confirmation details from the screen, and entering settlement instructions into a treasury management system. On a Tuesday afternoon, the agent enters incorrect settlement amounts for 17 trades totalling £4.2 million in discrepancies. The structured action log records "read confirmation screen" and "entered settlement amount £247,000" but does not capture what was actually displayed on the broker portal screen at the moment of reading. Investigation cannot determine whether the broker portal displayed incorrect information (a portal defect), the agent misread the screen (an agent perception error), or the screen was manipulated by a third party between display and capture. The organisation cannot attribute the error, the broker denies any portal defect, and the £4.2 million discrepancy must be absorbed.
What went wrong: The system logged agent actions in structured form but did not capture the visual evidence — the actual screen content at the moment of each read operation. Without visual evidence, the organisation has no forensic basis to distinguish between a portal error, an agent perception error, and an adversarial manipulation. Consequence: £4.2 million unattributable loss, inability to perform root cause analysis, regulatory concern under FCA SYSC 6.1.1R about adequacy of records and controls, and £380,000 in external forensic investigation costs that ultimately proved inconclusive.
Scenario B — Unredacted Visual Evidence Creates Privacy Breach: An enterprise workflow agent uses computer-use capabilities to process employee onboarding by navigating the HR portal, reading personal details, and entering them into payroll and benefits systems. The visual evidence capture system records full-screen video of every interaction at 2 frames per second. Over 14 months, the system accumulates 2.3 terabytes of video containing unredacted images of employee national insurance numbers, bank account details, home addresses, dates of birth, passport photographs, and medical information from benefits enrolment screens. A storage migration exposes the video archive to a broader access group than intended. 4,700 employees' personal data is visible in unredacted screen recordings accessible to 85 IT operations staff who have no legitimate need to view this data.
What went wrong: Visual evidence was captured without data classification or redaction. The capture system treated the entire screen as a single undifferentiated artefact, making no distinction between operational data (which buttons were clicked, what forms were submitted) and sensitive personal data (national insurance numbers, bank details, medical information) visible on the same screen. Consequence: GDPR Article 5(1)(c) violation for data minimisation failure, GDPR Article 32 finding for inadequate security of personal data, ICO investigation, £1.2 million remediation cost including retroactive redaction of 2.3 terabytes of video, and a notification obligation to 4,700 affected employees.
Scenario C — Visual Evidence Without Chain of Custody Is Inadmissible: A public-sector agent operating a computer-use interface processes benefit eligibility determinations by navigating government databases, reading applicant information, and recording eligibility decisions. An applicant challenges their denial, claiming the agent operated on incorrect database information. The organisation produces screen recordings showing the database information the agent viewed. However, the recordings are stored as standard video files with no cryptographic integrity protection, no timestamp authentication, and no chain-of-custody record. The applicant's legal counsel argues that the recordings could have been modified after the fact and are therefore inadmissible as evidence. The administrative tribunal agrees, noting that the recordings lack any mechanism to verify their authenticity or demonstrate they have not been tampered with.
What went wrong: Visual evidence was captured but not governed with forensic integrity standards. The video files were treated as operational data rather than evidentiary artefacts. Without cryptographic hashing, timestamping, and chain-of-custody records, the visual evidence has no more evidentiary weight than an unsigned, undated screenshot. Consequence: Inadmissible evidence in tribunal proceedings, inability to defend the agent's decision, adverse finding requiring benefit payment of £34,000, precedent risk for 12,000 similar determinations, and £560,000 in legal costs and settlement exposure.
Scope: This dimension applies to any AI agent deployment where the agent interacts with graphical user interfaces through computer-use capabilities — including screen reading, mouse movement, clicking, keyboard input, form filling, navigation, and any other interaction with visual interfaces. The scope covers browser-based interactions, desktop application interactions, terminal-based interactions rendered visually, and remote desktop or virtual machine interactions. Any deployment where the agent's decision-making context includes visual information that is not fully captured by structured text logs falls within scope. Agents that operate exclusively through programmatic interfaces (structured API calls with no visual component) are outside scope unless they trigger visual interactions in downstream systems. The scope extends to the full lifecycle of visual evidence: capture, classification, redaction, integrity protection, storage, retention, access control, production for audit or legal purposes, and eventual deletion.
4.1. A conforming system MUST capture visual evidence of agent-screen interactions at a frame rate and resolution sufficient to reconstruct the complete visual context of every agent decision and action, with a minimum of one frame captured at the moment of each discrete agent action (click, keystroke, navigation, form submission).
4.2. A conforming system MUST associate each captured visual frame or recording segment with a unique identifier that links to the corresponding structured action log entry, creating a verifiable bidirectional mapping between visual evidence and action logs.
4.3. A conforming system MUST apply data classification to captured visual evidence, identifying regions of each frame that contain sensitive data categories (personally identifiable information, financial account details, authentication credentials, health information, classified government data) as defined by AG-014.
4.4. A conforming system MUST implement automated or human-supervised redaction of classified sensitive data regions before visual evidence is stored in general evidence repositories, retaining unredacted originals only in access-controlled forensic archives with documented justification for retention.
4.5. A conforming system MUST protect the integrity of visual evidence using cryptographic mechanisms — at minimum, a cryptographic hash computed at the point of capture and stored in a tamper-evident log per AG-006 — sufficient to detect any modification to the evidence after capture.
4.6. A conforming system MUST maintain chain-of-custody records for all visual evidence, documenting every access, copy, transfer, and transformation (including redaction) with timestamps, actor identities, and purpose.
4.7. A conforming system MUST define and enforce retention periods for visual evidence that comply with applicable regulatory requirements and AG-016, including the ability to locate and delete visual evidence when required by data subject erasure requests.
4.8. A conforming system SHOULD implement differential capture strategies that increase capture frequency and resolution during high-risk operations (financial transactions, access to sensitive data, irreversible actions) and reduce capture frequency during low-risk operations (routine navigation, idle screens) to balance evidentiary completeness with storage efficiency.
4.9. A conforming system SHOULD implement optical character recognition or equivalent extraction on captured visual evidence to create searchable text indices, enabling investigators to locate specific visual evidence by content (e.g., "find all frames where account number ending in 4729 was visible").
4.10. A conforming system MAY implement real-time visual anomaly detection that compares captured screen content against expected visual patterns, flagging unexpected content (phishing pages, error screens, unexpected modal dialogs) for immediate review.
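The differential capture strategy in 4.8 can be sketched as a lookup from action type to capture profile. This is a minimal illustration, not a prescribed design: the action names, frame rates, and scale factors are hypothetical, and the mandatory one-frame-per-action floor of 4.1 still applies regardless of the profile selected.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"    # routine navigation, idle screens
    HIGH = "high"  # financial transactions, sensitive data access, irreversible actions


@dataclass(frozen=True)
class CaptureProfile:
    frames_per_second: float
    scale: float  # fraction of native resolution retained


# Hypothetical values; a real deployment would tune these to its storage budget.
PROFILES = {
    Risk.LOW: CaptureProfile(frames_per_second=0.2, scale=0.5),
    Risk.HIGH: CaptureProfile(frames_per_second=5.0, scale=1.0),
}

# Hypothetical action types treated as high risk.
HIGH_RISK_ACTIONS = {"submit_payment", "view_sensitive_record", "delete_record"}


def classify_risk(action_type: str) -> Risk:
    """Map an action type to a risk level; unknown actions default to LOW here."""
    return Risk.HIGH if action_type in HIGH_RISK_ACTIONS else Risk.LOW


def capture_profile_for(action_type: str) -> CaptureProfile:
    """Return the capture profile to use while this action is in progress."""
    return PROFILES[classify_risk(action_type)]
```

Defaulting unknown actions to low risk keeps the sketch short; a more conservative deployment might default to high risk instead so that unclassified actions are captured at full density.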
Computer-use agents represent a fundamental shift in the observability landscape. Traditional software agents interact with systems through APIs, producing structured request-response logs that fully capture the interaction context. Computer-use agents interact through visual interfaces — the same interfaces designed for human users. The critical difference is that the decision-making context for a computer-use agent includes everything visible on the screen: text, layout, images, colours, positions of buttons, the presence or absence of warnings, the content of adjacent panels, and the overall visual state of the application. Structured action logs capture what the agent did (clicked button at coordinates 847,312) but not why it did it (the button was labelled "Approve" and appeared next to the applicant's verified identity badge). Without visual evidence, the "why" is lost.
This gap has three consequences. First, incident investigation becomes impossible or inconclusive. When an agent takes an incorrect action, investigators need to determine whether the agent perceived the screen correctly, whether the screen displayed correct information, and whether the agent's decision was appropriate given what was displayed. Without visual evidence, all three questions are unanswerable — the structured log says "agent read screen and took action" but provides no basis for evaluating the screen content that informed the action. Second, regulatory and legal proceedings require evidence of what the agent operated on, not merely what it did. A record stating "agent determined applicant ineligible" is insufficient — regulators and tribunals need to see what information the agent used to make that determination. Visual evidence provides this. Third, adversarial manipulation of visual interfaces (phishing pages, injected content, visual spoofing) can only be detected if visual evidence is preserved for analysis.
The privacy tension is equally significant. Visual evidence captures everything on screen, which may include highly sensitive personal data, authentication credentials, and confidential business information that appears in adjacent windows, background applications, or peripheral screen regions unrelated to the agent's task. Unredacted visual evidence is a privacy liability — it captures data that was never intended to be recorded and retains it in a form (video) that is difficult to search, classify, or selectively delete. The governance challenge is to capture enough visual evidence for forensic purposes while protecting against the privacy risks inherent in recording everything the screen displays.
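One way to picture region-based redaction is as a function that overwrites classified rectangles in a frame before the frame reaches general storage. The sketch below assumes a frame is a grid of grayscale pixel values and that a separate classifier has already produced the sensitive regions; the `SensitiveRegion` type and `redact` function are illustrative names, not part of any referenced standard.

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[int]]  # grayscale pixels, indexed as frame[y][x]


@dataclass(frozen=True)
class SensitiveRegion:
    x: int
    y: int
    width: int
    height: int
    category: str  # e.g. "pii", "credentials" (taxonomy assumed to come from AG-014)


def redact(frame: Frame, regions: List[SensitiveRegion], fill: int = 0) -> Frame:
    """Return a copy of the frame with every sensitive region overwritten by `fill`."""
    redacted = [row[:] for row in frame]  # copy so the original stays untouched
    height = len(redacted)
    width = len(redacted[0]) if redacted else 0
    for r in regions:
        # Clamp each rectangle to the frame bounds.
        for y in range(max(r.y, 0), min(r.y + r.height, height)):
            for x in range(max(r.x, 0), min(r.x + r.width, width)):
                redacted[y][x] = fill
    return redacted


# Usage: a 6x4 white frame with one PII region covering x=1..3, y=1..2.
frame = [[255] * 6 for _ in range(4)]
regions = [SensitiveRegion(x=1, y=1, width=3, height=2, category="pii")]
redacted = redact(frame, regions)
```

Returning a copy rather than mutating in place reflects the split in 4.4: the redacted copy goes to the general repository, while the untouched original can be routed to an access-controlled forensic archive if retention is justified.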
The integrity dimension is critical because visual evidence is trivially modifiable. Unlike structured log entries that can be validated against schemas and cross-referenced with other system records, a video file or screenshot has no inherent integrity guarantee. A modified screenshot is visually indistinguishable from an original unless cryptographic integrity protection was applied at the point of capture. For visual evidence to have evidentiary value in regulatory proceedings, legal disputes, or internal investigations, it must be protected against modification from the moment of capture through the entire retention period.
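As an illustration of hashing at the point of capture, the sketch below computes a SHA-256 digest for each frame and appends it to a hash-chained log, so that altering a frame or an earlier log entry becomes detectable. The entry fields and function names are assumptions made for this example; the actual tamper-evident log format is whatever AG-006 specifies, and the sketch does not address timestamp authentication.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def frame_digest(frame_bytes: bytes) -> str:
    """SHA-256 digest of the raw frame bytes, computed at capture time."""
    return hashlib.sha256(frame_bytes).hexdigest()


def _entry_hash(body: Dict[str, str]) -> str:
    # Canonical JSON (sorted keys) so the hash is stable across runs.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, str]], frame_id: str,
                 frame_bytes: bytes, captured_at: str) -> Dict[str, str]:
    """Append a log entry that commits to the frame and to the previous entry."""
    body = {
        "frame_id": frame_id,
        "captured_at": captured_at,
        "frame_sha256": frame_digest(frame_bytes),
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS,
    }
    entry = {**body, "entry_hash": _entry_hash(body)}
    log.append(entry)
    return entry


def verify_log(log: List[Dict[str, str]]) -> bool:
    """Check that every entry hash is correct and the chain is unbroken."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev or entry["entry_hash"] != _entry_hash(body):
            return False
        prev = entry["entry_hash"]
    return True


def verify_frame(entry: Dict[str, str], frame_bytes: bytes) -> bool:
    """Check that stored frame bytes still match the digest recorded at capture."""
    return entry["frame_sha256"] == frame_digest(frame_bytes)
```

A chain like this only detects modification; it does not prevent it, and it does not by itself prove when a frame was captured. Anchoring the latest entry hash in an independent system is one common way to strengthen it.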
Visual evidence governance requires a capture pipeline that records screen interactions, a classification and redaction pipeline that protects sensitive data, an integrity pipeline that ensures forensic soundness, and a lifecycle management pipeline that handles retention, access, and deletion. These pipelines must operate in near-real-time for capture and classification, and must be reliable enough that evidence gaps do not occur during critical operations.
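The capture stage of such a pipeline can be sketched as a store that writes the frame and the action log entry together, each carrying the other's identifier, plus a check that reports any link that fails to resolve in both directions. The class and field names here are hypothetical and the in-memory dictionaries stand in for whatever durable stores a deployment actually uses.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ActionLogEntry:
    action_id: str
    action_type: str
    frame_id: Optional[str] = None  # link from action to frame


@dataclass
class FrameRecord:
    frame_id: str
    action_id: str  # link from frame back to action
    image: bytes


@dataclass
class EvidenceStore:
    actions: Dict[str, ActionLogEntry] = field(default_factory=dict)
    frames: Dict[str, FrameRecord] = field(default_factory=dict)

    def record_action(self, action_type: str, image: bytes) -> ActionLogEntry:
        """Store one frame captured at the moment of the action, linked both ways."""
        action_id = str(uuid.uuid4())
        frame_id = str(uuid.uuid4())
        self.frames[frame_id] = FrameRecord(frame_id, action_id, image)
        entry = ActionLogEntry(action_id, action_type, frame_id)
        self.actions[action_id] = entry
        return entry

    def unlinked(self) -> List[str]:
        """IDs of actions or frames whose link does not resolve in both directions."""
        broken: List[str] = []
        for a in self.actions.values():
            f = self.frames.get(a.frame_id) if a.frame_id else None
            if f is None or f.action_id != a.action_id:
                broken.append(a.action_id)
        for f in self.frames.values():
            a = self.actions.get(f.action_id)
            if a is None or a.frame_id != f.frame_id:
                broken.append(f.frame_id)
        return broken
```

Running a check like `unlinked()` on a schedule is one way to surface evidence gaps before an investigation depends on the missing frame.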
Recommended patterns:
Anti-patterns to avoid:
Financial Services. Financial regulators (FCA, SEC, BaFin) increasingly require evidence of the information agents operated on, not merely what actions they took. For computer-use agents processing trades, confirmations, or settlement instructions through broker portals, visual evidence is the only way to demonstrate what the portal displayed at the moment of each action. Firms should treat visual evidence from financial agent operations as MiFID II transaction records, with corresponding retention periods (minimum 5 years, extending to 7 in some jurisdictions). Financial screens frequently display account numbers, IBAN codes, and client names that require redaction under data minimisation principles.
Healthcare. Clinical agents interacting with electronic health record interfaces through computer-use capabilities capture screens containing protected health information (PHI). Visual evidence governance must comply with jurisdiction-specific health data regulations (HIPAA in the US, UK GDPR and the NHS Data Security and Protection Toolkit in the UK). Region-based redaction must be calibrated for clinical data patterns — medication names and dosages visible on prescription screens, diagnostic codes, and patient identifiers.
Public Sector. Government agents processing benefit claims, immigration applications, or law enforcement queries through computer-use interfaces create visual evidence that may be subject to freedom of information requests, tribunal discovery requirements, and human rights challenges. Visual evidence must be retained with sufficient integrity to withstand legal scrutiny while being redactable to protect third-party data visible on shared government screens.
Crypto and Web3. Agents interacting with decentralised application interfaces, wallet interfaces, and blockchain explorer screens through computer-use capabilities capture visual evidence of transaction authorisations, wallet addresses, and token balances. Visual evidence integrity is particularly important because on-chain transactions are irreversible — the visual evidence of what the agent saw at the moment of authorisation is the only record of the pre-transaction context.
Basic Implementation — Visual evidence is captured at a minimum of one frame per agent action with bidirectional linking to structured action logs. Data classification identifies PII and sensitive data regions. Redaction removes sensitive data before general storage. Cryptographic hashes are computed at capture and stored in a tamper-evident log. Chain-of-custody records document every access, copy, transfer, and transformation. Retention periods are defined and enforced. This level meets the minimum mandatory requirements of 4.1 through 4.7.
Intermediate Implementation — All basic capabilities plus: differential capture strategies increase capture density for high-risk operations. Optical character recognition creates searchable text indices across visual evidence. Chain-of-custody records track all access and transformations. Storage tiering aligns with retention periods. Region-based redaction is automated with human review for uncertain classifications. Evidence production workflows can assemble visual evidence packages for regulators within 48 hours.
Advanced Implementation — All intermediate capabilities plus: real-time visual anomaly detection flags unexpected screen content during agent operations. Visual evidence is integrated with the decision journal (AG-415) to provide a complete multimodal record of agent decisions. Independent forensic review has validated the integrity pipeline. Adversarial testing has confirmed that visual evidence cannot be modified without detection. Cross-session visual evidence analysis identifies patterns across thousands of agent interactions.
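The searchable text index mentioned at the intermediate level can be illustrated with a small inverted index over text already extracted from frames. The OCR step itself is out of scope here and is assumed to have produced a string per frame; `FrameTextIndex` is a hypothetical name for this sketch.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def _tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; punctuation and whitespace are dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FrameTextIndex:
    """Inverted index from token to the set of frame IDs where it was visible."""

    def __init__(self) -> None:
        self._index: Dict[str, Set[str]] = defaultdict(set)

    def add(self, frame_id: str, extracted_text: str) -> None:
        """Index the text extracted from one frame (OCR output is assumed)."""
        for token in _tokens(extracted_text):
            self._index[token].add(frame_id)

    def search(self, query: str) -> Set[str]:
        """Return frame IDs containing every token in the query."""
        tokens = _tokens(query)
        if not tokens:
            return set()
        matches = [self._index.get(t, set()) for t in tokens]
        return set.intersection(*matches)
```

Because the index holds whatever text was on screen, it carries the same sensitivity as the frames themselves. Building it from redacted evidence, or placing it under the same access controls as the forensic archive, avoids creating a second unredacted copy of the data.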
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Visual Capture Completeness at Action Points
Test 8.2: Sensitive Data Redaction Effectiveness
Test 8.3: Cryptographic Integrity Verification
Test 8.4: Chain-of-Custody Record Completeness
Test 8.5: Retention Period Enforcement and Erasure Compliance
Test 8.6: Bidirectional Evidence Linking Integrity
Test 8.7: Differential Capture Rate Adjustment
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 12 (Record-Keeping) | Direct requirement |
| EU AI Act | Article 14 (Human Oversight) | Supports compliance |
| SOX | Section 802 (Criminal Penalties for Altering Documents) | Supports compliance |
| FCA SYSC | 9.1.1R (Record-Keeping) | Direct requirement |
| NIST AI RMF | GOVERN 1.4, MEASURE 2.6 | Supports compliance |
| ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis and Evaluation) | Supports compliance |
| DORA | Article 17 (ICT-Related Incident Management Process) | Supports compliance |
Article 12 of the EU AI Act requires that high-risk AI systems are designed and developed with logging capabilities that enable the recording of events relevant to identifying risks and facilitating post-market monitoring. For computer-use agents, the "events relevant to identifying risks" include the visual context in which the agent operated: what was displayed on screen when the agent made each decision. Structured action logs alone do not satisfy Article 12 when the agent's decision-making context was visual. Visual evidence governance ensures that the logging capabilities extend to the visual domain, capturing the complete informational context available to the agent at each decision point.
FCA SYSC 9.1.1R requires firms to maintain adequate records of matters and dealings, including records sufficient to enable the FCA to monitor the firm's compliance. For computer-use agents operating in financial services (processing trades, managing client data, or executing transactions through broker portals), visual evidence constitutes a record of the dealing. The visual capture shows what the agent saw on the broker portal (the prices, quantities, and counterparties displayed), what data the agent entered, and whether the visual information matched the structured log. Without visual records, the firm cannot demonstrate adequate record-keeping for agent-mediated dealings.
Section 802 of the Sarbanes-Oxley Act (SOX) imposes severe penalties for the alteration, destruction, or concealment of documents relevant to an official proceeding. Visual evidence from computer-use agents operating in financial reporting contexts constitutes such a document. The integrity requirements of AG-411 (cryptographic hashing, chain-of-custody records) directly support Section 802 compliance by making any alteration detectable and any access traceable. Organisations must ensure that visual evidence integrity mechanisms are robust enough to satisfy the evidentiary standards that Section 802 implicitly requires.
DORA Article 17 requires financial entities to have an ICT-related incident management process, including the ability to detect, manage, and report incidents. For computer-use agents, visual evidence provides critical incident management capability: when an agent interaction with a financial system produces an unexpected outcome, the visual evidence enables rapid determination of whether the issue was an agent error, a system error, or an adversarial intervention. Without visual evidence, incident management for computer-use agents is limited to structured log analysis, which may be insufficient for visual-interface interactions.
In the NIST AI RMF, GOVERN 1.4 addresses organisational practices for AI risk management documentation. Visual evidence extends documentation to the visual domain for computer-use agents. MEASURE 2.6 addresses the evaluation of AI system performance in operational conditions. Visual evidence enables performance evaluation that includes the visual context, assessing not just what the agent did, but whether what it did was appropriate given what it saw.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Cross-system — affects all computer-use agent deployments, incident investigation capability, regulatory evidence production, and legal defensibility of agent-mediated decisions |
Consequence chain: Without visual evidence governance, the organisation loses the ability to reconstruct what computer-use agents saw and why they acted as they did. The immediate technical failure is an evidentiary gap — structured logs record actions but not the visual context that informed those actions. The operational impact is that incident investigation becomes inconclusive: when a computer-use agent makes an error, the organisation cannot distinguish between agent perception failure, source system error, and adversarial manipulation. The regulatory consequence is a record-keeping deficiency: regulators reviewing agent operations cannot verify the informational basis for agent decisions, which undermines the organisation's ability to demonstrate compliance with oversight requirements. The privacy consequence of ungoverned visual capture is equally severe in the opposite direction: capturing visual evidence without classification, redaction, and access controls creates a growing archive of sensitive personal data that violates data minimisation principles and creates breach exposure. The legal consequence is that visual evidence without integrity protection is inadmissible — the organisation captures evidence at significant cost but cannot use it when needed because it lacks the forensic properties (authenticity, integrity, chain of custody) required for evidentiary weight. The combined effect is that the organisation bears the storage and operational costs of visual evidence without receiving the forensic, regulatory, or legal benefits, while simultaneously creating privacy liability from ungoverned sensitive data capture.
Cross-references: AG-006 (Tamper-Evident Record Integrity) provides the integrity mechanism for visual evidence hashes. AG-014 (Data Classification Governance) defines the classification taxonomy applied to visual evidence regions. AG-015 (PII & Sensitive Data Handling) governs the handling of personal data captured in visual evidence. AG-016 (Data Retention & Right to Erasure) governs retention periods and erasure obligations. AG-049 (Explainability Governance) benefits from visual evidence as an explainability artefact. AG-409 (Critical Event Taxonomy Governance) defines which events trigger elevated visual capture. AG-410 (High-Cardinality Trace Retention Governance) addresses retention strategies for high-volume evidence. AG-415 (Decision Journal Completeness Governance) integrates visual evidence into the decision record. AG-416 (Evidentiary Chain-of-Custody Governance) provides the broader chain-of-custody framework.