AG-190

Governance Reporting Fidelity Governance

Protocolised Ecosystems, Long-Running Tasks & Tomorrow's Agents
AGS v2.1 · April 2026
EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Governance Reporting Fidelity Governance requires that governance dashboards, status reports, and summary indicators accurately reflect the underlying governance state and that every summary metric can be drilled down to the specific evidence that produced it. As governance frameworks scale across hundreds of agents, thousands of policies, and millions of actions, organisations necessarily rely on summary views — traffic-light indicators, compliance scores, aggregate metrics. AG-190 mandates that these summaries are faithful representations of the underlying reality and that the path from any summary indicator to its constituent evidence is navigable, complete, and verifiable. Without this dimension, governance summaries become detached from governance reality, creating a false sense of security that can persist until a major incident reveals the gap.

3. Example

Scenario A — Green Dashboard Conceals Critical Failure: A financial services firm monitors 47 AI agents through a governance dashboard. The dashboard shows a "green" overall compliance status based on an aggregate score: 46 of 47 agents report full compliance. The one non-compliant agent — a high-frequency trading agent responsible for £12,000,000 in daily volume — has a critical mandate enforcement failure that has been masked in the aggregate. The individual agent's compliance entry shows "partial compliance" because it passes 14 of 15 test categories, but the failed category is "mandate enforcement under concurrency" — meaning the agent can exceed its trading limits under concurrent request conditions. The dashboard's aggregation logic counts 14/15 as 93% compliant, and 93% rounds to "green" under the organisation's threshold of 90%. A £3,200,000 loss event occurs when the agent exploits the concurrency gap during a volatile trading session. The board, which relied on the dashboard's green status, asks why they were not informed of the risk. The answer is that the dashboard's aggregation logic concealed a critical failure behind a pass threshold.

What went wrong: The summary metric (aggregate compliance percentage) was arithmetically accurate but semantically misleading. A critical control failure in a high-impact agent carried the same weight as low-risk compliance items in the aggregation. The dashboard had no mechanism to distinguish between "14/15 low-risk items pass" and "14/15 items pass but the one failure is critical." The drill-down path existed but required navigating 4 levels of hierarchy to find the specific failure, and the summary did not signal that a drill-down was warranted. Consequence: £3,200,000 trading loss, FCA enforcement inquiry, board-level governance failure finding.
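The masking effect in Scenario A can be reproduced in a few lines. This is a hypothetical reconstruction of the dashboard's equal-weight aggregation logic — the check names are illustrative; only the 14/15 split and the 90% threshold come from the scenario:

```python
# Hypothetical reconstruction of Scenario A's equal-weight aggregation.
# Check names are illustrative; the 14/15 split and 90% threshold are
# taken from the scenario text.
results = {
    "mandate-enforcement-concurrency": False,  # the one critical failure
    **{f"low-risk-check-{i}": True for i in range(14)},
}

score = sum(results.values()) / len(results)   # 14/15 ~ 0.933
status = "green" if score >= 0.90 else "red"   # 93% clears the 90% bar

print(f"{score:.1%} -> {status}")  # 93.3% -> green: critical failure masked
```

Because every check counts equally, nothing distinguishes a failed low-risk documentation item from a failed mandate-enforcement control.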

Scenario B — Stale Summary Conceals Governance Regression: A healthcare organisation's governance summary shows "98% policy compliance across all clinical agents" — a figure calculated from the most recent quarterly assessment completed 11 weeks ago. Since that assessment, 3 model upgrades, 7 new tool integrations, and 2 policy changes have occurred. The 98% figure is no longer reflective of the current state, but the dashboard has no staleness indicator. A clinical agent, upgraded 5 weeks ago, now has a capability/control mismatch (AG-189) that the 11-week-old assessment did not detect. The agent makes an inappropriate clinical recommendation based on a capability that did not exist at assessment time. A patient adverse event occurs. The investigation reveals that the governance dashboard showed 98% compliance at the time of the event — a figure that was stale by 11 weeks and did not reflect 12 material changes.

What went wrong: The summary metric had no temporal validity indicator. Decision-makers assumed the 98% figure reflected the current state when it actually reflected a state from 11 weeks ago. The dashboard did not display when the assessment was conducted, how many changes had occurred since, or whether the figure was likely to still be accurate. Consequence: Patient adverse event, CQC investigation, loss of clinical commissioning confidence.

Scenario C — Drill-Down Path Is Incomplete: A regulator requests evidence supporting a governance dashboard's claim that "all agents comply with AG-001 mandate enforcement." The organisation attempts to produce the drill-down from the summary indicator to the underlying evidence. The drill-down shows: aggregate pass rate (level 1) → per-agent pass/fail (level 2) → per-test-category pass/fail (level 3). But level 3 links to test results that are stored in a different system, and that system's retention policy deleted results older than 90 days. For 12 of the 47 agents, the most recent retained test results are from 4 months ago — there is no evidence supporting the current compliance claim. The dashboard continued showing "compliant" because it cached the pass status but lost the link to the evidence. The regulator issues a finding for inadequate record-keeping and inability to substantiate governance claims.

What went wrong: The drill-down path was not an unbroken chain from summary to evidence. The summary system and the evidence system had different retention policies. The summary persisted pass/fail status after the evidence supporting it was deleted. The organisation could not substantiate its governance claims when challenged. Consequence: Regulatory finding, governance certification suspended, 6-month remediation programme required.

4. Requirement Statement

Scope: This dimension applies to all organisations that produce governance summaries, dashboards, status reports, compliance scores, or aggregate indicators for AI agent governance. This includes executive dashboards, board reporting packs, regulatory submissions, client-facing compliance attestations, and internal operational monitoring. Any representation of governance state that abstracts, aggregates, or summarises underlying detail is within scope. The scope extends to summaries produced by AI agents themselves — an agent that reports its own compliance status is producing a governance summary that must be faithful and drillable. Organisations that do not produce any aggregated governance views and rely solely on per-agent, per-test raw data are excluded, though such an approach becomes impractical beyond approximately 5 agents.

4.1. A conforming system MUST ensure that every governance summary metric is derivable from specific, identifiable, retained evidence — no summary indicator may exist without a complete, navigable drill-down path to the evidence that produced it.
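One way to make the 4.1 drill-down requirement concrete is to model the chain from summary to evidence as explicit links and verify that every path resolves. The record shapes below are assumptions for illustration, not a prescribed schema:

```python
# Hypothetical sketch of a navigable drill-down chain (requirement 4.1):
# summary -> agent -> test category -> retained evidence. The dataclass
# shapes are illustrative assumptions, not part of AG-190.
from dataclasses import dataclass

@dataclass
class Evidence:
    evidence_id: str
    available: bool = True  # False once expired or deleted

@dataclass
class TestCategory:
    name: str
    evidence: list

@dataclass
class AgentEntry:
    agent_id: str
    categories: list

@dataclass
class SummaryMetric:
    name: str
    agents: list

def drilldown_complete(metric: SummaryMetric) -> bool:
    """True only if every path from the summary resolves to available evidence."""
    for agent in metric.agents:
        for cat in agent.categories:
            if not cat.evidence or not all(e.available for e in cat.evidence):
                return False
    return True

metric = SummaryMetric("AG-001 compliance", [
    AgentEntry("agent-7", [TestCategory("mandate-enforcement",
                                        [Evidence("ev-123")])]),
    AgentEntry("agent-9", [TestCategory("mandate-enforcement",
                                        [Evidence("ev-456", available=False)])]),
])
print(drilldown_complete(metric))  # False: agent-9's evidence has expired
```

Under this model, Scenario C's failure mode (cached pass status outliving its evidence) is detectable as an incomplete chain rather than discovered during a regulatory request.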

4.2. A conforming system MUST display the temporal validity of every summary metric, including when the metric was last calculated, when the underlying evidence was last collected, and how many governance-relevant changes have occurred since the last assessment.
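The three temporal-validity fields in 4.2 can be carried alongside each metric. A minimal sketch, assuming a simple staleness rule (the field names and thresholds are illustrative, not prescribed by AG-190):

```python
# Minimal sketch of the temporal-validity metadata required by 4.2.
# Field names and the staleness thresholds are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class TemporalValidity:
    calculated_at: datetime          # when the metric was last computed
    evidence_collected_at: datetime  # when underlying evidence was gathered
    changes_since_assessment: int    # governance-relevant changes since then

    def is_stale(self, max_age: timedelta, max_changes: int) -> bool:
        age = datetime.now() - self.evidence_collected_at
        return age > max_age or self.changes_since_assessment > max_changes

# Scenario B: an 11-week-old assessment with 12 intervening changes
validity = TemporalValidity(
    calculated_at=datetime.now() - timedelta(weeks=11),
    evidence_collected_at=datetime.now() - timedelta(weeks=11),
    changes_since_assessment=12,
)
print(validity.is_stale(max_age=timedelta(weeks=4), max_changes=5))  # True
```

Displaying `is_stale` (or the raw fields) next to the 98% figure would have told Scenario B's decision-makers that the number no longer reflected the current state.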

4.3. A conforming system MUST implement severity-weighted aggregation — summary metrics MUST NOT give critical control failures the same weight as low-risk compliance items in aggregation logic.

4.4. A conforming system MUST prevent summary metrics from displaying a passing status when any critical-severity control is in a failing state, regardless of the aggregate score.
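Requirements 4.3 and 4.4 compose naturally: a weighted score to stop low-risk items diluting critical ones, plus an unconditional override on any critical failure. A minimal sketch — the weight values are illustrative assumptions, not values AG-190 prescribes:

```python
# Sketch of severity-weighted aggregation (4.3) with the critical override
# from 4.4 layered on top. The weight values are illustrative assumptions.
SEVERITY_WEIGHTS = {"low": 1, "medium": 3, "critical": 10}

def weighted_score(items):
    """items: list of (severity, passed). Critical items dominate the score."""
    total = sum(SEVERITY_WEIGHTS[sev] for sev, _ in items)
    earned = sum(SEVERITY_WEIGHTS[sev] for sev, passed in items if passed)
    return earned / total

def status(items, green_threshold=0.90):
    # 4.4: any failing critical control forces red, regardless of the score
    if any(sev == "critical" and not passed for sev, passed in items):
        return "red"
    return "green" if weighted_score(items) >= green_threshold else "amber"

# Scenario A's profile: 14 passing low-risk items, 1 failing critical item
items = [("low", True)] * 14 + [("critical", False)]
print(round(weighted_score(items), 3))  # 0.583, vs. 0.933 under equal weighting
print(status(items))                    # red
```

The override matters even with weighting in place: a sufficiently large pool of passing low-risk items could otherwise pull a weighted score back above any threshold.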

4.5. A conforming system MUST maintain the drill-down path for the full retention period of the summary — if a summary is retained for 7 years, the evidence supporting it MUST also be retained for 7 years.

4.6. A conforming system MUST flag summary metrics as "unsubstantiated" when the supporting evidence is unavailable, expired, or incomplete, rather than displaying the last known value.

4.7. A conforming system SHOULD implement automated fidelity checks that periodically recalculate summary metrics from underlying evidence and compare the result against the displayed value.
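The fidelity check in 4.7 is essentially a reconciliation job: recompute each metric from retained evidence and diff against the displayed value. A minimal sketch, assuming pass/fail evidence records and a simple pass-rate metric (both assumptions for illustration):

```python
# Minimal fidelity-reconciliation sketch (4.7): recompute each summary
# metric from underlying evidence and flag drift from the displayed value.
# The evidence shape (lists of 1/0 pass flags) is an illustrative assumption.

def fidelity_check(displayed: dict, evidence: dict, tolerance: float = 0.001):
    """Return names of metrics whose displayed value has drifted from evidence."""
    drifted = []
    for name, shown in displayed.items():
        recomputed = sum(evidence[name]) / len(evidence[name])
        if abs(recomputed - shown) > tolerance:
            drifted.append(name)
    return drifted

displayed = {"ag001_pass_rate": 0.98}           # what the dashboard shows
evidence = {"ag001_pass_rate": [1] * 45 + [0] * 2}  # 45 of 47 agents passing
print(fidelity_check(displayed, evidence))  # ['ag001_pass_rate'] -- cached
                                            # value no longer matches evidence
```

Run periodically, this catches exactly the cached-status divergence in Scenario C before a regulator does.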

4.8. A conforming system SHOULD provide one-click (or equivalent single-action) drill-down from any summary indicator to the specific evidence that produced it, without requiring navigation through intermediate systems.

4.9. A conforming system SHOULD implement anomaly detection on summary metrics to flag statistically improbable stability (e.g., a metric that has not changed in 6 months despite continuous agent operation).
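The "statistically improbable stability" check in 4.9 can start very simply: flag any metric whose recent history is a single repeated value. A minimal sketch — the window and the exact-match rule are illustrative assumptions; a production check might use variance thresholds instead:

```python
# Sketch of the improbable-stability check in 4.9: a metric that never
# moves despite continuous operation is itself suspicious. The 180-day
# window and exact-match rule are illustrative assumptions.
from datetime import datetime, timedelta

def flag_frozen_metric(history, window=timedelta(days=180)):
    """history: list of (timestamp, value). Flag if unchanged across window."""
    cutoff = datetime.now() - window
    recent = [value for ts, value in history if ts >= cutoff]
    return len(recent) >= 2 and len(set(recent)) == 1

now = datetime.now()
history = [(now - timedelta(days=30 * i), 0.98) for i in range(7)]
print(flag_frozen_metric(history))  # True: identical value for ~6 months
```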

4.10. A conforming system MAY implement confidence intervals on summary metrics that reflect the age and coverage of the underlying evidence.
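One simple way to realise 4.10 is to widen a metric's uncertainty band as its evidence ages. The decay rate below is an illustrative assumption, not a value AG-190 prescribes:

```python
# Illustrative confidence band for 4.10: the displayed metric's uncertainty
# widens as the underlying evidence ages. The decay rate is an assumption.
def confidence_band(value: float, evidence_age_days: int,
                    base_margin: float = 0.01, decay_per_week: float = 0.005):
    margin = base_margin + decay_per_week * (evidence_age_days / 7)
    return (round(max(0.0, value - margin), 3),
            round(min(1.0, value + margin), 3))

print(confidence_band(0.98, evidence_age_days=0))   # (0.97, 0.99)
print(confidence_band(0.98, evidence_age_days=77))  # (0.915, 1.0): 11 weeks old
```

By Scenario B's 11-week mark, the band around "98%" would visibly signal that the figure could no longer be read at face value.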

4.11. A conforming system MAY implement counter-factual summary views — showing what the summary would display under alternative aggregation methodologies to highlight sensitivity to aggregation choices.

5. Rationale

Governance at scale requires abstraction. No board member, regulator, or senior manager can process the raw output of every governance test across every agent in real time. Summaries are essential — they enable human oversight of systems too complex for detailed human review. But summaries introduce a new risk: the summary may not faithfully represent the underlying reality.

Summary infidelity arises from multiple sources. Aggregation logic can mask critical failures behind passing scores — a 95% compliance rate sounds strong until you discover that the 5% failure includes the mandate enforcement control for your highest-value agent. Temporal staleness means the summary reflects a past state, not the current one — a 98% compliance figure from 3 months ago may bear no resemblance to today's actual compliance after intervening changes. Drill-down gaps mean the summary cannot be substantiated — the dashboard says "compliant" but the evidence that would prove it has been deleted, migrated, or was never collected.

These are not theoretical risks. They are the governance equivalent of "dashboard-driven management" — decisions based on indicators that have become detached from the reality they claim to represent. In financial services, this is sometimes called "green screen syndrome" — every indicator shows green, creating complacency that persists until a red event that the indicators failed to predict.

AG-190 addresses summary infidelity through three mechanisms. First, fidelity requirements ensure that summary metrics are accurate representations of the underlying evidence at the time they are displayed. Second, temporal validity requirements ensure that decision-makers know how current the summary is. Third, drill-down integrity requirements ensure that every summary can be substantiated by navigating from the aggregate to the specific evidence that produced it, and that this navigation path remains intact for the full retention period.

The dimension is detective rather than preventive: it identifies divergence between summaries and the governance reality they describe, rather than preventing the underlying governance failures themselves. It complements the preventive and recovery controls in other dimensions by ensuring that those controls' outputs are faithfully represented in the governance oversight layer.

6. Implementation Guidance

AG-190 implementation requires a summary fidelity framework, temporal validity tracking, and drill-down architecture.

Recommended Patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Board-level governance reporting for AI agents must align with existing risk reporting frameworks. The summary should integrate with the firm's operational risk dashboard, with governance metrics presented alongside other operational risk indicators. FCA expectations for board reporting (SYSC 4.3A) apply to AI governance summaries — the board must receive information that is "timely, accurate, and complete."

Healthcare. Clinical governance dashboards must distinguish between administrative compliance (documentation, training records) and clinical safety compliance (clinical decision-making controls, patient safety monitoring). A clinical safety failure must not be aggregated with administrative items. CQC inspection frameworks require that governance reports are substantiable — inspectors will drill down.

Public Sector. Governance summaries for public-sector AI agents may be subject to Freedom of Information requests. The summary and its drill-down path must be comprehensible to non-specialist reviewers. The Algorithmic Transparency Recording Standard (ATRS) in the UK requires public-sector organisations to publish information about algorithmic tools — governance summary fidelity ensures that published information is accurate.

Maturity Model

Basic Implementation — Governance summaries exist with drill-down to per-agent, per-test detail. Aggregation is equal-weight (no severity weighting). Summaries are refreshed quarterly or on demand. The drill-down path covers 2 levels (aggregate → per-agent). Evidence is retained but not lifecycle-linked to the summary. This meets minimum requirements but is vulnerable to critical-failure masking, temporal staleness, and evidence-summary retention misalignment.

Intermediate Implementation — Severity-weighted aggregation with critical override is implemented. Every summary metric displays temporal validity indicators (calculated date, evidence date, changes since). Drill-down covers 3+ levels (aggregate → per-agent → per-test → specific evidence). Evidence retention is lifecycle-linked to summary retention. Daily automated fidelity reconciliation runs. Unsubstantiated metrics are flagged.

Advanced Implementation — All intermediate capabilities plus: real-time summary computation from underlying evidence (no caching), anomaly detection on summary stability, confidence intervals on summary metrics, counter-factual aggregation views, integration with the organisation's enterprise risk management dashboard, and independent audit of summary fidelity as part of the annual governance review.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Drill-Down Completeness

Test 8.2: Critical Override Enforcement

Test 8.3: Temporal Validity Display

Test 8.4: Fidelity Reconciliation

Test 8.5: Evidence Retention Alignment

Test 8.6: Severity-Weighted Aggregation Accuracy

Conformance Scoring

9. Regulatory Mapping

| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 13 (Transparency) | Direct requirement |
| EU AI Act | Article 11 (Technical Documentation) | Supports compliance |
| FCA SYSC | 4.3A (Management Reporting) | Direct requirement |
| SOX | Section 302 (Corporate Responsibility for Financial Reports) | Supports compliance |
| NIST AI RMF | GOVERN 1.5, MEASURE 4.1 | Supports compliance |
| ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis and Evaluation) | Direct requirement |
| UK ATRS | Algorithmic Transparency Recording Standard | Supports compliance |
| CQC | Well-Led Framework — Governance and Management | Supports compliance |

EU AI Act — Article 13 (Transparency)

Article 13 requires that high-risk AI systems "are designed and developed in such a way as to ensure that their operation is sufficiently transparent to enable deployers to interpret a system's output and use it appropriately." Governance summaries are a primary mechanism through which deployers understand the governance state of their AI systems. If these summaries are unfaithful — showing compliance when the underlying reality is non-compliant — the transparency requirement is defeated. AG-190 ensures that governance summaries meet the spirit of Article 13 by being accurate, current, and substantiable.

FCA SYSC — 4.3A (Management Reporting)

SYSC 4.3A requires that "appropriate management reporting" is provided to the governing body and senior management. For AI agent governance, this means that board-level reporting must accurately reflect the governance state. The FCA has made clear through supervisory practice that it expects board reports to be timely, accurate, and actionable. A governance dashboard that shows green when a critical control is failing does not meet this standard. AG-190's severity-weighted aggregation with critical override ensures that management reporting is not misleading.

SOX — Section 302 (Corporate Responsibility for Financial Reports)

Section 302 requires CEO and CFO certification that financial reports are accurate and that internal controls are effective. If AI agents contribute to financial operations and the governance summary that supports the officer's certification is unfaithful, the officer may be certifying based on misleading information. AG-190's fidelity requirements ensure that the governance information supporting officer certifications is substantiable.

ISO 42001 — Clause 9.1 (Monitoring, Measurement, Analysis and Evaluation)

Clause 9.1 requires organisations to determine "what needs to be monitored and measured," "the methods for monitoring, measurement, analysis and evaluation," and "when the results shall be analysed and evaluated." AG-190 directly addresses the fidelity of this monitoring and measurement output — ensuring that the methods produce accurate results and that the analysis and evaluation are substantiable.

10. Failure Severity

| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Organisation-wide — affects all governance oversight and decision-making that relies on summary indicators |

Consequence chain: Summary infidelity creates a meta-governance failure — the governance of governance itself is broken. Decision-makers (board members, senior managers, regulators) rely on summary indicators to allocate attention and resources. If the summary shows green, attention goes elsewhere; if it shows red, attention is directed to the problem. An unfaithful summary misdirects attention systematically, allowing actual governance failures to persist unaddressed while resources are consumed by non-critical items that the summary highlights. The consequence is delayed detection and response to governance failures — the underlying failure exists, but the summary conceals it from the people who could authorise remediation. In Scenario A, the trading loss was £3,200,000 because the critical control failure persisted for weeks behind a green dashboard. If the summary had faithfully reflected the critical failure, remediation would have occurred before the loss event. The regulatory consequence is compounded: not only did the governance control fail, but the governance reporting failed to alert the governing body — creating a dual finding for both inadequate controls and inadequate management information.

Cross-references: AG-001 (Operational Boundary Enforcement) — mandate enforcement status is a critical summary metric that must not be masked by aggregation; AG-007 (Governance Configuration Control) — summary methodology is a governed configuration artefact; AG-153 (Control Efficacy Measurement) — efficacy measurements feed summary metrics and must maintain drill-down integrity; AG-019 (Human Escalation & Override Triggers) — summary infidelity may suppress escalation triggers that depend on summary status; AG-189 (Capability/Control Mismatch Detection Governance) — mismatch status must be faithfully represented in summaries, not masked by aggregation with lower-risk metrics; AG-191 (Multi-Human Authority Conflict Governance) — governance summaries must accurately reflect unresolved authority conflicts.

Cite this protocol
AgentGoverning. (2026). AG-190: Governance Reporting Fidelity Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-190