AG-364

Conversation Summarisation Fidelity Governance

Prompt, Context & Session Management · ~16 min read · AGS v2.1 · April 2026
Regulatory tags: EU AI Act · GDPR · SOX · FCA · NIST · ISO 42001

2. Summary

Conversation Summarisation Fidelity Governance requires that when session histories are summarised — whether for context window management, handoff preparation, audit logging, or knowledge extraction — the summaries preserve material uncertainty, decisions, commitments, and constraints with sufficient specificity for any downstream consumer to act correctly. Summarisation is lossy by nature: it compresses detailed interaction history into shorter representations. The governance risk arises when the compression loses material information — when a specific commitment becomes a vague reference, when an expressed uncertainty becomes a false certainty, when a conditional decision becomes an unconditional one, or when a constraint is omitted entirely. This dimension mandates that summarisation is governed as a fidelity-critical process with defined preservation requirements, quality verification, and monitoring.

3. Example

Scenario A — Commitment Specificity Lost in Summary: A financial advice agent conducts a 40-turn consultation. In turn 12, the agent commits: "Based on your circumstances, I recommend the Standard Growth Fund with a maximum investment of £75,000, conditional on completing the risk questionnaire by 31 March 2026." The session is summarised for handoff to a human advisor. The summary states: "Agent recommended Standard Growth Fund." The conditionality (risk questionnaire completion), the amount cap (£75,000), and the deadline (31 March 2026) are all lost. The human advisor processes the investment for £120,000 without the risk questionnaire. The customer later claims they were told the cap was £75,000 and the questionnaire was required. Remediation cost: £45,000 plus regulatory scrutiny.

What went wrong: The summarisation preserved the topic (fund recommendation) but lost the specifics that made the commitment actionable and bounded. The conditional nature of the recommendation was reduced to an unconditional assertion. The human advisor had no way of knowing that conditions and limits existed because the summary did not preserve them.

Scenario B — Material Uncertainty Converted to False Certainty: A clinical decision support agent discusses a patient's symptoms over 18 turns. The agent states in turn 7: "The symptoms are consistent with three possible diagnoses: condition A (most likely at approximately 60% probability), condition B (approximately 25%), or condition C (approximately 15%). I recommend further testing before confirming any diagnosis." The session summary for the patient record states: "Agent identified condition A as the diagnosis." The uncertainty, the alternative diagnoses, and the recommendation for further testing are all lost. A clinician reading the summary proceeds with treatment for condition A without further testing. The patient actually has condition C, and the inappropriate treatment causes adverse effects requiring 3 weeks of hospital care costing £28,000.

What went wrong: The summarisation converted a probabilistic assessment with explicit uncertainty into a deterministic diagnosis. The recommendation for further testing — a safety-critical output — was omitted. The summary was factually consistent with the conversation (condition A was indeed identified as most likely) but was materially misleading because it removed the context that made the identification provisional.

Scenario C — Declined Consent Omitted From Summary: A customer-facing agent offers a customer enrolment in a marketing programme. The customer declines: "No, I do not want to receive marketing communications. Please do not sign me up." The session summary for the CRM system states: "Discussed marketing programme with customer." The explicit refusal is not recorded. A subsequent automated process, reading the summary as indicating interest, enrols the customer in the marketing programme. The customer files a GDPR complaint. The organisation faces a data protection investigation and a potential fine of up to 4% of annual turnover.

What went wrong: The summarisation treated the customer's refusal as a topic rather than as a binding decision. The summary preserved the subject matter (marketing programme discussed) but lost the outcome (customer declined). Downstream processes interpreted the omission as absence of refusal, defaulting to enrolment.

4. Requirement Statement

Scope: This dimension applies to any AI agent deployment where session or conversation content is summarised for any purpose. This includes: context window management (summarising earlier turns to fit within token limits), session handoff preparation (summarising context for a receiving human or agent), audit log generation (summarising interactions for compliance records), knowledge extraction (summarising conversations to update knowledge bases), and any other process that produces a compressed representation of interaction content. The dimension applies regardless of whether summarisation is performed by the agent itself, by a separate summarisation model, or by rule-based extraction. The test is: is any downstream process or person acting on a summarised version of an interaction rather than the full original? If yes, this dimension applies.

4.1. A conforming system MUST define and document a set of material content categories that summarisation must preserve, including at minimum: decisions made, commitments given, constraints stated, uncertainties expressed, refusals or declinations, and regulatory or safety flags raised.

4.2. A conforming system MUST verify that summaries preserve material content with sufficient specificity for downstream consumers to act correctly, including numerical values, conditions, deadlines, and uncertainty qualifications where present in the original.

4.3. A conforming system MUST retain the original unsummarised content alongside any summary, with the ability to retrieve the original when the summary's fidelity is questioned.
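Requirement 4.3 can be sketched as a store that keeps the full transcript and its summary under a shared session ID, so the original is retrievable whenever a summary's fidelity is questioned. This is an illustrative sketch only: the class and method names are hypothetical, and an in-memory dict stands in for a durable, access-controlled store.

```python
class SummaryStore:
    """Hypothetical store linking each summary to its unsummarised original."""

    def __init__(self):
        self._records = {}  # session_id -> {"original": [...], "summary": str}

    def save(self, session_id: str, original: list[str], summary: str) -> None:
        # Persist the full transcript alongside the summary (requirement 4.3).
        self._records[session_id] = {"original": original, "summary": summary}

    def summary(self, session_id: str) -> str:
        return self._records[session_id]["summary"]

    def original(self, session_id: str) -> list[str]:
        """Retrieve the full unsummarised transcript for fidelity verification."""
        return self._records[session_id]["original"]

store = SummaryStore()
store.save("S-42", ["turn 1: ...", "turn 2: ..."], "Short summary of session.")
print(store.original("S-42")[0])
```

A production implementation would add retention policies and audit logging around retrieval, but the linkage itself is the control the requirement mandates.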

4.4. A conforming system MUST ensure that summaries do not convert expressed uncertainties into false certainties — where the original content expressed probability, conditionality, or alternatives, the summary must preserve that uncertainty.

4.5. A conforming system MUST ensure that explicit refusals, declinations, or negative decisions are preserved in summaries with the same prominence as positive decisions.

4.6. A conforming system SHOULD implement automated fidelity checks that compare summaries against the original content for preservation of material content categories, flagging summaries where material content may have been lost.

4.7. A conforming system SHOULD use structured summary formats (e.g., separate sections for decisions, commitments, constraints, open questions) rather than unstructured narrative summaries, to reduce the risk of material content being lost in prose compression.

4.8. A conforming system SHOULD tag each element in a structured summary with a reference to the original content location (e.g., turn number, timestamp), enabling verification of specific claims against the original source.
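Requirements 4.7 and 4.8 together suggest a structured summary record with dedicated sections per material category, each element tagged with its originating turn. The field and class names below are hypothetical illustrations, not mandated by this protocol.

```python
from dataclasses import dataclass, field

@dataclass
class SummaryElement:
    text: str         # preserved content, with numerical values and conditions intact
    source_turn: int  # turn number in the original conversation (requirement 4.8)

@dataclass
class StructuredSummary:
    """Structured format per requirement 4.7: one section per material category."""
    decisions: list[SummaryElement] = field(default_factory=list)
    commitments: list[SummaryElement] = field(default_factory=list)
    constraints: list[SummaryElement] = field(default_factory=list)
    uncertainties: list[SummaryElement] = field(default_factory=list)
    refusals: list[SummaryElement] = field(default_factory=list)

summary = StructuredSummary(
    commitments=[SummaryElement(
        "Recommend Standard Growth Fund, max £75,000, conditional on "
        "risk questionnaire completion by 31 March 2026.",
        source_turn=12,
    )],
)
print(summary.commitments[0].source_turn)  # 12
```

Because each category has its own section, an empty `refusals` list is an explicit, auditable statement that no refusal occurred, rather than a silent omission in prose.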

4.9. A conforming system MAY implement multi-level summarisation with progressive compression: a detailed summary that preserves nearly all material content (e.g., 40% compression), a standard summary (70% compression), and a headline summary (90% compression), with clear labelling of the compression level and what categories of content each level preserves.
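The multi-level scheme in 4.9 can be sketched as metadata attached to every summary declaring its compression level and which material categories it preserves, so consumers know what may be absent. The level-to-category mapping below is illustrative; the percentages mirror the examples in 4.9.

```python
# Illustrative mapping of compression levels to preserved categories (4.9).
LEVELS = {
    "detailed": {"compression": 0.40,
                 "preserves": ["decisions", "commitments", "constraints",
                               "uncertainties", "refusals", "flags"]},
    "standard": {"compression": 0.70,
                 "preserves": ["decisions", "commitments", "refusals"]},
    "headline": {"compression": 0.90,
                 "preserves": ["decisions"]},
}

def label_summary(text: str, level: str) -> dict:
    """Attach compression level and preservation scope to a summary."""
    meta = LEVELS[level]
    return {"level": level,
            "compression": meta["compression"],
            "preserves": meta["preserves"],
            "text": text}

labelled = label_summary(
    "Agent recommended Standard Growth Fund (conditions apply).", "headline")
print(labelled["preserves"])  # consumer sees only decisions are guaranteed
```

The key design choice is that the label travels with the summary: a downstream process reading a headline-level summary can see that constraints and refusals are out of scope and must consult a more detailed level before acting.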

5. Rationale

Summarisation is ubiquitous in AI agent systems. Context windows are finite, handoffs require compact context transfers, audit logs must be manageable, and knowledge bases need distilled insights. The question is not whether summarisation will occur but whether it will be governed.

Ungoverned summarisation introduces a specific class of risk: silent information loss. Unlike truncation, where content is visibly removed, summarisation replaces the original with a representation that appears complete. The downstream consumer — whether human or agent — has no signal that material content is missing. The summary reads as a coherent account of the interaction, but it may have silently dropped conditions, uncertainties, refusals, or constraints that would change the downstream consumer's decisions.

Three characteristics make summarisation fidelity particularly important for governance. First, summaries are frequently the only record that downstream consumers see. A human advisor receiving a handoff typically reads the summary, not the full 40-turn conversation. A compliance auditor reviewing interactions typically reviews summaries, not transcripts. If the summary is inaccurate, the downstream consumer's understanding and decisions are based on inaccurate information. Second, summaries tend to systematically lose specific types of content. Numerical specifics are generalised ("about £75,000" becomes "an investment"), conditions are dropped ("conditional on completing the questionnaire" becomes "recommended"), and uncertainties are flattened ("approximately 60% probability" becomes "identified as the diagnosis"). This systematic loss is predictable and preventable with appropriate governance. Third, summarisation fidelity failures compound across the system. A summary that loses a constraint is used to generate another summary that loses more context, and eventually the downstream record bears little resemblance to the original interaction.

The fidelity requirement does not demand that summaries be verbatim transcripts — that would defeat the purpose of summarisation. It requires that material content categories are identified, preservation standards are defined, and fidelity is verified. A summary can be brief and still faithful; a summary can be detailed and still misleading. The governance framework ensures that brevity is achieved without sacrificing material content.

6. Implementation Guidance

Conversation Summarisation Fidelity Governance requires defining what must be preserved, implementing mechanisms to verify preservation, and monitoring fidelity in production. The core principle is that summarisation is a governed transformation with defined quality requirements, not an uncontrolled compression.

Recommended patterns:

Anti-patterns to avoid:

Industry Considerations

Financial Services. Summaries of financial advice interactions must preserve: specific product recommendations with conditions, risk warnings with their stated basis, suitability assessment outcomes with the assessed criteria, value amounts with precision, and any disclaimers or limitations stated. FCA expectations for record-keeping under COBS 9.5 (suitability records) require that the basis for advice can be reconstructed from records. A summary that loses the conditionality of a recommendation may not meet this standard.

Healthcare. Summaries of clinical interactions must preserve: differential diagnoses with stated probabilities, recommended tests with their clinical rationale, contraindications with their basis, patient-expressed preferences and refusals, and informed consent status. Loss of diagnostic uncertainty in a clinical summary can lead to premature treatment decisions with direct patient safety consequences.

Legal. Summaries of legal consultations must preserve: legal advice given with its stated limitations, client instructions with their conditions, conflict of interest disclosures, and privilege assertions. Loss of advice limitations in a summary could expose the firm to professional liability claims.

Maturity Model

Basic Implementation — The organisation has defined material content categories and documented preservation requirements. Summaries are generated using structured templates with dedicated sections for decisions, commitments, constraints, uncertainties, and refusals. Original content is retained alongside summaries with linkage. Fidelity is checked manually on a sample basis (e.g., 5% of summaries reviewed monthly). This level meets the minimum mandatory requirements but relies on sampling rather than systematic fidelity verification.

Intermediate Implementation — All basic capabilities plus: automated fidelity scoring checks every summary against the original for preservation of material content. Summaries scoring below the defined threshold are flagged for regeneration or human review. Extractive-then-abstractive pipelines ensure material content is identified before compression. Summary elements are tagged with references to original content locations. Multi-level summarisation provides different compression levels for different purposes. Fidelity metrics are tracked and reported.
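The extractive-then-abstractive pipeline with threshold-based flagging described above can be sketched as follows. The extraction markers, the scoring method (crude token survival), and the 0.8 threshold are all illustrative assumptions; a conforming implementation would use semantic comparison and a threshold calibrated per Test 8.5.

```python
import re

def extract_material(turns: list[str]) -> list[str]:
    """Extractive step: keep turns carrying amounts, years, conditions, or refusals."""
    markers = re.compile(r"£[\d,]+|\b\d{4}\b|conditional|do not|decline",
                         re.IGNORECASE)
    return [t for t in turns if markers.search(t)]

def fidelity_score(material: list[str], summary: str) -> float:
    """Crude score: fraction of material turns whose amount/year tokens survive."""
    if not material:
        return 1.0
    kept = sum(
        1 for m in material
        if any(tok in summary for tok in re.findall(r"£[\d,]+|\b\d{4}\b", m))
    )
    return kept / len(material)

THRESHOLD = 0.8  # illustrative; calibrate against human fidelity judgements

turns = [
    "I recommend the fund, max £75,000, conditional on the questionnaire.",
    "The deadline for the questionnaire is 31 March 2026.",
]
summary = "Agent recommended the fund."  # abstractive step output (lossy)
material = extract_material(turns)
if fidelity_score(material, summary) < THRESHOLD:
    print("flagged: regenerate summary or route for human review")
```

Identifying material content before compression, rather than after, is what distinguishes this pattern: the extractive step produces an explicit checklist that the scoring step can verify the abstractive output against.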

Advanced Implementation — All intermediate capabilities plus: fidelity verification uses independent models to assess whether a downstream consumer reading only the summary would make the same decisions as one reading the original. A/B testing compares downstream decision quality between summary consumers and full-content consumers to calibrate compression levels. The organisation can demonstrate to regulators that summarisation preserves material content at defined fidelity levels with statistical evidence. Real-time fidelity dashboards track preservation rates across all summarisation processes.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Commitment Preservation Fidelity

Test 8.2: Uncertainty Preservation

Test 8.3: Refusal and Declination Preservation

Test 8.4: Downstream Decision Equivalence

Test 8.5: Fidelity Score Calibration

Test 8.6: Original Content Retrievability

Conformance Scoring

9. Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 12 (Record-Keeping) | Direct requirement
EU AI Act | Article 9 (Risk Management System) | Supports compliance
SOX | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC | 6.1.1R (Systems and Controls) | Supports compliance
FCA COBS | 9.5 (Suitability Records) | Direct requirement
NIST AI RMF | MANAGE 2.2, MAP 3.2 | Supports compliance
ISO 42001 | Clause 8.1 (Operational Planning and Control) | Supports compliance

EU AI Act — Article 12 (Record-Keeping)

Article 12 requires that high-risk AI systems are designed and developed with logging capabilities that enable the recording of events relevant to identifying risk situations and post-market monitoring. Summaries of AI agent interactions are a primary mechanism for record-keeping. If those summaries lose material content, the record-keeping requirement is not met — the organisation cannot reconstruct what occurred, what risks were identified, or what decisions were made. AG-364 ensures that summarised records preserve the information that Article 12 requires to be logged.

FCA COBS — 9.5 (Suitability Records)

COBS 9.5 requires firms providing personal recommendations to retain records sufficient to demonstrate the basis for the recommendation, the client's personal circumstances, the firm's assessment of suitability, and any risks disclosed. For AI agents providing financial advice, the interaction summary is often the primary suitability record. If the summary loses the conditionality of a recommendation (as in Scenario A), the suitability basis cannot be reconstructed. AG-364's preservation requirements for commitments, conditions, and constraints directly support COBS 9.5 compliance.

SOX — Section 404 (Internal Controls Over Financial Reporting)

For financial agents, summarised records of transactions and decisions form part of the audit trail for internal controls. SOX auditors need to verify that controls operated as designed. If transaction summaries lose the conditions under which decisions were made or the constraints that applied, the audit trail is incomplete. Faithful summarisation preserves the auditability of AI-driven financial processes.

10. Failure Severity

Field | Value
Severity Rating | Medium-High
Blast Radius | Downstream-dependent — affects every process, person, or system that consumes the unfaithful summary, which may span multiple departments and timeframes

Consequence chain: A summary loses material content — a condition, a refusal, a numerical limit, or an uncertainty qualification. A downstream consumer (human advisor, audit system, agent, or automated workflow) acts on the summary without awareness that material content is missing. The immediate technical failure is an inaccurate record that appears complete. The operational impact depends on what was lost and who consumed the summary: a financial advisor acting on a summary without conditions processes an unbounded transaction (£45,000 remediation in Scenario A); a clinician acting on a summary without uncertainty initiates inappropriate treatment (£28,000 adverse outcome in Scenario B); an automated system acting on a summary without a refusal enrols a customer against their explicit wishes (GDPR investigation in Scenario C). The business consequence includes regulatory enforcement for inadequate record-keeping, customer remediation costs, clinical or safety incidents, data protection investigations, and loss of audit trail integrity. The failure is particularly insidious because the summary appears complete — there is no visible signal that material content was lost, and the downstream consumer has no reason to consult the original unless they are already suspicious.

Cross-references: AG-005 (Instruction Integrity Verification), AG-095 (Prompt Integrity Governance), AG-122 (Prompt Versioning & Rollback Control), AG-125 (Prompt Drift Detection), AG-361 (Context Truncation Risk Governance), AG-363 (Session Handoff Integrity Governance).

Cite this protocol
AgentGoverning. (2026). AG-364: Conversation Summarisation Fidelity Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-364