AG-417

Telemetry Sampling Bias Governance

Logging, Observability & Forensics · AGS v2.1 · April 2026
EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Telemetry Sampling Bias Governance requires that organisations operating AI agents ensure their telemetry sampling strategies do not systematically exclude, underrepresent, or distort the visibility of important failure modes, edge-case behaviours, or safety-critical events. When telemetry pipelines sample a fraction of total events — as is standard practice for high-throughput agent deployments — the sampling algorithm, rate, and stratification criteria must be governed to guarantee that critical events are never dropped, that rare but consequential failure classes are statistically represented, and that the resulting telemetry dataset accurately reflects the true operational distribution of agent behaviour. Without this governance, organisations build observability systems that provide confident but misleading pictures of agent health — systems that systematically under-observe the very events that matter most.

3. Example

Scenario A — Uniform Sampling Drops Rare Safety Violations: A customer-facing insurance agent processes 4.2 million interactions per month. The observability pipeline applies a uniform 1% sampling rate to reduce storage costs, retaining approximately 42,000 sampled interactions per month. The agent has a safety violation rate of 0.03% — roughly 1,260 violations per month across 4.2 million interactions. Under uniform 1% sampling, the expected number of captured violations is approximately 12.6 per month. In practice, statistical variance means some months capture 6 violations, others capture 19. The operations team sees between 6 and 19 violations per month in their dashboards and concludes the violation rate is stable and low. In reality, a model update introduced a new failure mode that increased violations from 1,260 to 3,780 per month — a threefold increase. Under 1% sampling, this increase manifests as a jump from approximately 12 to approximately 38 captured violations per month. The absolute numbers are so small that the signal is indistinguishable from normal variance for three months. During those three months, 9,450 customers receive responses violating the safety constraints, resulting in 23 formal complaints, 4 regulatory referrals, and eventual remediation costs of £612,000.

What went wrong: Uniform sampling at 1% reduced a statistically significant threefold increase in safety violations to a signal that was lost in noise. The sampling strategy treated safety violations identically to routine successful interactions, despite their radically different consequence profile. No stratified or priority-based sampling ensured that safety-critical event classes maintained statistical significance in the sampled dataset.
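The arithmetic behind Scenario A is easy to reproduce. The sketch below (Python, standard library only; all figures are the scenario's own, not measured data) shows why the monthly sampled counts are too small to distinguish the post-update regime from ordinary variance by eye:

```python
import math

INTERACTIONS = 4_200_000   # interactions per month
SAMPLE_RATE = 0.01         # uniform 1% sampling
BASELINE_RATE = 0.0003     # 0.03% safety-violation rate
POST_UPDATE_RATE = 0.0009  # threefold increase after the model update

for label, rate in [("baseline", BASELINE_RATE), ("post-update", POST_UPDATE_RATE)]:
    true_violations = INTERACTIONS * rate
    expected_sampled = true_violations * SAMPLE_RATE
    # Poisson/binomial approximation: the sampled count fluctuates by ~sqrt(mean)
    sigma = math.sqrt(expected_sampled)
    print(f"{label}: true={true_violations:.0f}/month, "
          f"sampled mean={expected_sampled:.1f} +/- {sigma:.1f}")

# baseline:    ~12.6 +/- 3.5 sampled violations per month
# post-update: ~37.8 +/- 6.1 sampled violations per month
# With counts this small, a dashboard showing numbers in the teens or twenties
# is easily read as ordinary month-to-month noise rather than a regime change,
# unless a formal change-point or significance test is applied.
```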

Scenario B — Head-of-Line Bias in Time-Based Sampling: An enterprise workflow agent orchestrating procurement approvals generates telemetry events throughout each transaction's lifecycle — initiation, validation, approval routing, human review, final execution, and post-execution audit. The telemetry pipeline samples events using a time-based strategy: it captures all events during the first 100 milliseconds of each second, then drops events for the remaining 900 milliseconds. This produces an effective 10% sample rate. However, the agent's architecture processes events sequentially within each second: routine acknowledgements and status updates fire in the first 50 milliseconds, while error handlers, timeout escalations, and fallback invocations fire in the 200-800 millisecond range after the primary processing path fails. The time-based sampling captures 98% of routine status events but only 3% of error and timeout events. The operations dashboard shows a 99.7% success rate when the true success rate is 94.1%. A procurement fraud pattern that consistently triggers timeout escalations (captured at only 3%) goes undetected for 5 months, enabling £2.3 million in fraudulent approvals.

What went wrong: The time-based sampling strategy was correlated with the agent's internal processing order. Error events were systematically under-sampled because they occurred later in the processing cycle than routine events. The sampling bias was invisible — the pipeline faithfully reported what it captured, but what it captured was not representative. No bias analysis validated that the sampling strategy was independent of event criticality.

Scenario C — Cardinality Collapse Hides Long-Tail Failures: A financial valuation agent operating across 340 currency pairs generates telemetry tagged with pair identifiers. The telemetry aggregation pipeline applies a cardinality limit of 50 unique tag values per metric to prevent storage explosion. The top 50 currency pairs by volume account for 97.3% of transactions. The remaining 290 pairs are collapsed into an "other" bucket. A systematic pricing error affects 14 of these low-volume pairs, producing incorrect valuations on every transaction. Because these 14 pairs are aggregated into the "other" bucket alongside 276 correctly functioning pairs, the error signal is diluted below alerting thresholds. The pricing error persists for 7 weeks, affecting 4,200 transactions with a cumulative valuation error of £890,000. Post-incident analysis reveals the telemetry system never surfaced the error because cardinality limits systematically hid long-tail behaviour.

What went wrong: Cardinality limits — a standard operational practice for managing telemetry costs — created a systematic bias against low-volume entities. The aggregation into "other" buckets destroyed the granularity needed to detect failures affecting individual long-tail members. No governance required that cardinality-limited metrics preserve the ability to detect anomalies in the collapsed population.

4. Requirement Statement

Scope: This dimension applies to every AI agent deployment that uses sampled telemetry — any pipeline where fewer than 100% of operational events are retained for analysis, alerting, and forensic investigation. This includes explicit sampling (e.g., 1% or 10% sample rates), implicit sampling through aggregation (cardinality limits, bucketing, rollups), time-based sampling (capturing events only during specific windows), and reservoir sampling (maintaining fixed-size buffers that discard events under load). The scope covers the full telemetry chain from event generation at the agent runtime through collection, sampling, aggregation, storage, and presentation in dashboards and alerting systems. Organisations that retain 100% of telemetry events are not exempt — they must still validate that aggregation and presentation layers do not introduce bias. The fundamental question is: does the telemetry that reaches human operators and automated alerting systems accurately represent the true distribution of agent behaviour, including rare but critical failure modes?

4.1. A conforming system MUST classify all telemetry event types into criticality tiers, with safety violations, compliance breaches, financial errors, and events classified as critical under AG-409 assigned to the highest tier, and MUST apply tier-aware sampling that guarantees 100% capture of the highest-criticality tier regardless of overall sampling rate.
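A minimal sketch of tier-aware sampling, assuming a simple per-event tier field and hypothetical tier rates (the tier numbering and rates are illustrative, not prescribed by this protocol):

```python
import random

# Hypothetical tiers: tier 0 corresponds to AG-409 critical events.
TIER_SAMPLE_RATES = {
    0: 1.0,    # safety violations, compliance breaches, financial errors: never dropped
    1: 0.25,   # degraded-mode, retry, and fallback events
    2: 0.01,   # routine successful interactions
}

def should_retain(event: dict) -> bool:
    """Tier-aware sampling decision: tier 0 is kept deterministically,
    lower tiers are sampled at their configured rates."""
    # Unclassified events fail safe to the highest-criticality tier.
    tier = event.get("criticality_tier", 0)
    rate = TIER_SAMPLE_RATES.get(tier, 1.0)
    return rate >= 1.0 or random.random() < rate

# A critical event survives regardless of the overall sampling budget.
assert should_retain({"criticality_tier": 0, "type": "safety_violation"})
```

Defaulting unclassified events to the highest tier is a deliberate fail-safe: an event that has not yet been mapped to the AG-409 taxonomy should be retained rather than silently sampled away.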

4.2. A conforming system MUST validate that the active sampling strategy does not exhibit systematic correlation with event criticality, event timing, event source, entity cardinality, or error characteristics, through documented bias analysis performed at initial deployment and after every change to the sampling configuration or the agent's operational profile.
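One way to perform the bias analysis is to take a full-capture window (see 4.5) and compare the capture rate across values of each dimension the requirement names. A minimal sketch, assuming events carry an `id` plus the dimension of interest (field names are illustrative):

```python
from collections import Counter

def capture_rate_by_dimension(full_events, sampled_ids, dimension):
    """Capture rate per value of one event dimension (e.g. 'criticality_tier',
    'source', 'processing_phase'). An unbiased strategy shows roughly the same
    rate for every value; large gaps indicate correlation between the sampling
    mechanism and that dimension."""
    totals, captured = Counter(), Counter()
    for ev in full_events:
        key = ev[dimension]
        totals[key] += 1
        if ev["id"] in sampled_ids:
            captured[key] += 1
    return {k: captured[k] / totals[k] for k in totals}

# In Scenario B, this check over 'processing_phase' would show roughly 0.98 for
# routine status events and roughly 0.03 for timeout and error events, exposing
# the time-correlated bias despite the nominal 10% sampling rate.
```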

4.3. A conforming system MUST ensure that when cardinality limits, bucketing, or aggregation are applied to telemetry metrics, anomaly detection capability is preserved for the aggregated ("other" or "overflow") population, either through separate monitoring of the aggregated bucket or through periodic full-cardinality sampling windows.
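A sketch of the first option, monitoring the collapsed population itself, assuming per-entity counts are available during periodic full-cardinality windows (the threshold values are illustrative):

```python
from collections import defaultdict

class OverflowBucketMonitor:
    """Tracks per-entity error rates for entities that are normally collapsed
    into the 'other' bucket, using counts from full-cardinality windows."""

    def __init__(self, error_rate_threshold: float = 0.05, min_events: int = 20):
        self.threshold = error_rate_threshold
        self.min_events = min_events
        self.totals = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, entity_id: str, is_error: bool) -> None:
        self.totals[entity_id] += 1
        if is_error:
            self.errors[entity_id] += 1

    def anomalous_entities(self) -> list:
        """Entities whose individual error rate exceeds the threshold even
        though their signal would be diluted inside the aggregated bucket."""
        return [
            e for e in self.totals
            if self.totals[e] >= self.min_events
            and self.errors[e] / self.totals[e] > self.threshold
        ]
```

In Scenario C, each of the 14 mispriced currency pairs would exceed the threshold individually, even though their combined contribution to the 290-pair "other" bucket stayed below the aggregate alerting level.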

4.4. A conforming system MUST define minimum statistical significance thresholds for each critical event class, specifying the minimum number of sampled events per reporting period required to detect a defined magnitude of change (e.g., a twofold increase) with a specified confidence level (e.g., 95%), and MUST configure sampling rates to meet these thresholds.
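A rough way to derive the threshold is a normal approximation to the Poisson-distributed sampled count: require that the post-change mean, minus its own noise, clears the alert line set above the baseline mean. A minimal sketch under those simplifying assumptions (one-sided 95% confidence and roughly 95% power):

```python
import math

def min_sampled_count(fold_increase: float,
                      alpha_z: float = 1.645,
                      power_z: float = 1.645) -> int:
    """Smallest expected per-period sampled count n (under normal operation)
    such that a fold_increase in the true rate is detectable, using the
    condition  k*n - power_z*sqrt(k*n) >= n + alpha_z*sqrt(n)."""
    k = fold_increase
    n = 1
    while (k * n - power_z * math.sqrt(k * n)) < (n + alpha_z * math.sqrt(n)):
        n += 1
    return n

# A twofold increase needs roughly 16 expected sampled events per period under
# these assumptions; the required sampling rate for a class is then that number
# divided by the class's true per-period event count.
print(min_sampled_count(2.0))
```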

4.5. A conforming system MUST implement sampling coverage validation that periodically compares the distribution of sampled events against a full-capture baseline (obtained through periodic 100% capture windows or statistical estimation) to detect sampling drift or emergent bias.
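A minimal sketch of the comparison step, assuming per-class counts from the baseline window and from the sampled stream over the same period (the tolerance value is illustrative):

```python
def coverage_drift(baseline_counts: dict, sampled_counts: dict,
                   tolerance: float = 0.25) -> dict:
    """Flag event classes whose share of the sampled stream deviates from their
    share of the full-capture baseline by more than `tolerance` (relative).
    Such deviation is the signature of sampling drift or emergent bias."""
    baseline_total = sum(baseline_counts.values()) or 1
    sampled_total = sum(sampled_counts.values()) or 1
    drifted = {}
    for cls, base_n in baseline_counts.items():
        base_share = base_n / baseline_total
        samp_share = sampled_counts.get(cls, 0) / sampled_total
        if base_share > 0 and abs(samp_share - base_share) / base_share > tolerance:
            drifted[cls] = {"baseline_share": base_share, "sampled_share": samp_share}
    return drifted
```

More formal tests (chi-square goodness of fit, total variation distance with a calibrated threshold) fit the same slot; the essential point is that the comparison runs automatically, not only when someone suspects a problem.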

4.6. A conforming system MUST retain unsampled metadata for dropped events — at minimum, event type, criticality tier, timestamp, and source identifier — to enable post-hoc reconstruction of the true event distribution even when full event payloads are not retained.
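A minimal sketch of the drop path, assuming a newline-delimited JSON sink for the metadata stream (field names are illustrative; each record is a few dozen bytes against payloads that may run to kilobytes):

```python
import json
import time

def drop_with_metadata(event: dict, metadata_sink) -> None:
    """Called when the sampler decides to drop an event: the payload is
    discarded, but a compact record survives so the true event distribution
    can be reconstructed after the fact."""
    metadata_sink.write(json.dumps({
        "event_type": event.get("type"),
        "criticality_tier": event.get("criticality_tier"),
        "ts": event.get("timestamp", time.time()),
        "source": event.get("source_id"),
        "dropped": True,
    }) + "\n")
```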

4.7. A conforming system SHOULD implement adaptive sampling that increases capture rates for event classes whose sampled counts fall below the minimum statistical significance threshold, ensuring that rare event classes are oversampled relative to their natural frequency.
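A minimal sketch of the adjustment loop, assuming per-class true counts are reconstructed from the dropped-event metadata of 4.6 and the target comes from the thresholds of 4.4 (all constants are illustrative):

```python
def adjust_rates(current_rates: dict, sampled_counts: dict, true_counts: dict,
                 target_min: int = 16, max_rate: float = 1.0) -> dict:
    """Raise a class's sampling rate when its sampled count fell below the
    significance target last period; relax it gently when the class is
    comfortably over-sampled."""
    new_rates = {}
    for cls, rate in current_rates.items():
        sampled = sampled_counts.get(cls, 0)
        true_n = max(true_counts.get(cls, 0), 1)
        needed_rate = min(max_rate, target_min / true_n)
        if sampled < target_min:
            new_rates[cls] = needed_rate
        elif sampled > 4 * target_min:
            new_rates[cls] = max(rate * 0.5, needed_rate)
        else:
            new_rates[cls] = rate
    return new_rates
```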

4.8. A conforming system SHOULD apply stratified sampling across agent instances, geographic regions, user segments, and operational contexts, ensuring that no stratum is systematically under-represented in the sampled dataset.
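A minimal sketch of budget allocation across strata, assuming known per-stratum event volumes and a fixed per-period sampling budget (the floor value is illustrative):

```python
def allocate_stratum_budget(stratum_volumes: dict, total_budget: int,
                            floor: int = 200) -> dict:
    """Give every stratum (region, segment, agent instance) at least `floor`
    sampled events where its volume allows, then spread the remaining budget
    proportionally to volume."""
    budgets = {s: min(v, floor) for s, v in stratum_volumes.items()}
    remaining = total_budget - sum(budgets.values())
    residual = sum(max(v - budgets[s], 0) for s, v in stratum_volumes.items())
    if remaining > 0 and residual > 0:
        for s, v in stratum_volumes.items():
            extra = max(v - budgets[s], 0)
            budgets[s] += int(remaining * extra / residual)
    return budgets
```

The floor is what prevents a low-volume region or segment from being represented by a handful of events per period, which is the stratified analogue of the rare-event problem addressed in 4.4.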

4.9. A conforming system SHOULD implement real-time sampling bias indicators on operational dashboards, showing the current effective sampling rate for each criticality tier and flagging strata where sampled counts are below statistical significance thresholds.

4.10. A conforming system MAY implement shadow telemetry pipelines that process a parallel full-capture stream for critical event classes, independent of the primary sampled pipeline, providing a secondary detection path for events that sampling might miss.

5. Rationale

Telemetry sampling is an economic necessity for high-throughput AI agent deployments. An agent processing millions of interactions per month generates terabytes of telemetry data. Retaining, indexing, and querying 100% of this data is prohibitively expensive for most organisations — storage costs, query latency, and pipeline throughput all degrade as volume increases. Sampling is the standard engineering solution: capture a representative fraction, discard the rest, and make operational decisions based on the sample. This approach works well for high-frequency, low-consequence events. It fails catastrophically for low-frequency, high-consequence events.

The fundamental problem is that the events organisations most need to observe are precisely the events that sampling is most likely to miss. Safety violations, compliance breaches, and novel failure modes are rare by definition — if they were frequent, they would have been caught and fixed. A 1% sampling rate applied uniformly means that a failure occurring once per 10,000 interactions produces, on average, one sampled instance per million interactions. At 4 million interactions per month, that is approximately 4 captured instances per month — a number so small that statistical noise dominates any trend signal. The failure could triple in frequency and the change would be invisible in the sampled data.

This is not a theoretical concern. The observability engineering community has documented the "streetlight effect" in production telemetry: organisations observe what their sampling captures and remain blind to what it misses. Dashboards show green because the sampled data is green — but the sampled data is not representative. The bias is systematic and invisible: the telemetry pipeline faithfully reports what it captures, and what it captures is biased toward common, successful, uneventful interactions.

Three specific bias mechanisms require governance. First, frequency bias: uniform sampling inherently over-represents frequent events and under-represents rare events. For a uniform 1% sample, an event occurring 1 million times per day contributes approximately 10,000 samples — a highly reliable signal. An event occurring 100 times per day contributes approximately 1 sample — useless for trend detection. Second, temporal bias: time-based sampling strategies can correlate with the agent's internal processing order, systematically capturing events that occur at specific points in the processing cycle while missing events at other points. Error handling, timeout processing, and fallback logic often execute at different times than the primary processing path. Third, cardinality bias: aggregation strategies that collapse low-volume entities into "other" buckets systematically hide failures affecting long-tail entities. In a system monitoring 340 currency pairs where only 50 are tracked individually, failures affecting the other 290 are invisible.

Regulatory frameworks increasingly require demonstrable observability of AI system behaviour. The EU AI Act Article 12 mandates logging capabilities that enable monitoring throughout the system's lifetime. If the logging system is biased — if it systematically misses certain failure classes — it does not satisfy this requirement regardless of its technical sophistication. Similarly, SOX Section 404 requires that internal controls be effective. A monitoring control that is blind to a class of financial errors due to sampling bias is not an effective control. The governance of sampling bias is therefore not merely an operational optimisation concern — it is a compliance requirement wherever regulators mandate demonstrable monitoring of AI system behaviour.

The interaction with AG-409 (Critical Event Taxonomy Governance) is direct: AG-409 defines which events are critical, and AG-417 ensures those critical events are not lost to sampling. Without AG-409, there is no classification to drive tier-aware sampling. Without AG-417, the classification exists but the sampling pipeline ignores it. The interaction with AG-022 (Behavioural Drift Detection) is equally direct: drift detection algorithms operate on sampled telemetry. If the sampling is biased, the drift detection is biased — it will detect drift in well-sampled event classes but miss drift in under-sampled classes.

6. Implementation Guidance

Telemetry Sampling Bias Governance requires organisations to treat their sampling strategy as a governed artefact — not merely an infrastructure configuration parameter. The sampling rate, stratification criteria, cardinality limits, and aggregation rules collectively determine what the organisation can and cannot observe. Changes to these parameters change the organisation's observability posture and must be assessed for their impact on critical event visibility.
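One way to make the sampling strategy a governed artefact is to hold it in version control as a reviewed configuration object rather than as collector settings edited in place. A hypothetical sketch (field names and values are illustrative, not prescribed by this protocol):

```python
# Versioned, reviewed sampling policy; changes go through the same approval
# path as any other controlled governance artefact.
SAMPLING_POLICY = {
    "version": "2026-04-01",
    "approved_by": "observability-governance-board",
    "bias_analysis_ref": "BIA-2026-012",                 # evidence for 4.2
    "tier_rates": {0: 1.0, 1: 0.25, 2: 0.01},            # tier-aware rates (4.1)
    "cardinality_limit": 50,
    "overflow_bucket_monitoring": True,                   # 4.3
    "full_capture_window": {"every_days": 30, "duration_hours": 24},  # 4.5
    "min_sampled_per_class": 16,                           # significance threshold (4.4)
}
```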

Recommended patterns:

- Tier-aware sampling with guaranteed 100% capture of events classified as critical under AG-409 (4.1).
- Periodic full-capture baseline windows used to validate the sampled distribution and detect drift (4.5).
- Adaptive oversampling of event classes whose sampled counts fall below the statistical significance thresholds (4.4, 4.7).
- Separate anomaly monitoring of cardinality overflow ("other") buckets (4.3).
- Retention of compact metadata for every dropped event so the true distribution can be reconstructed (4.6).

Anti-patterns to avoid:

- A single uniform sample rate applied to all event types regardless of criticality (Scenario A).
- Sampling windows correlated with the agent's internal processing order, so that error and timeout paths are systematically missed (Scenario B).
- Collapsing long-tail entities into an unmonitored "other" bucket (Scenario C).
- Treating sampling rates, cardinality limits, and aggregation rules as infrastructure tuning parameters that can be changed without a bias impact assessment.

Industry Considerations

Financial Services. Financial agents processing transactions, valuations, and compliance checks generate telemetry where even rare errors have direct monetary impact. A pricing error affecting 0.01% of transactions in a high-volume trading agent can accumulate millions in losses. Sampling strategies must guarantee statistical significance for every financial error category. FCA expectations for trade surveillance and transaction monitoring require that monitoring systems can detect market abuse patterns — patterns that may affect a small fraction of total transactions.

Healthcare and Life Sciences. Clinical decision support agents must capture every adverse event indicator at 100%. Sampling strategies that might miss a drug interaction warning or misdiagnosis signal are unacceptable. Regulatory frameworks for medical devices require complete event capture for safety-critical categories.

Crypto and Web3. Decentralised agent deployments often operate across multiple chains and protocols with highly variable transaction volumes. Sampling must account for the extreme variance in event frequency across chains and the possibility of targeted manipulation on low-volume chains where sampled coverage is weakest.

Safety-Critical and CPS. Embodied agents and cyber-physical systems require deterministic telemetry capture for all safety-relevant events. Sampling is generally inappropriate for safety-critical event classes — these must be captured at 100% with guaranteed delivery to the monitoring system.

Maturity Model

Basic Implementation — The organisation has classified telemetry events into criticality tiers aligned with AG-409. Critical events are captured at 100% regardless of overall sampling rate. A documented bias analysis exists for the current sampling configuration. Minimum statistical significance thresholds are defined for each critical event class. Unsampled event metadata is retained. This level meets the mandatory requirements and prevents the most severe sampling bias failures.

Intermediate Implementation — All basic capabilities plus: periodic full-capture baseline windows validate the sampled distribution against reality. Cardinality-aware anomaly detection monitors collapsed "other" buckets for hidden failures. Sampling coverage validation runs automatically and flags drift between sampled and baseline distributions. Stratified sampling ensures representation across agent instances, regions, and user segments. Real-time dashboard indicators show effective sampling rates per criticality tier.

Advanced Implementation — All intermediate capabilities plus: adaptive sampling dynamically adjusts capture rates for event classes falling below statistical significance thresholds. Shadow telemetry pipelines provide independent full-capture monitoring for critical event classes. The organisation can demonstrate through testing that no known failure mode is systematically hidden by the sampling strategy. Sampling bias metrics are included in the regular governance reporting cycle and reviewed by the governance board.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Critical Event Guaranteed Capture

Test 8.2: Sampling Bias Independence Verification

Test 8.3: Statistical Significance Threshold Enforcement

Test 8.4: Cardinality Collapse Anomaly Detection

Test 8.5: Unsampled Metadata Retention and Reconstruction

Test 8.6: Sampling Coverage Validation Against Baseline

Test 8.7: Tier-Aware Sampling Configuration Enforcement

Conformance Scoring

9. Regulatory Mapping

Regulation   | Provision                                                | Relationship Type
EU AI Act    | Article 12 (Record-Keeping / Logging)                    | Direct requirement
EU AI Act    | Article 9 (Risk Management System)                       | Supports compliance
SOX          | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC     | 6.1.1R (Systems and Controls)                            | Direct requirement
NIST AI RMF  | MEASURE 2.1, MEASURE 2.6                                 | Supports compliance
ISO 42001    | Clause 9.1 (Monitoring, Measurement, Analysis)           | Direct requirement
DORA         | Article 10 (Detection)                                   | Direct requirement

EU AI Act — Article 12 (Record-Keeping / Logging)

Article 12 requires that high-risk AI systems include logging capabilities that enable monitoring of the system's operation throughout its lifetime. The logging must be adequate to enable post-market monitoring and to facilitate the identification of risks. Telemetry sampling that systematically misses certain failure classes directly undermines this requirement — the logging exists but is biased, creating an illusion of adequate monitoring while systematically under-observing critical events. Organisations must demonstrate that their sampling strategies preserve the ability to detect the full range of operational failures, not merely the frequent ones. A biased sampling configuration that renders safety violations statistically invisible is a non-conforming logging implementation regardless of the volume of data it does capture.

SOX — Section 404 (Internal Controls Over Financial Reporting)

Financial transaction monitoring controls that rely on sampled telemetry must demonstrate that the sampling provides adequate coverage of financial error categories. An auditor assessing internal controls over financial reporting will evaluate whether the monitoring controls can detect material misstatements. If the sampling strategy renders a class of financial errors statistically undetectable — because the error is rare and the sampling rate is too low — the monitoring control is ineffective for that error class, potentially constituting a significant deficiency or material weakness.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA requires firms to maintain adequate systems and controls for the management of their business. For AI agents operating in financial services, telemetry monitoring is a key control. The FCA has emphasised that monitoring must be effective in practice, not merely documented in policy. A sampling strategy that systematically under-observes certain failure modes — particularly those affecting consumer outcomes or market integrity — does not constitute adequate monitoring. Firms must demonstrate that their sampling provides statistically meaningful coverage of all risk-relevant event classes.

NIST AI RMF — MEASURE 2.1, MEASURE 2.6

MEASURE 2.1 addresses whether AI system performance is evaluated regularly, and MEASURE 2.6 addresses operational monitoring. Both functions depend on the quality of the data used for measurement and monitoring. Biased telemetry sampling produces biased performance evaluations — the organisation believes the system is performing well because its monitoring data, shaped by biased sampling, supports that conclusion. AG-417 ensures that the data feeding NIST AI RMF measurement functions accurately represents the system's true operational behaviour.

ISO 42001 — Clause 9.1 (Monitoring, Measurement, Analysis)

ISO 42001 Clause 9.1 requires organisations to determine what needs to be monitored and measured, and when monitoring and measuring shall be performed. AG-417 addresses a prerequisite for effective monitoring: that the data collected through monitoring is representative and unbiased. An organisation conforming to ISO 42001 that relies on biased telemetry has a gap between its monitoring intent (what it believes it is monitoring) and its monitoring reality (what the biased sample actually observes).

DORA — Article 10 (Detection)

DORA Article 10 requires financial entities to have mechanisms to promptly detect anomalous activities. Telemetry sampling bias directly impacts detection capability. If the sampling strategy systematically under-samples the event classes where anomalies are most likely to manifest — rare errors, timeout patterns, fallback invocations — the detection mechanism fails. Organisations must demonstrate that their detection capabilities extend to all relevant event classes, including those that are rare in normal operation.

10. Failure Severity

Field           | Value
Severity Rating | High
Blast Radius    | Organisation-wide — affects the validity of all observability, alerting, and monitoring systems that consume sampled telemetry, potentially masking failures across every deployed agent

Consequence chain: Biased telemetry sampling creates a systematic blind spot in the organisation's observability posture. The immediate failure is statistical: critical event classes are under-represented in the sampled dataset, meaning dashboards, alerts, and reports based on sampled data are misleading. The operational consequence is delayed or missed detection of safety violations, compliance breaches, financial errors, and novel failure modes. Operators see green dashboards and conclude that agents are healthy, while unseen failures accumulate. The business consequence compounds over time: each day of undetected failure increases the remediation cost, the regulatory exposure, and the customer harm. When the failure is eventually discovered — typically through a customer complaint, an audit finding, or an incident severe enough to be noticed despite the sampling bias — the organisation faces both the direct cost of the accumulated failures and the meta-failure of demonstrating that its monitoring was inadequate. In regulated sectors, this meta-failure — the inability to demonstrate effective monitoring — is itself a compliance violation, triggering regulatory action independent of the underlying failures that went undetected. The interaction with AG-022 (Behavioural Drift Detection) amplifies the consequence: if drift detection operates on biased telemetry, it cannot detect drift in under-sampled event classes, meaning the organisation's drift detection and sampling bias create mutually reinforcing blind spots.

Cross-references: AG-409 (Critical Event Taxonomy Governance) provides the event classification that drives tier-aware sampling. AG-022 (Behavioural Drift Detection) consumes sampled telemetry and is directly affected by sampling bias. AG-410 (High-Cardinality Trace Retention Governance) addresses retention of high-cardinality trace data that AG-417 ensures is adequately sampled. AG-413 (Observer-of-Observer Integrity Governance) monitors the health of the observability pipeline itself, including sampling components. AG-414 (Alert Deduplication Governance) operates downstream of sampling — alerts can only fire on events that survive sampling. AG-418 (Cross-System Trace Correlation Governance) requires that correlated events across systems are consistently sampled to maintain trace completeness.

Cite this protocol
AgentGoverning. (2026). AG-417: Telemetry Sampling Bias Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-417