AG-417

Telemetry Sampling Bias Governance

Logging, Observability & Forensics · AGS v2.1 · April 2026
EU AI Act · SOX · FCA · NIST · ISO 42001

2. Summary

Telemetry Sampling Bias Governance requires that organisations operating AI agents ensure their telemetry sampling strategies do not systematically exclude, underrepresent, or distort the visibility of important failure modes, edge-case behaviours, or safety-critical events. When telemetry pipelines sample a fraction of total events — as is standard practice for high-throughput agent deployments — the sampling algorithm, rate, and stratification criteria must be governed to guarantee that critical events are never dropped, that rare but consequential failure classes are statistically represented, and that the resulting telemetry dataset accurately reflects the true operational distribution of agent behaviour. Without this governance, organisations build observability systems that provide confident but misleading pictures of agent health — systems that systematically under-observe the very events that matter most.

3. Example

Scenario A — Uniform Sampling Drops Rare Safety Violations: A customer-facing insurance agent processes 4.2 million interactions per month. The observability pipeline applies a uniform 1% sampling rate to reduce storage costs, retaining approximately 42,000 sampled interactions per month. The agent has a safety violation rate of 0.03% — roughly 1,260 violations per month across 4.2 million interactions. Under uniform 1% sampling, the expected number of captured violations is approximately 12.6 per month. In practice, statistical variance means some months capture 6 violations, others capture 19. The operations team sees between 6 and 19 violations per month in their dashboards and concludes the violation rate is stable and low. In reality, a model update introduced a new failure mode that increased violations from 1,260 to 3,780 per month — a threefold increase. Under 1% sampling, this increase manifests as a jump from approximately 12 to approximately 38 captured violations per month. The absolute numbers are so small that the signal is indistinguishable from normal variance for three months. During those three months, 9,450 customers receive responses violating the safety constraints, resulting in 23 formal complaints, 4 regulatory referrals, and eventual remediation costs of £612,000.

What went wrong: Uniform sampling at 1% reduced a statistically significant threefold increase in safety violations to a signal that was lost in noise. The sampling strategy treated safety violations identically to routine successful interactions, despite their radically different consequence profile. No stratified or priority-based sampling ensured that safety-critical event classes maintained statistical significance in the sampled dataset.
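The arithmetic behind Scenario A is easy to reproduce. The sketch below (Python, standard library only; all figures are the scenario's own, not measured data) shows why the monthly sampled counts are too small to distinguish the post-update regime from ordinary variance by eye:

```python
import math

INTERACTIONS = 4_200_000   # interactions per month
SAMPLE_RATE = 0.01         # uniform 1% sampling
BASELINE_RATE = 0.0003     # 0.03% safety-violation rate
POST_UPDATE_RATE = 0.0009  # threefold increase after the model update

for label, rate in [("baseline", BASELINE_RATE), ("post-update", POST_UPDATE_RATE)]:
    true_violations = INTERACTIONS * rate
    expected_sampled = true_violations * SAMPLE_RATE
    # Poisson/binomial approximation: the sampled count fluctuates by ~sqrt(mean)
    sigma = math.sqrt(expected_sampled)
    print(f"{label}: true={true_violations:.0f}/month, "
          f"sampled mean={expected_sampled:.1f} +/- {sigma:.1f}")

# baseline:    ~12.6 +/- 3.5 sampled violations per month
# post-update: ~37.8 +/- 6.1 sampled violations per month
# With counts this small, a dashboard showing numbers in the teens or twenties
# is easily read as ordinary month-to-month noise rather than a regime change,
# unless a formal change-point or significance test is applied.
```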

Scenario B — Head-of-Line Bias in Time-Based Sampling: An enterprise workflow agent orchestrating procurement approvals generates telemetry events throughout each transaction's lifecycle — initiation, validation, approval routing, human review, final execution, and post-execution audit. The telemetry pipeline samples events using a time-based strategy: it captures all events during the first 100 milliseconds of each second, then drops events for the remaining 900 milliseconds. This produces an effective 10% sample rate. However, the agent's architecture processes events sequentially within each second: routine acknowledgements and status updates fire in the first 50 milliseconds, while error handlers, timeout escalations, and fallback invocations fire in the 200-800 millisecond range after the primary processing path fails. The time-based sampling captures 98% of routine status events but only 3% of error and timeout events. The operations dashboard shows a 99.7% success rate when the true success rate is 94.1%. A procurement fraud pattern that consistently triggers timeout escalations (captured at only 3%) goes undetected for 5 months, enabling £2.3 million in fraudulent approvals.

What went wrong: The time-based sampling strategy was correlated with the agent's internal processing order. Error events were systematically under-sampled because they occurred later in the processing cycle than routine events. The sampling bias was invisible — the pipeline faithfully reported what it captured, but what it captured was not representative. No bias analysis validated that the sampling strategy was independent of event criticality.

Scenario C — Cardinality Collapse Hides Long-Tail Failures: A financial valuation agent operating across 340 currency pairs generates telemetry tagged with pair identifiers. The telemetry aggregation pipeline applies a cardinality limit of 50 unique tag values per metric to prevent storage explosion. The top 50 currency pairs by volume account for 97.3% of transactions. The remaining 290 pairs are collapsed into an "other" bucket. A systematic pricing error affects 14 of these low-volume pairs, producing incorrect valuations on every transaction. Because these 14 pairs are aggregated into the "other" bucket alongside 276 correctly functioning pairs, the error signal is diluted below alerting thresholds. The pricing error persists for 7 weeks, affecting 4,200 transactions with a cumulative valuation error of £890,000. Post-incident analysis reveals the telemetry system never surfaced the error because cardinality limits systematically hid long-tail behaviour.

What went wrong: Cardinality limits — a standard operational practice for managing telemetry costs — created a systematic bias against low-volume entities. The aggregation into "other" buckets destroyed the granularity needed to detect failures affecting individual long-tail members. No governance required that cardinality-limited metrics preserve the ability to detect anomalies in the collapsed population.

4. Requirement Statement

Scope: This dimension applies to every AI agent deployment that uses sampled telemetry — any pipeline where fewer than 100% of operational events are retained for analysis, alerting, and forensic investigation. This includes explicit sampling (e.g., 1% or 10% sample rates), implicit sampling through aggregation (cardinality limits, bucketing, rollups), time-based sampling (capturing events only during specific windows), and reservoir sampling (maintaining fixed-size buffers that discard events under load). The scope covers the full telemetry chain from event generation at the agent runtime through collection, sampling, aggregation, storage, and presentation in dashboards and alerting systems. Organisations that retain 100% of telemetry events are not exempt — they must still validate that aggregation and presentation layers do not introduce bias. The fundamental question is: does the telemetry that reaches human operators and automated alerting systems accurately represent the true distribution of agent behaviour, including rare but critical failure modes?

4.1. A conforming system MUST classify all telemetry event types into criticality tiers, with safety violations, compliance breaches, financial errors, and events classified as critical under AG-409 assigned to the highest tier, and MUST apply tier-aware sampling that guarantees 100% capture of the highest-criticality tier regardless of overall sampling rate.
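A minimal sketch of tier-aware sampling, assuming a simple per-event tier field and hypothetical tier rates (the tier numbering and rates are illustrative, not prescribed by this protocol):

```python
import random

# Hypothetical tiers: tier 0 corresponds to AG-409 critical events.
TIER_SAMPLE_RATES = {
    0: 1.0,    # safety violations, compliance breaches, financial errors: never dropped
    1: 0.25,   # degraded-mode, retry, and fallback events
    2: 0.01,   # routine successful interactions
}

def should_retain(event: dict) -> bool:
    """Tier-aware sampling decision: tier 0 is kept deterministically,
    lower tiers are sampled at their configured rates."""
    # Unclassified events fail safe to the highest-criticality tier.
    tier = event.get("criticality_tier", 0)
    rate = TIER_SAMPLE_RATES.get(tier, 1.0)
    return rate >= 1.0 or random.random() < rate

# A critical event survives regardless of the overall sampling budget.
assert should_retain({"criticality_tier": 0, "type": "safety_violation"})
```

Defaulting unclassified events to the highest tier is a deliberate fail-safe: an event that has not yet been mapped to the AG-409 taxonomy should be retained rather than silently sampled away.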

4.2. A conforming system MUST validate that the active sampling strategy does not exhibit systematic correlation with event criticality, event timing, event source, entity cardinality, or error characteristics, through documented bias analysis performed at initial deployment and after every change to the sampling configuration or the agent's operational profile.
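One way to perform the bias analysis is to take a full-capture window (see 4.5) and compare the capture rate across values of each dimension the requirement names. A minimal sketch, assuming events carry an `id` plus the dimension of interest (field names are illustrative):

```python
from collections import Counter

def capture_rate_by_dimension(full_events, sampled_ids, dimension):
    """Capture rate per value of one event dimension (e.g. 'criticality_tier',
    'source', 'processing_phase'). An unbiased strategy shows roughly the same
    rate for every value; large gaps indicate correlation between the sampling
    mechanism and that dimension."""
    totals, captured = Counter(), Counter()
    for ev in full_events:
        key = ev[dimension]
        totals[key] += 1
        if ev["id"] in sampled_ids:
            captured[key] += 1
    return {k: captured[k] / totals[k] for k in totals}

# In Scenario B, this check over 'processing_phase' would show roughly 0.98 for
# routine status events and roughly 0.03 for timeout and error events, exposing
# the time-correlated bias despite the nominal 10% sampling rate.
```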

4.3. A conforming system MUST ensure that when cardinality limits, bucketing, or aggregation are applied to telemetry metrics, anomaly detection capability is preserved for the aggregated ("other" or "overflow") population, either through separate monitoring of the aggregated bucket or through periodic full-cardinality sampling windows.
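A sketch of the first option, monitoring the collapsed population itself, assuming per-entity counts are available during periodic full-cardinality windows (the threshold values are illustrative):

```python
from collections import defaultdict

class OverflowBucketMonitor:
    """Tracks per-entity error rates for entities that are normally collapsed
    into the 'other' bucket, using counts from full-cardinality windows."""

    def __init__(self, error_rate_threshold: float = 0.05, min_events: int = 20):
        self.threshold = error_rate_threshold
        self.min_events = min_events
        self.totals = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, entity_id: str, is_error: bool) -> None:
        self.totals[entity_id] += 1
        if is_error:
            self.errors[entity_id] += 1

    def anomalous_entities(self) -> list:
        """Entities whose individual error rate exceeds the threshold even
        though their signal would be diluted inside the aggregated bucket."""
        return [
            e for e in self.totals
            if self.totals[e] >= self.min_events
            and self.errors[e] / self.totals[e] > self.threshold
        ]
```

In Scenario C, each of the 14 mispriced currency pairs would exceed the threshold individually, even though their combined contribution to the 290-pair "other" bucket stayed below the aggregate alerting level.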

4.4. A conforming system MUST define minimum statistical significance thresholds for each critical event class, specifying the minimum number of sampled events per reporting period required to detect a defined magnitude of change (e.g., a twofold increase) with a specified confidence level (e.g., 95%), and MUST configure sampling rates to meet these thresholds.
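A rough way to derive the threshold is a normal approximation to the Poisson-distributed sampled count: require that the post-change mean, minus its own noise, clears the alert line set above the baseline mean. A minimal sketch under those simplifying assumptions (one-sided 95% confidence and roughly 95% power):

```python
import math

def min_sampled_count(fold_increase: float,
                      alpha_z: float = 1.645,
                      power_z: float = 1.645) -> int:
    """Smallest expected per-period sampled count n (under normal operation)
    such that a fold_increase in the true rate is detectable, using the
    condition  k*n - power_z*sqrt(k*n) >= n + alpha_z*sqrt(n)."""
    k = fold_increase
    n = 1
    while (k * n - power_z * math.sqrt(k * n)) < (n + alpha_z * math.sqrt(n)):
        n += 1
    return n

# A twofold increase needs roughly 16 expected sampled events per period under
# these assumptions; the required sampling rate for a class is then that number
# divided by the class's true per-period event count.
print(min_sampled_count(2.0))
```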

4.5. A conforming system MUST implement sampling coverage validation that periodically compares the distribution of sampled events against a full-capture baseline (obtained through periodic 100% capture windows or statistical estimation) to detect sampling drift or emergent bias.
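A minimal sketch of the comparison step, assuming per-class counts from the baseline window and from the sampled stream over the same period (the tolerance value is illustrative):

```python
def coverage_drift(baseline_counts: dict, sampled_counts: dict,
                   tolerance: float = 0.25) -> dict:
    """Flag event classes whose share of the sampled stream deviates from their
    share of the full-capture baseline by more than `tolerance` (relative).
    Such deviation is the signature of sampling drift or emergent bias."""
    baseline_total = sum(baseline_counts.values()) or 1
    sampled_total = sum(sampled_counts.values()) or 1
    drifted = {}
    for cls, base_n in baseline_counts.items():
        base_share = base_n / baseline_total
        samp_share = sampled_counts.get(cls, 0) / sampled_total
        if base_share > 0 and abs(samp_share - base_share) / base_share > tolerance:
            drifted[cls] = {"baseline_share": base_share, "sampled_share": samp_share}
    return drifted
```

More formal tests (chi-square goodness of fit, total variation distance with a calibrated threshold) fit the same slot; the essential point is that the comparison runs automatically, not only when someone suspects a problem.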

4.6. A conforming system MUST retain unsampled metadata for dropped events — at minimum, event type, criticality tier, timestamp, and source identifier — to enable post-hoc reconstruction of the true event distribution even when full event payloads are not retained.
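A minimal sketch of the drop path, assuming a newline-delimited JSON sink for the metadata stream (field names are illustrative; each record is a few dozen bytes against payloads that may run to kilobytes):

```python
import json
import time

def drop_with_metadata(event: dict, metadata_sink) -> None:
    """Called when the sampler decides to drop an event: the payload is
    discarded, but a compact record survives so the true event distribution
    can be reconstructed after the fact."""
    metadata_sink.write(json.dumps({
        "event_type": event.get("type"),
        "criticality_tier": event.get("criticality_tier"),
        "ts": event.get("timestamp", time.time()),
        "source": event.get("source_id"),
        "dropped": True,
    }) + "\n")
```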

4.7. A conforming system SHOULD implement adaptive sampling that increases capture rates for event classes whose sampled counts fall below the minimum statistical significance threshold, ensuring that rare event classes are oversampled relative to their natural frequency.
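A minimal sketch of the adjustment loop, assuming per-class true counts are reconstructed from the dropped-event metadata of 4.6 and the target comes from the thresholds of 4.4 (all constants are illustrative):

```python
def adjust_rates(current_rates: dict, sampled_counts: dict, true_counts: dict,
                 target_min: int = 16, max_rate: float = 1.0) -> dict:
    """Raise a class's sampling rate when its sampled count fell below the
    significance target last period; relax it gently when the class is
    comfortably over-sampled."""
    new_rates = {}
    for cls, rate in current_rates.items():
        sampled = sampled_counts.get(cls, 0)
        true_n = max(true_counts.get(cls, 0), 1)
        needed_rate = min(max_rate, target_min / true_n)
        if sampled < target_min:
            new_rates[cls] = needed_rate
        elif sampled > 4 * target_min:
            new_rates[cls] = max(rate * 0.5, needed_rate)
        else:
            new_rates[cls] = rate
    return new_rates
```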

4.8. A conforming system SHOULD apply stratified sampling across agent instances, geographic regions, user segments, and operational contexts, ensuring that no stratum is systematically under-represented in the sampled dataset.
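A minimal sketch of budget allocation across strata, assuming known per-stratum event volumes and a fixed per-period sampling budget (the floor value is illustrative):

```python
def allocate_stratum_budget(stratum_volumes: dict, total_budget: int,
                            floor: int = 200) -> dict:
    """Give every stratum (region, segment, agent instance) at least `floor`
    sampled events where its volume allows, then spread the remaining budget
    proportionally to volume."""
    budgets = {s: min(v, floor) for s, v in stratum_volumes.items()}
    remaining = total_budget - sum(budgets.values())
    residual = sum(max(v - budgets[s], 0) for s, v in stratum_volumes.items())
    if remaining > 0 and residual > 0:
        for s, v in stratum_volumes.items():
            extra = max(v - budgets[s], 0)
            budgets[s] += int(remaining * extra / residual)
    return budgets
```

The floor is what prevents a low-volume region or segment from being represented by a handful of events per period, which is the stratified analogue of the rare-event problem addressed in 4.4.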

4.9. A conforming system SHOULD implement real-time sampling bias indicators on operational dashboards, showing the current effective sampling rate for each criticality tier and flagging strata where sampled counts are below statistical significance thresholds.

4.10. A conforming system MAY implement shadow telemetry pipelines that process a parallel full-capture stream for critical event classes, independent of the primary sampled pipeline, providing a secondary detection path for events that sampling might miss.

5. Rationale

Telemetry sampling is an economic necessity for high-throughput AI agent deployments. An agent processing millions of interactions per month generates terabytes of telemetry data. Retaining, indexing, and querying 100% of this data is prohibitively expensive for most organisations — storage costs, query latency, and pipeline throughput all degrade as volume increases. Sampling is the standard engineering solution: capture a representative fraction, discard the rest, and make operational decisions based on the sample. This approach works well for high-frequency, low-consequence events. It fails catastrophically for low-frequency, high-consequence events.

The fundamental problem is that the events organisations most need to observe are precisely the events that sampling is most likely to miss. Safety violations, compliance breaches, and novel failure modes are rare by definition — if they were frequent, they would have been caught and fixed. A 1% sampling rate applied uniformly means that a failure occurring once per 10,000 interactions produces, on average, one sampled instance per million interactions. At 4 million interactions per month, that is approximately 4 captured instances per month — a number so small that statistical noise dominates any trend signal. The failure could triple in frequency and the change would be invisible in the sampled data.

This is not a theoretical concern. The observability engineering community has documented the "streetlight effect" in production telemetry: organisations observe what their sampling captures and remain blind to what it misses. Dashboards show green because the sampled data is green — but the sampled data is not representative. The bias is systematic and invisible: the telemetry pipeline faithfully reports what it captures, and what it captures is biased toward common, successful, uneventful interactions.

Three specific bias mechanisms require governance. First, frequency bias: uniform sampling inherently over-represents frequent events and under-represents rare events. For a uniform 1% sample, an event occurring 1 million times per day contributes approximately 10,000 samples — a highly reliable signal. An event occurring 100 times per day contributes approximately 1 sample — useless for trend detection. Second, temporal bias: time-based sampling strategies can correlate with the agent's internal processing order, systematically capturing events that occur at specific points in the processing cycle while missing events at other points. Error handling, timeout processing, and fallback logic often execute at different times than the primary processing path. Third, cardinality bias: aggregation strategies that collapse low-volume entities into "other" buckets systematically hide failures affecting long-tail entities. In a system monitoring 340 currency pairs where only 50 are tracked individually, failures affecting the other 290 are invisible.

Regulatory frameworks increasingly require demonstrable observability of AI system behaviour. The EU AI Act Article 12 mandates logging capabilities that enable monitoring throughout the system's lifetime. If the logging system is biased — if it systematically misses certain failure classes — it does not satisfy this requirement regardless of its technical sophistication. Similarly, SOX Section 404 requires that internal controls be effective. A monitoring control that is blind to a class of financial errors due to sampling bias is not an effective control. The governance of sampling bias is therefore not merely an operational optimisation concern — it is a compliance requirement wherever regulators mandate demonstrable monitoring of AI system behaviour.

The interaction with AG-409 (Critical Event Taxonomy Governance) is direct: AG-409 defines which events are critical, and AG-417 ensures those critical events are not lost to sampling. Without AG-409, there is no classification to drive tier-aware sampling. Without AG-417, the classification exists but the sampling pipeline ignores it. The interaction with AG-022 (Behavioural Drift Detection) is equally direct: drift detection algorithms operate on sampled telemetry. If the sampling is biased, the drift detection is biased — it will detect drift in well-sampled event classes but miss drift in under-sampled classes.

6. Implementation Guidance

Telemetry Sampling Bias Governance requires organisations to treat their sampling strategy as a governed artefact — not merely an infrastructure configuration parameter. The sampling rate, stratification criteria, cardinality limits, and aggregation rules collectively determine what the organisation can and cannot observe. Changes to these parameters change the organisation's observability posture and must be assessed for their impact on critical event visibility.
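One way to make the sampling strategy a governed artefact is to hold it in version control as a reviewed configuration object rather than as collector settings edited in place. A hypothetical sketch (field names and values are illustrative, not prescribed by this protocol):

```python
# Versioned, reviewed sampling policy; changes go through the same approval
# path as any other controlled governance artefact.
SAMPLING_POLICY = {
    "version": "2026-04-01",
    "approved_by": "observability-governance-board",
    "bias_analysis_ref": "BIA-2026-012",                 # evidence for 4.2
    "tier_rates": {0: 1.0, 1: 0.25, 2: 0.01},            # tier-aware rates (4.1)
    "cardinality_limit": 50,
    "overflow_bucket_monitoring": True,                   # 4.3
    "full_capture_window": {"every_days": 30, "duration_hours": 24},  # 4.5
    "min_sampled_per_class": 16,                           # significance threshold (4.4)
}
```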

Recommended patterns:

- Tier-aware sampling with guaranteed 100% capture of events classified as critical under AG-409 (4.1).
- Periodic full-capture baseline windows used to validate the sampled distribution and detect drift (4.5).
- Adaptive oversampling of event classes whose sampled counts fall below the statistical significance thresholds (4.4, 4.7).
- Separate anomaly monitoring of cardinality overflow ("other") buckets (4.3).
- Retention of compact metadata for every dropped event so the true distribution can be reconstructed (4.6).

Anti-patterns to avoid:

- A single uniform sample rate applied to all event types regardless of criticality (Scenario A).
- Sampling windows correlated with the agent's internal processing order, so that error and timeout paths are systematically missed (Scenario B).
- Collapsing long-tail entities into an unmonitored "other" bucket (Scenario C).
- Treating sampling rates, cardinality limits, and aggregation rules as infrastructure tuning parameters that can be changed without a bias impact assessment.

Industry Considerations

Financial Services. Financial agents processing transactions, valuations, and compliance checks generate telemetry where even rare errors have direct monetary impact. A pricing error affecting 0.01% of transactions in a high-volume trading agent can accumulate millions in losses. Sampling strategies must guarantee statistical significance for every financial error category. FCA expectations for trade surveillance and transaction monitoring require that monitoring systems can detect market abuse patterns — patterns that may affect a small fraction of total transactions.

Healthcare and Life Sciences. Clinical decision support agents must capture every adverse event indicator at 100%. Sampling strategies that might miss a drug interaction warning or misdiagnosis signal are unacceptable. Regulatory frameworks for medical devices require complete event capture for safety-critical categories.

Crypto and Web3. Decentralised agent deployments often operate across multiple chains and protocols with highly variable transaction volumes. Sampling must account for the extreme variance in event frequency across chains and the possibility of targeted manipulation on low-volume chains where sampled coverage is weakest.

Safety-Critical and CPS. Embodied agents and cyber-physical systems require deterministic telemetry capture for all safety-relevant events. Sampling is generally inappropriate for safety-critical event classes — these must be captured at 100% with guaranteed delivery to the monitoring system.

Maturity Model

Basic Implementation — The organisation has classified telemetry events into criticality tiers aligned with AG-409. Critical events are captured at 100% regardless of overall sampling rate. A documented bias analysis exists for the current sampling configuration. Minimum statistical significance thresholds are defined for each critical event class. Unsampled event metadata is retained. This level meets the mandatory requirements and prevents the most severe sampling bias failures.

Intermediate Implementation — All basic capabilities plus: periodic full-capture baseline windows validate the sampled distribution against reality. Cardinality-aware anomaly detection monitors collapsed "other" buckets for hidden failures. Sampling coverage validation runs automatically and flags drift between sampled and baseline distributions. Stratified sampling ensures representation across agent instances, regions, and user segments. Real-time dashboard indicators show effective sampling rates per criticality tier.

Advanced Implementation — All intermediate capabilities plus: adaptive sampling dynamically adjusts capture rates for event classes falling below statistical significance thresholds. Shadow telemetry pipelines provide independent full-capture monitoring for critical event classes. The organisation can demonstrate through testing that no known failure mode is systematically hidden by the sampling strategy. Sampling bias metrics are included in the regular governance reporting cycle and reviewed by the governance board.

7. Evidence Requirements

Required artefacts:

Retention requirements:

Access requirements:

8. Test Specification

Test 8.1: Critical Event Guaranteed Capture

Test 8.2: Sampling Bias Independence Verification

Test 8.3: Statistical Significance Threshold Enforcement

Test 8.4: Cardinality Collapse Anomaly Detection

Test 8.5: Unsampled Metadata Retention and Reconstruction

Test 8.6: Sampling Coverage Validation Against Baseline

Test 8.7: Tier-Aware Sampling Configuration Enforcement

Conformance Scoring

9. Regulatory Mapping

Regulation   | Provision                                                | Relationship Type
EU AI Act    | Article 12 (Record-Keeping / Logging)                    | Direct requirement
EU AI Act    | Article 9 (Risk Management System)                       | Supports compliance
SOX          | Section 404 (Internal Controls Over Financial Reporting) | Supports compliance
FCA SYSC     | 6.1.1R (Systems and Controls)                            | Direct requirement
NIST AI RMF  | MEASURE 2.1, MEASURE 2.6                                 | Supports compliance
ISO 42001    | Clause 9.1 (Monitoring, Measurement, Analysis)           | Direct requirement
DORA         | Article 10 (Detection)                                   | Direct requirement

EU AI Act — Article 12 (Record-Keeping / Logging)

Article 12 requires that high-risk AI systems include logging capabilities that enable monitoring of the system's operation throughout its lifetime. The logging must be adequate to enable post-market monitoring and to facilitate the identification of risks. Telemetry sampling that systematically misses certain failure classes directly undermines this requirement — the logging exists but is biased, creating an illusion of adequate monitoring while systematically under-observing critical events. Organisations must demonstrate that their sampling strategies preserve the ability to detect the full range of operational failures, not merely the frequent ones. A biased sampling configuration that renders safety violations statistically invisible is a non-conforming logging implementation regardless of the volume of data it does capture.

SOX — Section 404 (Internal Controls Over Financial Reporting)

Financial transaction monitoring controls that rely on sampled telemetry must demonstrate that the sampling provides adequate coverage of financial error categories. An auditor assessing internal controls over financial reporting will evaluate whether the monitoring controls can detect material misstatements. If the sampling strategy renders a class of financial errors statistically undetectable — because the error is rare and the sampling rate is too low — the monitoring control is ineffective for that error class, potentially constituting a significant deficiency or material weakness.

FCA SYSC — 6.1.1R (Systems and Controls)

The FCA requires firms to maintain adequate systems and controls for the management of their business. For AI agents operating in financial services, telemetry monitoring is a key control. The FCA has emphasised that monitoring must be effective in practice, not merely documented in policy. A sampling strategy that systematically under-observes certain failure modes — particularly those affecting consumer outcomes or market integrity — does not constitute adequate monitoring. Firms must demonstrate that their sampling provides statistically meaningful coverage of all risk-relevant event classes.

NIST AI RMF — MEASURE 2.1, MEASURE 2.6

MEASURE 2.1 addresses whether AI system performance is evaluated regularly, and MEASURE 2.6 addresses operational monitoring. Both functions depend on the quality of the data used for measurement and monitoring. Biased telemetry sampling produces biased performance evaluations — the organisation believes the system is performing well because its monitoring data, shaped by biased sampling, supports that conclusion. AG-417 ensures that the data feeding NIST AI RMF measurement functions accurately represents the system's true operational behaviour.

ISO 42001 — Clause 9.1 (Monitoring, Measurement, Analysis)

ISO 42001 Clause 9.1 requires organisations to determine what needs to be monitored and measured, and when monitoring and measuring shall be performed. AG-417 addresses a prerequisite for effective monitoring: that the data collected through monitoring is representative and unbiased. An organisation conforming to ISO 42001 that relies on biased telemetry has a gap between its monitoring intent (what it believes it is monitoring) and its monitoring reality (what the biased sample actually observes).

DORA — Article 10 (Detection)

DORA Article 10 requires financial entities to have mechanisms to promptly detect anomalous activities. Telemetry sampling bias directly impacts detection capability. If the sampling strategy systematically under-samples the event classes where anomalies are most likely to manifest — rare errors, timeout patterns, fallback invocations — the detection mechanism fails. Organisations must demonstrate that their detection capabilities extend to all relevant event classes, including those that are rare in normal operation.

10. Failure Severity

Field           | Value
Severity Rating | High
Blast Radius    | Organisation-wide — affects the validity of all observability, alerting, and monitoring systems that consume sampled telemetry, potentially masking failures across every deployed agent

Consequence chain: Biased telemetry sampling creates a systematic blind spot in the organisation's observability posture. The immediate failure is statistical: critical event classes are under-represented in the sampled dataset, meaning dashboards, alerts, and reports based on sampled data are misleading. The operational consequence is delayed or missed detection of safety violations, compliance breaches, financial errors, and novel failure modes. Operators see green dashboards and conclude that agents are healthy, while unseen failures accumulate. The business consequence compounds over time: each day of undetected failure increases the remediation cost, the regulatory exposure, and the customer harm. When the failure is eventually discovered — typically through a customer complaint, an audit finding, or an incident severe enough to be noticed despite the sampling bias — the organisation faces both the direct cost of the accumulated failures and the meta-failure of demonstrating that its monitoring was inadequate. In regulated sectors, this meta-failure — the inability to demonstrate effective monitoring — is itself a compliance violation, triggering regulatory action independent of the underlying failures that went undetected. The interaction with AG-022 (Behavioural Drift Detection) amplifies the consequence: if drift detection operates on biased telemetry, it cannot detect drift in under-sampled event classes, meaning the organisation's drift detection and sampling bias create mutually reinforcing blind spots.

Cross-references: AG-409 (Critical Event Taxonomy Governance) provides the event classification that drives tier-aware sampling. AG-022 (Behavioural Drift Detection) consumes sampled telemetry and is directly affected by sampling bias. AG-410 (High-Cardinality Trace Retention Governance) addresses retention of high-cardinality trace data that AG-417 ensures is adequately sampled. AG-413 (Observer-of-Observer Integrity Governance) monitors the health of the observability pipeline itself, including sampling components. AG-414 (Alert Deduplication Governance) operates downstream of sampling — alerts can only fire on events that survive sampling. AG-418 (Cross-System Trace Correlation Governance) requires that correlated events across systems are consistently sampled to maintain trace completeness.

Cite this protocol
AgentGoverning. (2026). AG-417: Telemetry Sampling Bias Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-417