Assurance Sampling Governance requires that organisations define statistically and operationally sound sampling strategies for ongoing control testing. No governance framework can test every evidence object, every agent action, or every control invocation exhaustively — the volumes are too high. Sampling is therefore inevitable, but undisciplined sampling produces unreliable assurance. The sampling strategy MUST define sample sizes based on statistical confidence requirements, specify selection methods that avoid bias, adjust sampling intensity based on risk and control history, and document the confidence level and margin of error that the sampling provides. Without governed sampling, assurance testing may use samples that are too small, biased, or unrepresentative, producing conclusions that do not reflect the actual control population.
Scenario A — Convenience Sampling Misses Systematic Failure: An assessor tests AG-001 (Operational Boundary Enforcement) by reviewing 25 blocked action records selected from the most recent 48 hours. All 25 records demonstrate correct enforcement. The assessor concludes that AG-001 is operating effectively. However, the 25 records are all from weekday business hours when transaction volumes are moderate and the enforcement gateway has ample capacity. During weekend batch processing (which generates 60% of all transactions), the enforcement gateway experiences resource contention and intermittently fails to block over-limit transactions. The weekend failure pattern has existed for 4 months and has allowed 347 over-limit transactions totalling £2.3 million. The convenience sample — recent, business-hours-only — systematically excluded the failure window.
What went wrong: The sample was selected by convenience (most recent, easily accessible) rather than by a method designed to represent the full population. The sample excluded weekend operations, which constitute the majority of transaction volume and contain the failure pattern. Consequence: 4 months of undetected enforcement failure, £2.3 million in over-limit transactions, false assurance in the assessment report.
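The failure mode can be made concrete with a small simulation on synthetic data. The population shape and the 2% weekend failure rate below are illustrative assumptions, not figures from the scenario; the point is that a convenience sample drawn only from the clean window can never observe the failure, while even a small stratified random sample has a real chance of surfacing it:

```python
import random

random.seed(7)

# Synthetic population (illustrative numbers only): 40% weekday records
# where enforcement is always correct, and 60% weekend records with an
# assumed 2% rate of intermittent enforcement failure.
weekday = [{"window": "weekday", "ok": True} for _ in range(4_000)]
weekend = [{"window": "weekend", "ok": random.random() >= 0.02} for _ in range(6_000)]

def deficiencies(sample):
    """Count records where enforcement failed."""
    return sum(1 for rec in sample if not rec["ok"])

# Convenience sample: the 25 most accessible (weekday-only) records --
# it systematically excludes the weekend failure window.
convenience = weekday[:25]

# Stratified random sample of the same total size, drawn from both
# operating windows, repeated to estimate how often it detects a failure.
trials = 1_000
detected = sum(
    1 for _ in range(trials)
    if deficiencies(random.sample(weekday, 10) + random.sample(weekend, 15)) > 0
)

print("convenience sample deficiencies:", deficiencies(convenience))
print(f"stratified sample finds >=1 deficiency in {detected}/{trials} trials")
```

Note that 25 records is still a small sample either way; the stratified draw only detects the failure in a minority of trials. Representativeness fixes the bias problem, but adequate sample size (Scenario B) is a separate requirement.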
Scenario B — Statistically Inadequate Sample Size Fails to Detect Deficiency Rate: An organisation has 50,000 evidence objects from AG-007 (Governance Configuration Control) covering the assessment period. The assessor reviews 10 objects and finds no deficiencies. The assessor reports "no deficiencies found, control operating effectively." The actual deficiency rate in the population is 3% (1,500 deficient objects). With a sample of 10 from a population of 50,000, the probability of detecting a 3% deficiency rate is approximately 26% — there is a 74% probability that the sample contains zero deficient objects even though the population contains 1,500. The assessor's conclusion is statistically unsupported.
What went wrong: The sample size was not calculated based on statistical requirements. A sample of 10 from 50,000 provides no meaningful assurance about deficiency rates. To detect a 3% deficiency rate with 95% confidence, the sample would need to be approximately 100 objects. Consequence: False assurance — a 3% deficiency rate goes undetected, 1,500 deficient configuration records persist.
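The figures in Scenario B follow from the standard discovery-sampling relationship. A minimal sketch (binomial approximation; for a population of 50,000 the finite-population effect is negligible at these sample sizes):

```python
import math

def detection_probability(rate: float, n: int) -> float:
    """Probability that a sample of n contains at least one deficient
    item, assuming deficiencies occur independently at the given rate."""
    return 1 - (1 - rate) ** n

def discovery_sample_size(rate: float, confidence: float) -> int:
    """Smallest n giving at least `confidence` probability of observing
    one or more deficient items (discovery sampling)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - rate))

print(round(detection_probability(0.03, 10), 3))  # 0.263 -- the ~26% in Scenario B
print(discovery_sample_size(0.03, 0.95))          # 99 -- roughly 100 objects
```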
Scenario C — Fixed Sampling Ignores Risk Differentiation: An organisation uses the same sampling rate (5% of evidence objects) for all controls regardless of risk. For a low-risk control producing 1,000 evidence objects per year, this means 50 samples — adequate. For a Critical control producing 500,000 evidence objects per year, the same rate means 25,000 samples — far more than needed for statistical confidence, consuming disproportionate assessment resources. Meanwhile, for a newly implemented Critical control with only 200 evidence objects, the 5% rate yields 10 samples — statistically inadequate for a Critical control. The fixed-rate approach misallocates resources: too many samples where populations are large, too few where populations are small — regardless of the risk each control carries.
What went wrong: The sampling strategy did not differentiate based on risk level, control criticality, or population size. A fixed percentage is not a sound sampling strategy because it ignores the relationship between sample size, population size, and desired confidence level. Consequence: Assessment resources wasted on over-sampling large populations, inadequate assurance for Critical controls with small populations.
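The plateau effect that makes a fixed percentage unsound can be sketched with Cochran's attribute-sampling formula plus a finite-population correction. The figures below assume the most conservative proportion (p = 0.5) at 95%/±5%; with a lower expected deficiency rate, or the 90%/±10% tier for low-risk controls, the required sizes shrink, but the shape is the same — required sample size barely grows once the population is large:

```python
import math

def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's attribute-sampling formula with finite-population
    correction. z=1.96 corresponds to 95% confidence; p=0.5 is the most
    conservative assumed proportion. Illustrative, not normative."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Scenario C's three populations: fixed 5% rate vs. a calculated size.
for population in (200, 1_000, 500_000):
    fixed = math.ceil(population * 0.05)
    print(f"N={population:>7}: fixed 5% -> {fixed:>6}, 95%/±5% -> {sample_size(population)}")
```

At N = 500,000 the calculated size is 384, not 25,000; at N = 200 it is 132, not 10. The fixed rate scales with the wrong variable.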
Scope: This dimension applies to all ongoing control testing and conformance assessment activities conducted under the Agent Governance Standard. "Sampling" includes any process where a subset of a population (evidence objects, agent actions, configuration records, test results) is selected to represent the whole population for assessment purposes. The scope covers the definition of the sampling strategy, the calculation of sample sizes, the selection methodology, the documentation of confidence parameters, and the adjustment of sampling based on risk and historical performance. The scope does not extend to exhaustive testing requirements specified in individual control test specifications (e.g., specific adversarial tests that must be run against every agent); those are exhaustive by design and are not subject to sampling.
4.1. A conforming system MUST define a documented sampling strategy for each control subject to ongoing testing, specifying: the population definition, the sample size calculation method, the selection method, the confidence level, and the margin of error.
4.2. A conforming system MUST calculate sample sizes using accepted statistical methods, achieving at minimum a 95% confidence level with a margin of error not exceeding 5% for controls rated High or Critical severity, and a 90% confidence level with a margin of error not exceeding 10% for controls rated Medium or Low.
4.3. A conforming system MUST use selection methods that avoid systematic bias — at minimum, simple random sampling, stratified random sampling, or systematic sampling with a random start point. Convenience sampling, judgmental sampling, and most-recent-only sampling are not acceptable as primary methods.
4.4. A conforming system MUST adjust sampling intensity based on risk: higher-risk controls and controls with a history of deficiencies receive larger samples and more frequent testing than lower-risk controls with clean histories.
4.5. A conforming system MUST document the confidence parameters of each sampling result — the confidence level, margin of error, population size, sample size, and any stratification applied — alongside the assessment findings.
4.6. A conforming system MUST increase sampling intensity when a deficiency is detected in a sample, expanding the sample to determine the extent and pattern of the deficiency before concluding the assessment.
4.7. A conforming system SHOULD implement stratified sampling for populations with known subgroups (e.g., different time periods, different agent profiles, different operating modes), ensuring each subgroup is represented proportionally or over-represented based on risk.
4.8. A conforming system SHOULD automate sample selection and extraction to eliminate human selection bias and reduce the effort of evidence collection.
4.9. A conforming system MAY implement adaptive sampling — dynamically adjusting sample sizes within an assessment based on emerging results, increasing samples when early results suggest elevated deficiency rates and decreasing when early results suggest low rates.
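Of the selection methods permitted by 4.3, systematic sampling with a random start point is the easiest to sketch. The helper below is an illustrative implementation under the assumption of a simple indexed population, not a prescribed one:

```python
import random

def systematic_sample(population_size: int, n: int) -> list[int]:
    """Systematic sampling with a random start point (one of the
    unbiased selection methods permitted by 4.3): take every k-th
    index after a uniformly random offset within the first interval."""
    k = population_size / n          # sampling interval
    start = random.uniform(0, k)     # random start removes alignment bias
    return [int(start + i * k) for i in range(n)]

random.seed(11)
indices = systematic_sample(50_000, 99)
print(len(set(indices)), min(indices), max(indices))
```

The random start matters: a fixed start of 0 would always select the same records on repeat assessments, and a fixed interval anchored to a periodic feature of the population (e.g. always the first record of each day) would reintroduce the very period bias that 4.3 prohibits.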
Sampling is a statistical discipline, not a casual process. The purpose of sampling in governance assurance is to draw conclusions about a population (all evidence objects, all agent actions, all control invocations) by examining a subset. The conclusions are only valid if the subset is selected and sized according to statistical principles.
Three problems arise without sampling governance. First, inadequate sample sizes produce unreliable conclusions. A sample of 10 from a population of 100,000 tells the assessor almost nothing about the population's characteristics. Assessors without statistical training commonly select samples that feel adequate but are statistically meaningless. Second, biased selection methods produce unrepresentative samples. Convenience sampling (selecting the most recent, most accessible, or most visible evidence) systematically excludes edge cases, failure windows, and anomalous periods — exactly the items most relevant to assurance. Third, uniform sampling across controls misallocates assurance resources. Critical controls require more intensive sampling than low-risk controls, and controls with deficiency histories require more intensive sampling than controls with clean records.
AG-227 establishes sampling governance as a meta-governance function — ensuring that the sampling underlying all assurance activities is statistically sound, operationally appropriate, and risk-differentiated. This is particularly important for AI agent governance, where evidence volumes can be massive (millions of action records per year) and failure patterns may be concentrated in specific operating modes, time windows, or agent configurations that convenience sampling would miss.
The sampling strategy should be documented as a structured artefact (sampling plan) for each control subject to ongoing testing. The plan should specify: population definition, population size (or estimated range), desired confidence level, desired margin of error, calculated sample size, selection method, stratification criteria (if applicable), and any adjustments for risk or deficiency history.
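One way to capture the sampling plan as a structured artefact is a small record type. The field names below are a hypothetical illustration of the elements listed above, not a schema mandated by the standard:

```python
from dataclasses import dataclass, field

@dataclass
class SamplingPlan:
    """Hypothetical per-control sampling-plan artefact (illustrative)."""
    control_id: str                 # e.g. "AG-007"
    population_definition: str      # what counts as one population member
    population_size: int            # actual count or best estimate
    confidence_level: float         # e.g. 0.95 for High/Critical controls
    margin_of_error: float          # e.g. 0.05
    sample_size: int                # calculated, never guessed
    selection_method: str           # "simple_random" | "stratified" | "systematic"
    strata: list[str] = field(default_factory=list)  # optional subgroups
    risk_adjustment: str = "none"   # rationale for any size uplift

plan = SamplingPlan(
    control_id="AG-007",
    population_definition="governance configuration evidence objects, assessment period",
    population_size=50_000,
    confidence_level=0.95,
    margin_of_error=0.05,
    sample_size=384,
    selection_method="stratified",
    strata=["weekday", "weekend"],
)
print(plan.control_id, plan.sample_size)
```

Keeping the plan as structured data rather than free text makes it straightforward to attach the confidence parameters to each assessment finding, as 4.5 requires, and to automate sample extraction against the evidence schema.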
Recommended patterns:
Anti-patterns to avoid:
Basic Implementation — Documented sampling plans exist for all material controls subject to ongoing testing. Sample sizes are calculated using accepted statistical methods. Selection uses random or systematic methods, not convenience sampling. Confidence parameters are documented alongside assessment findings. Sampling intensity varies by control risk level.
Intermediate Implementation — Sampling plans are integrated with the evidence schema (AG-221) for automated sample extraction. Stratified sampling is used for populations with known subgroups. Deficiency-triggered sample expansion is implemented. Sample size calculators are standardised across all assessments. Sampling plans are reviewed and updated annually based on population changes and deficiency history.
Advanced Implementation — All intermediate capabilities plus: adaptive sampling dynamically adjusts sample sizes within assessments based on emerging results. Automated sampling and analysis tools process large evidence populations with minimal manual effort. Sampling effectiveness is validated periodically by comparing sample-based conclusions against exhaustive analysis of selected populations. Cross-organisation sampling standards enable consistent assurance across supply chains.
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Sample Size Statistical Validity
Test 8.2: Selection Method Bias Prevention
Test 8.3: Risk-Based Sampling Intensity Differentiation
Test 8.4: Deficiency-Triggered Sample Expansion
Test 8.5: Confidence Parameter Documentation
Test 8.6: Stratification for Heterogeneous Populations
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management — Testing) | Supports compliance |
| ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis) | Supports compliance |
| ISA 530 | Audit Sampling | Direct requirement |
| PCAOB AS 2315 | Audit Sampling | Direct requirement |
| FCA SYSC | 6.1.1R (Adequate Systems and Controls) | Supports compliance |
| NIST AI RMF | MEASURE 2.5 (Evaluation — Testing Approaches) | Supports compliance |
| ISO 2859-1 | Sampling Procedures for Inspection by Attributes | Supports compliance |
ISA 530 (International Standard on Auditing) and PCAOB AS 2315 establish the requirements for audit sampling in financial audits. These standards require: defining the objective of the test, determining the tolerable error rate, determining the expected error rate, calculating the sample size to achieve the desired confidence, using appropriate selection methods, and evaluating results with statistical rigour. AG-227 applies these established audit sampling principles to AI governance assurance, ensuring that the sampling underlying control assessments meets the same standards as financial audit sampling.
Clause 9.1 requires organisations to determine what needs to be monitored and measured, and when. For ongoing control testing, this includes determining how to sample from large evidence populations to draw reliable conclusions. AG-227 provides the methodological framework for this sampling.
ISO 2859-1 defines sampling procedures for inspection by attributes — determining the acceptability of a population based on a sample. AG-227's requirements for sample size calculation, selection methods, and deficiency evaluation align with ISO 2859-1's established procedures, providing a familiar methodological framework for assessors with quality management backgrounds.
| Field | Value |
|---|---|
| Severity Rating | Medium |
| Blast Radius | Organisation-wide — affects the reliability of all sampling-based assurance conclusions |
Consequence chain: Without sampling governance, assurance conclusions drawn from testing are statistically unsupported. The immediate failure mode is unreliable assessment — assessors draw conclusions from samples that are too small, biased, or unrepresentative. The downstream consequence is false assurance or missed deficiencies: either the assessment concludes "no deficiencies" when the sample was inadequate to detect them, or the assessment detects deficiencies but the sample provides no information about their extent. The ultimate business consequence is governance decisions based on unreliable information — continuing to operate controls that are actually deficient, or remediating isolated incidents when the underlying pattern is systematic. While sampling governance failure does not directly create operational risk (unlike, say, AG-001 failure), it undermines the assurance infrastructure that detects and prevents operational risk, making it a foundational concern.
Cross-references: AG-221 (Assurance Evidence Schema Governance) provides the structured evidence that sampling selects from. AG-226 (Independent Audit Challenge Governance) uses sampling methodologies governed by AG-227. AG-056 (Independent Validation) applies sampling during validation activities. AG-157 (External Conformance Assessment) uses sampling during external assessments. AG-153 (Control Efficacy Measurement) consumes sampling-based assessment results as inputs to efficacy measurement.