Assurance Sampling Governance requires that organisations define statistically and operationally sound sampling strategies for ongoing control testing. No governance framework can test every evidence object, every agent action, or every control invocation exhaustively — the volumes are too high. Sampling is therefore inevitable, but undisciplined sampling produces unreliable assurance. The sampling strategy MUST define sample sizes based on statistical confidence requirements, specify selection methods that avoid bias, adjust sampling intensity based on risk and control history, and document the confidence level and margin of error that the sampling provides. Without governed sampling, assurance testing may use samples that are too small, biased, or unrepresentative, producing conclusions that do not reflect the actual control population.
Scenario A — Convenience Sampling Misses Systematic Failure: An assessor tests AG-001 (Operational Boundary Enforcement) by reviewing 25 blocked action records selected from the most recent 48 hours. All 25 records demonstrate correct enforcement. The assessor concludes that AG-001 is operating effectively. However, the 25 records are all from weekday business hours when transaction volumes are moderate and the enforcement gateway has ample capacity. During weekend batch processing (which generates 60% of all transactions), the enforcement gateway experiences resource contention and intermittently fails to block over-limit transactions. The weekend failure pattern has existed for 4 months and has allowed 347 over-limit transactions totalling £2.3 million. The convenience sample — recent, business-hours-only — systematically excluded the failure window.
What went wrong: The sample was selected by convenience (most recent, easily accessible) rather than by a method designed to represent the full population. The sample excluded weekend operations, which constitute the majority of transaction volume and contain the failure pattern. Consequence: 4 months of undetected enforcement failure, £2.3 million in over-limit transactions, false assurance in the assessment report.
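The failure mode can be made concrete with a small simulation on synthetic data. The population shape and the 2% weekend failure rate below are illustrative assumptions, not figures from the scenario; the point is that a convenience sample drawn only from the clean window can never observe the failure, while even a small stratified random sample has a real chance of surfacing it:

```python
import random

random.seed(7)

# Synthetic population (illustrative numbers only): 40% weekday records
# where enforcement is always correct, and 60% weekend records with an
# assumed 2% rate of intermittent enforcement failure.
weekday = [{"window": "weekday", "ok": True} for _ in range(4_000)]
weekend = [{"window": "weekend", "ok": random.random() >= 0.02} for _ in range(6_000)]

def deficiencies(sample):
    """Count records where enforcement failed."""
    return sum(1 for rec in sample if not rec["ok"])

# Convenience sample: the 25 most accessible (weekday-only) records --
# it systematically excludes the weekend failure window.
convenience = weekday[:25]

# Stratified random sample of the same total size, drawn from both
# operating windows, repeated to estimate how often it detects a failure.
trials = 1_000
detected = sum(
    1 for _ in range(trials)
    if deficiencies(random.sample(weekday, 10) + random.sample(weekend, 15)) > 0
)

print("convenience sample deficiencies:", deficiencies(convenience))
print(f"stratified sample finds >=1 deficiency in {detected}/{trials} trials")
```

Note that 25 records is still a small sample either way; the stratified draw only detects the failure in a minority of trials. Representativeness fixes the bias problem, but adequate sample size (Scenario B) is a separate requirement.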
Scenario B — Statistically Inadequate Sample Size Fails to Detect Deficiency Rate: An organisation has 50,000 evidence objects from AG-007 (Governance Configuration Control) covering the assessment period. The assessor reviews 10 objects and finds no deficiencies. The assessor reports "no deficiencies found, control operating effectively." The actual deficiency rate in the population is 3% (1,500 deficient objects). With a sample of 10 from a population of 50,000, the probability of detecting a 3% deficiency rate is approximately 26% — there is a 74% probability that the sample contains zero deficient objects even though the population contains 1,500. The assessor's conclusion is statistically unsupported.
What went wrong: The sample size was not calculated based on statistical requirements. A sample of 10 from 50,000 provides no meaningful assurance about deficiency rates. To detect a 3% deficiency rate with 95% confidence, the sample would need to be approximately 100 objects. Consequence: False assurance — a 3% deficiency rate goes undetected, 1,500 deficient configuration records persist.
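The figures in Scenario B follow from the standard discovery-sampling relationship. A minimal sketch (binomial approximation; for a population of 50,000 the finite-population effect is negligible at these sample sizes):

```python
import math

def detection_probability(rate: float, n: int) -> float:
    """Probability that a sample of n contains at least one deficient
    item, assuming deficiencies occur independently at the given rate."""
    return 1 - (1 - rate) ** n

def discovery_sample_size(rate: float, confidence: float) -> int:
    """Smallest n giving at least `confidence` probability of observing
    one or more deficient items (discovery sampling)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - rate))

print(round(detection_probability(0.03, 10), 3))  # 0.263 -- the ~26% in Scenario B
print(discovery_sample_size(0.03, 0.95))          # 99 -- roughly 100 objects
```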
Scenario C — Fixed Sampling Ignores Risk Differentiation: An organisation uses the same sampling rate (5% of evidence objects) for all controls regardless of risk. For a low-risk control producing 1,000 evidence objects per year, this means 50 samples — adequate. For a Critical control producing 500,000 evidence objects per year, the same rate means 25,000 samples — far more than needed for statistical confidence, consuming disproportionate assessment resources. Meanwhile, for a newly implemented Critical control with only 200 evidence objects, the 5% rate yields 10 samples — statistically inadequate for a Critical control. The fixed-rate approach misallocates resources: too many samples where populations are large, too few where populations are small — regardless of the risk each control carries.
What went wrong: The sampling strategy did not differentiate based on risk level, control criticality, or population size. A fixed percentage is not a sound sampling strategy because it ignores the relationship between sample size, population size, and desired confidence level. Consequence: Assessment resources wasted on over-sampling large populations, inadequate assurance for Critical controls with small populations.
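The plateau effect that makes a fixed percentage unsound can be sketched with Cochran's attribute-sampling formula plus a finite-population correction. The figures below assume the most conservative proportion (p = 0.5) at 95%/±5%; with a lower expected deficiency rate, or the 90%/±10% tier for low-risk controls, the required sizes shrink, but the shape is the same — required sample size barely grows once the population is large:

```python
import math

def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's attribute-sampling formula with finite-population
    correction. z=1.96 corresponds to 95% confidence; p=0.5 is the most
    conservative assumed proportion. Illustrative, not normative."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Scenario C's three populations: fixed 5% rate vs. a calculated size.
for population in (200, 1_000, 500_000):
    fixed = math.ceil(population * 0.05)
    print(f"N={population:>7}: fixed 5% -> {fixed:>6}, 95%/±5% -> {sample_size(population)}")
```

At N = 500,000 the calculated size is 384, not 25,000; at N = 200 it is 132, not 10. The fixed rate scales with the wrong variable.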
Scope: This dimension applies to all ongoing control testing and conformance assessment activities conducted under the Agent Governance Standard. "Sampling" includes any process where a subset of a population (evidence objects, agent actions, configuration records, test results) is selected to represent the whole population for assessment purposes. The scope covers the definition of the sampling strategy, the calculation of sample sizes, the selection methodology, the documentation of confidence parameters, and the adjustment of sampling based on risk and historical performance. The scope does not extend to exhaustive testing requirements specified in individual control test specifications (e.g., specific adversarial tests that must be run against every agent); those are exhaustive by design and are not subject to sampling.
4.1. A conforming system MUST define a documented sampling strategy for each control subject to ongoing testing, specifying: the population definition, the sample size calculation method, the selection method, the confidence level, and the margin of error.
4.2. A conforming system MUST calculate sample sizes using accepted statistical methods, achieving at minimum a 95% confidence level with a margin of error not exceeding 5% for controls rated High or Critical severity, and a 90% confidence level with a margin of error not exceeding 10% for controls rated Medium or Low.
4.3. A conforming system MUST use selection methods that avoid systematic bias — at minimum, simple random sampling, stratified random sampling, or systematic sampling with a random start point. Convenience sampling, judgmental sampling, and most-recent-only sampling are not acceptable as primary methods.
4.4. A conforming system MUST adjust sampling intensity based on risk: higher-risk controls and controls with a history of deficiencies receive larger samples and more frequent testing than lower-risk controls with clean histories.
4.5. A conforming system MUST document the confidence parameters of each sampling result — the confidence level, margin of error, population size, sample size, and any stratification applied — alongside the assessment findings.
4.6. A conforming system MUST increase sampling intensity when a deficiency is detected in a sample, expanding the sample to determine the extent and pattern of the deficiency before concluding the assessment.
4.7. A conforming system SHOULD implement stratified sampling for populations with known subgroups (e.g., different time periods, different agent profiles, different operating modes), ensuring each subgroup is represented proportionally or over-represented based on risk.
4.8. A conforming system SHOULD automate sample selection and extraction to eliminate human selection bias and reduce the effort of evidence collection.
4.9. A conforming system MAY implement adaptive sampling — dynamically adjusting sample sizes within an assessment based on emerging results, increasing samples when early results suggest elevated deficiency rates and decreasing when early results suggest low rates.
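Of the selection methods permitted by 4.3, systematic sampling with a random start point is the easiest to sketch. The helper below is an illustrative implementation under the assumption of a simple indexed population, not a prescribed one:

```python
import random

def systematic_sample(population_size: int, n: int) -> list[int]:
    """Systematic sampling with a random start point (one of the
    unbiased selection methods permitted by 4.3): take every k-th
    index after a uniformly random offset within the first interval."""
    k = population_size / n          # sampling interval
    start = random.uniform(0, k)     # random start removes alignment bias
    return [int(start + i * k) for i in range(n)]

random.seed(11)
indices = systematic_sample(50_000, 99)
print(len(set(indices)), min(indices), max(indices))
```

The random start matters: a fixed start of 0 would always select the same records on repeat assessments, and a fixed interval anchored to a periodic feature of the population (e.g. always the first record of each day) would reintroduce the very period bias that 4.3 prohibits.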
Sampling is a statistical discipline, not a casual process. The purpose of sampling in governance assurance is to draw conclusions about a population (all evidence objects, all agent actions, all control invocations) by examining a subset. The conclusions are only valid if the subset is selected and sized according to statistical principles.
Three problems arise without sampling governance. First, inadequate sample sizes produce unreliable conclusions. A sample of 10 from a population of 100,000 tells the assessor almost nothing about the population's characteristics. Assessors without statistical training commonly select samples that feel adequate but are statistically meaningless. Second, biased selection methods produce unrepresentative samples. Convenience sampling (selecting the most recent, most accessible, or most visible evidence) systematically excludes edge cases, failure windows, and anomalous periods — exactly the items most relevant to assurance. Third, uniform sampling across controls misallocates assurance resources. Critical controls require more intensive sampling than low-risk controls, and controls with deficiency histories require more intensive sampling than controls with clean records.
AG-227 establishes sampling governance as a meta-governance function — ensuring that the sampling underlying all assurance activities is statistically sound, operationally appropriate, and risk-differentiated. This is particularly important for AI agent governance, where evidence volumes can be massive (millions of action records per year) and failure patterns may be concentrated in specific operating modes, time windows, or agent configurations that convenience sampling would miss.
The sampling strategy should be documented as a structured artefact (sampling plan) for each control subject to ongoing testing. The plan should specify: population definition, population size (or estimated range), desired confidence level, desired margin of error, calculated sample size, selection method, stratification criteria (if applicable), and any adjustments for risk or deficiency history.
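One way to capture the sampling plan as a structured artefact is a small record type. The field names below are a hypothetical illustration of the elements listed above, not a schema mandated by the standard:

```python
from dataclasses import dataclass, field

@dataclass
class SamplingPlan:
    """Hypothetical per-control sampling-plan artefact (illustrative)."""
    control_id: str                 # e.g. "AG-007"
    population_definition: str      # what counts as one population member
    population_size: int            # actual count or best estimate
    confidence_level: float         # e.g. 0.95 for High/Critical controls
    margin_of_error: float          # e.g. 0.05
    sample_size: int                # calculated, never guessed
    selection_method: str           # "simple_random" | "stratified" | "systematic"
    strata: list[str] = field(default_factory=list)  # optional subgroups
    risk_adjustment: str = "none"   # rationale for any size uplift

plan = SamplingPlan(
    control_id="AG-007",
    population_definition="governance configuration evidence objects, assessment period",
    population_size=50_000,
    confidence_level=0.95,
    margin_of_error=0.05,
    sample_size=384,
    selection_method="stratified",
    strata=["weekday", "weekend"],
)
print(plan.control_id, plan.sample_size)
```

Keeping the plan as structured data rather than free text makes it straightforward to attach the confidence parameters to each assessment finding, as 4.5 requires, and to automate sample extraction against the evidence schema.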
Recommended patterns:
Anti-patterns to avoid:
Basic Implementation — Documented sampling plans exist for all material controls subject to ongoing testing. Sample sizes are calculated using accepted statistical methods. Selection uses random or systematic methods, not convenience sampling. Confidence parameters are documented alongside assessment findings. Sampling intensity varies by control risk level.
Intermediate Implementation — Sampling plans are integrated with the evidence schema (AG-221) for automated sample extraction. Stratified sampling is used for populations with known subgroups. Deficiency-triggered sample expansion is implemented. Sample size calculators are standardised across all assessments. Sampling plans are reviewed and updated annually based on population changes and deficiency history.
Advanced Implementation — All intermediate capabilities plus: adaptive sampling dynamically adjusts sample sizes within assessments based on emerging results. Automated sampling and analysis tools process large evidence populations with minimal manual effort. Sampling effectiveness is validated periodically by comparing sample-based conclusions against exhaustive analysis of selected populations. Cross-organisation sampling standards enable consistent assurance across supply chains.
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Sample Size Statistical Validity
Test 8.2: Selection Method Bias Prevention
Test 8.3: Risk-Based Sampling Intensity Differentiation
Test 8.4: Deficiency-Triggered Sample Expansion
Test 8.5: Confidence Parameter Documentation
Test 8.6: Stratification for Heterogeneous Populations
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management — Testing) | Supports compliance |
| ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis) | Supports compliance |
| ISA 530 | Audit Sampling | Direct requirement |
| PCAOB AS 2315 | Audit Sampling | Direct requirement |
| FCA SYSC | 6.1.1R (Adequate Systems and Controls) | Supports compliance |
| NIST AI RMF | MEASURE 2.5 (Evaluation — Testing Approaches) | Supports compliance |
| ISO 2859-1 | Sampling Procedures for Inspection by Attributes | Supports compliance |
ISA 530 (International Standard on Auditing) and PCAOB AS 2315 establish the requirements for audit sampling in financial audits. These standards require: defining the objective of the test, determining the tolerable error rate, determining the expected error rate, calculating the sample size to achieve the desired confidence, using appropriate selection methods, and evaluating results with statistical rigour. AG-227 applies these established audit sampling principles to AI governance assurance, ensuring that the sampling underlying control assessments meets the same standards as financial audit sampling.
Clause 9.1 requires organisations to determine what needs to be monitored and measured, and when. For ongoing control testing, this includes determining how to sample from large evidence populations to draw reliable conclusions. AG-227 provides the methodological framework for this sampling.
ISO 2859-1 defines sampling procedures for inspection by attributes — determining the acceptability of a population based on a sample. AG-227's requirements for sample size calculation, selection methods, and deficiency evaluation align with ISO 2859-1's established procedures, providing a familiar methodological framework for assessors with quality management backgrounds.
| Field | Value |
|---|---|
| Severity Rating | Medium |
| Blast Radius | Organisation-wide — affects the reliability of all sampling-based assurance conclusions |
Consequence chain: Without sampling governance, assurance conclusions drawn from testing are statistically unsupported. The immediate failure mode is unreliable assessment — assessors draw conclusions from samples that are too small, biased, or unrepresentative. The downstream consequence is false assurance or missed deficiencies: either the assessment concludes "no deficiencies" when the sample was inadequate to detect them, or the assessment detects deficiencies but the sample provides no information about their extent. The ultimate business consequence is governance decisions based on unreliable information — continuing to operate controls that are actually deficient, or remediating isolated incidents when the underlying pattern is systematic. While sampling governance failure does not directly create operational risk (unlike, say, AG-001 failure), it undermines the assurance infrastructure that detects and prevents operational risk, making it a foundational concern.
Cross-references: AG-221 (Assurance Evidence Schema Governance) provides the structured evidence that sampling selects from. AG-226 (Independent Audit Challenge Governance) uses sampling methodologies governed by AG-227. AG-056 (Independent Validation) applies sampling during validation activities. AG-157 (External Conformance Assessment) uses sampling during external assessments. AG-153 (Control Efficacy Measurement) consumes sampling-based assessment results as inputs to efficacy measurement.