Rule-Test Coverage Governance requires that every policy rule deployed in production has a structured, maintained test suite covering its logic, edge cases, boundary conditions, and interactions with other rules including precedence conflicts. Test coverage is not a one-time compilation check (which AG-270 addresses) but an ongoing operational discipline ensuring that the rule set continues to behave correctly as rules are added, modified, and removed over time. This dimension mandates minimum coverage thresholds, edge-case enumeration, precedence-conflict testing, and regression testing for every rule change, treating policy test coverage with the same rigour as unit test coverage in safety-critical software.
Scenario A — Untested Rule Interaction Creates Eligibility Gap: A customer-facing agent uses 87 policy rules to determine product eligibility. Rule 34 states: "Customers aged 18-25 qualify for the starter tier." Rule 62 states: "Customers with income below £15,000 are restricted to basic products only." Neither rule is tested in combination. A 22-year-old customer with income of £12,000 is simultaneously eligible for the starter tier (Rule 34) and restricted to basic products (Rule 62). The rule engine resolves this by evaluating rules in numerical order, so Rule 62 overrides Rule 34 — but this precedence was never tested or documented. When Rule 34 is later reordered to execute after Rule 62 during a refactoring, the behaviour flips and the eligibility gap is not detected for 6 weeks.
What went wrong: Individual rules were tested in isolation but rule interactions were not tested. The implicit precedence (evaluation order) was not covered by any test. When evaluation order changed, the test suite did not catch the behavioural change. Consequence: 6 weeks of incorrect eligibility determinations affecting approximately 1,400 customers, regulatory complaint, remediation cost estimated at £420,000.
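A minimal sketch of the interaction test that would have caught this (the engine, rule functions, and tier labels below are hypothetical; the scenario does not specify the implementation). The point is that the test pins the resolved outcome for an input where both rules fire, so a reordering changes a test result instead of production behaviour:

```python
# Hypothetical last-write-wins engine: rules run in list order, and a
# later rule may overwrite the tier set by an earlier one.
def determine_eligibility(age, income, rules):
    tier = "none"
    for rule in rules:
        result = rule(age, income)
        if result is not None:
            tier = result
    return tier

def rule_34_starter_tier(age, income):
    return "starter" if 18 <= age <= 25 else None

def rule_62_low_income_basic(age, income):
    return "basic" if income < 15_000 else None

# Interaction test: pins the outcome when BOTH rules apply. If a
# refactoring reorders the rules, this test fails immediately instead
# of the eligibility gap surfacing in production six weeks later.
def test_low_income_young_customer_restricted_to_basic():
    rules = [rule_34_starter_tier, rule_62_low_income_basic]
    assert determine_eligibility(age=22, income=12_000, rules=rules) == "basic"
```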
Scenario B — Edge Case in Date Handling Causes Policy Failure on Leap Year: A financial-value agent applies a policy rule: "Transactions initiated within 30 days of account opening require enhanced due diligence." The test suite includes tests for day 1, day 15, day 29, day 30, and day 31. No test covers the case where the 30-day window spans 28 February in a leap year. On 29 February 2028, the date arithmetic produces a boundary error: accounts opened on 1 February have their 30-day window calculated as ending on 2 March in non-leap years but 1 March in leap years. The test suite never exercises this case. 23 transactions on 29 February bypass enhanced due diligence.
What went wrong: The test suite covered the obvious boundary (day 30/31) but not the calendar edge case. The date arithmetic bug existed since deployment but only manifested on leap year dates. Consequence: 23 transactions processed without required due diligence, potential anti-money laundering compliance failure, mandatory suspicious activity report review.
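The scenario does not give the faulty implementation, but a plausible shape of the fix is explicit day counting with `timedelta` rather than month-based arithmetic, plus a regression test pinned at the leap date (all names below are illustrative):

```python
from datetime import date, timedelta

def within_edd_window(opened: date, txn: date, days: int = 30) -> bool:
    # Day counting with the opening day as day 1: day 30 for an account
    # opened on 1 February falls on 2 March in a non-leap year and on
    # 1 March in a leap year, matching the policy's "within 30 days".
    return opened <= txn <= opened + timedelta(days=days - 1)

def test_leap_day_falls_inside_window():
    # 29 February 2028 is day 29 for an account opened on 1 February
    # 2028, so enhanced due diligence must still apply.
    assert within_edd_window(date(2028, 2, 1), date(2028, 2, 29))

def test_first_day_after_window_is_outside():
    # Day 31 (2 March 2028 in the leap year) is outside the window.
    assert not within_edd_window(date(2028, 2, 1), date(2028, 3, 2))
```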
Scenario C — Regression After "Minor" Rule Update: An enterprise workflow agent has a rule that requires manager approval for expenses above £500. A policy update changes the threshold to £750. The developer updates the threshold value and runs the existing test suite — all tests pass. However, the existing test suite only tests at £400 (below) and £600 (above). The updated threshold at £750 means the £600 test case now falls below the threshold, which changes the expected behaviour, but the test expectation was not updated. The test still expects "requires approval" for £600, which now fails — but the developer "fixes" this by updating the test expectation without questioning whether the behaviour change is correct. No test exists for the £750 boundary itself.
What went wrong: The test suite did not include boundary-value tests at the actual threshold, and the "fix" to the test was backward: the developer changed the test to match the code rather than verifying the code against the approved policy. Consequence: behaviour at and around the new £750 boundary is entirely unverified. Nothing confirms that the implementation enforces the approved threshold at exactly £750, and any divergence between code and policy across the widened £500-£750 band goes undetected until the next audit, because the only check that could have caught it was rewritten to agree with the code.
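The missing tests are straightforward to write. A sketch of boundary-value tests at the actual threshold (the rule function is hypothetical, and the strict-greater-than reading of "above £750" is an assumption to be confirmed against the policy text):

```python
import pytest

APPROVAL_THRESHOLD = 750  # must match the approved policy document

def requires_manager_approval(amount):
    return amount > APPROVAL_THRESHOLD

# Boundary-value tests at the threshold itself, not only at points far
# from it. If the policy actually means "£750 or more", the expectation
# at 750.00 flips, and that decision must be made against the policy
# text, never against whatever the code happens to do.
@pytest.mark.parametrize("amount,expected", [
    (749.99, False),  # just below the boundary
    (750.00, False),  # exactly at the boundary
    (750.01, True),   # just above the boundary
])
def test_approval_boundary(amount, expected):
    assert requires_manager_approval(amount) is expected
```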
Scope: This dimension applies to all AI agents governed by policy rules that are testable — meaning the rules accept defined inputs and produce deterministic outputs. This includes rule engines, decision tables, policy-as-code systems, scoring models with policy thresholds, and any system where policy logic can be exercised with controlled inputs. Machine learning models that produce probabilistic outputs are partially within scope: the policy thresholds applied to model outputs (e.g., "reject if fraud score > 0.85") are testable even if the model itself is not deterministic. The scope extends to the full lifecycle of the rule set — initial deployment, ongoing operation, and every change.
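To illustrate the scope note on probabilistic models: the deterministic threshold can be tested by fixing the score and exercising the comparison directly, without invoking the model at all (a sketch; the function name and the strict-inequality reading are assumptions):

```python
def should_reject(fraud_score, threshold=0.85):
    # The policy threshold is deterministic even though the upstream
    # model is not; the model is simply not invoked here.
    return fraud_score > threshold

def test_fraud_threshold_boundary():
    assert should_reject(0.86)
    assert not should_reject(0.85)  # "> 0.85" read as strictly greater
    assert not should_reject(0.84)
```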
4.1. A conforming system MUST maintain a structured test suite for every deployed policy rule, covering: at least one positive case (input that triggers the rule), at least one negative case (input that should not trigger the rule), and boundary-value tests at every decision threshold.
4.2. A conforming system MUST test rule interactions where two or more rules can apply to the same input, verifying that the precedence resolution produces the correct outcome per AG-272 and AG-135.
4.3. A conforming system MUST execute the full test suite before any policy change is activated in production, and block activation if any test fails.
4.4. A conforming system MUST achieve 100% rule-test coverage: every deployed rule has at least one test. Coverage of 100% of decision boundaries and 100% of documented rule interactions is likewise a MUST.
4.5. A conforming system MUST update tests when rules change, ensuring that test expectations reflect the approved policy — not the other way around.
4.6. A conforming system SHOULD implement edge-case enumeration for each rule, covering: type boundaries (null, empty, maximum values), temporal boundaries (leap years, daylight saving transitions, midnight, end-of-month), and domain-specific boundaries (currency conversion edge cases, jurisdiction boundary cases).
4.7. A conforming system SHOULD implement mutation testing to verify that the test suite detects injected faults — if inverting a rule's logic does not cause a test failure, the test suite is inadequate.
4.8. A conforming system SHOULD generate coverage reports showing which rules, boundaries, and interactions are tested and which are not, with automated alerts when coverage drops below the threshold.
4.9. A conforming system MAY implement property-based testing for policy rules, generating randomised inputs to discover unexpected rule behaviours not anticipated by manually authored tests.
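A minimal sketch of requirement 4.9 using the Hypothesis library (one possible tool, not a mandated one; the rule is the hypothetical expense threshold from Scenario C). The property asserted is monotonicity: if an amount requires approval, every larger amount must too.

```python
from hypothesis import given, strategies as st

APPROVAL_THRESHOLD = 750

def requires_manager_approval(amount):
    return amount > APPROVAL_THRESHOLD

# Randomised inputs can surface behaviours (negative amounts, very
# large values, awkward floats) that manually authored cases missed.
@given(st.floats(min_value=0, max_value=1e9, allow_nan=False))
def test_approval_is_monotonic_in_amount(amount):
    if requires_manager_approval(amount):
        assert requires_manager_approval(amount + 1)
```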
Test coverage for policy rules serves a fundamentally different purpose from test coverage for application software. In application software, untested code may contain bugs that affect functionality. In policy rules, untested rules may contain errors that affect compliance, safety, and legal liability. A policy rule that produces the wrong output is not a software bug — it is a governance failure that may result in regulatory enforcement, customer harm, or safety incidents.
The interaction between rules is where most production failures occur. Individual rules tested in isolation often work correctly. But when two rules apply to the same input, the interaction (through precedence, exception handling, or evaluation order) can produce outcomes that neither rule's individual tests anticipate. AG-271 specifically requires interaction testing (Requirement 4.2) because this is where real-world failures most commonly arise; one systematic approach is sketched below.
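One way to make interaction coverage systematic is to enumerate rule pairs whose applicability predicates can hold simultaneously, then require a registered interaction test for each such pair. A sketch, with hypothetical rule predicates, probe inputs, and test registry:

```python
from itertools import combinations

# Each rule exposes an "applies" predicate over a customer record.
RULES = {
    "rule_34_starter_tier": lambda c: 18 <= c["age"] <= 25,
    "rule_62_low_income_basic": lambda c: c["income"] < 15_000,
}

PROBES = [
    {"age": 22, "income": 12_000},  # both apply: the Scenario A case
    {"age": 40, "income": 12_000},  # only rule 62
    {"age": 22, "income": 50_000},  # only rule 34
]

def co_applying_pairs(rules, probes):
    pairs = set()
    for (a, pred_a), (b, pred_b) in combinations(rules.items(), 2):
        if any(pred_a(c) and pred_b(c) for c in probes):
            pairs.add((a, b))
    return pairs

# Governance check: every pair that can co-apply must have a registered
# interaction test (the registry below is hypothetical).
INTERACTION_TESTS = {("rule_34_starter_tier", "rule_62_low_income_basic")}

def test_every_co_applying_pair_has_an_interaction_test():
    assert co_applying_pairs(RULES, PROBES) <= INTERACTION_TESTS
```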
Boundary-value testing is mandatory (not merely recommended) because policy thresholds are where errors have the greatest impact. A rule that works correctly for inputs far from the boundary but fails at the boundary will affect exactly the cases where the decision is most consequential — the marginal cases where the outcome could go either way. These are also the cases most likely to be challenged by customers, regulators, or courts.
The requirement that tests must be updated when rules change (4.5) addresses the anti-pattern observed in Scenario C: developers who change test expectations to match code behaviour rather than verifying code behaviour against policy intent. The test suite is a second representation of the policy intent. When a rule changes, the correct workflow is: update the rule, update the test expectations to reflect the new approved policy, then verify that the rule matches the updated expectations. Changing the test to match the code defeats the purpose of testing.
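One mechanical defence against the backward fix is to derive test expectations from the approved policy artefact rather than hard-coding them next to the code. A sketch, assuming the approved threshold is published as a reviewed, access-controlled JSON artefact (the file name and key are hypothetical):

```python
import json

APPROVAL_THRESHOLD = 750  # the value embedded in the code under test

def load_approved_threshold(path="approved_policy.json"):
    # The approved policy artefact is the source of truth; test
    # expectations are derived from it, never from the code under test.
    with open(path) as f:
        return json.load(f)["expense_approval_threshold"]

def test_code_threshold_matches_approved_policy():
    # A developer who changes the code threshold without a corresponding
    # change to the approved policy artefact breaks this test, and
    # "fixing" it requires touching the artefact, which is reviewed
    # under the normal policy-change process.
    assert APPROVAL_THRESHOLD == load_approved_threshold()
```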
Recommended patterns:
Anti-patterns to avoid:
Financial Services. Regulatory rules often have complex interactions — anti-money laundering thresholds interact with customer due diligence rules, which interact with product eligibility rules. The FCA expects firms to test controls "end to end," which for policy rules means testing rule interactions, not just individual rules. The PRA's SS1/23 on model risk management extends testing expectations to rule-based decision systems.
Healthcare. Clinical decision support rules must be tested against clinical scenarios, not just technical inputs. A dosage calculation rule must be tested with realistic patient parameters including extreme values (paediatric, geriatric, renal impairment). The test suite should include cases derived from known adverse events and near-misses.
Critical Infrastructure. Safety-critical policy rules must be tested under failure conditions — what happens when sensor inputs are missing, delayed, or out of range? Edge-case testing for safety rules must include sensor failure modes, communication delays, and concurrent alarm conditions.
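A sketch of failure-mode tests for a hypothetical pressure-alarm rule, assuming a fail-safe design in which missing or physically impossible readings raise the alarm (thresholds and ranges are illustrative):

```python
import pytest

def pressure_alarm(reading):
    # Fail safe (design assumption): a missing or out-of-range reading
    # raises the alarm rather than being treated as "no alarm".
    if reading is None or not (0.0 <= reading <= 10.0):
        return True
    return reading > 8.5  # hypothetical alarm threshold in bar

@pytest.mark.parametrize("reading,expected", [
    (None, True),   # sensor offline
    (-1.0, True),   # below physical range: sensor fault
    (99.0, True),   # above physical range: sensor fault
    (8.5, False),   # at the threshold under a strict ">" reading
    (8.6, True),    # just above the threshold
])
def test_alarm_under_sensor_failure_modes(reading, expected):
    assert pressure_alarm(reading) is expected
```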
Basic Implementation — Every deployed rule has at least one positive and one negative test case. Tests are executed manually or semi-automatically before policy changes. Test results are documented. Boundary-value tests exist for rules with numeric thresholds. Rule interactions are not systematically tested.
Intermediate Implementation — Automated test suite integrated into the CI/CD pipeline. Boundary-value tests are generated automatically for all thresholds. An interaction matrix identifies rule pairs that can co-apply, and each pair has at least one test. Mutation testing is performed periodically with a target detection rate of 90%. Coverage reports are generated automatically. Test expectations are reviewed against the approved policy on every change.
Advanced Implementation — All intermediate capabilities plus: property-based testing generates randomised inputs to discover unexpected behaviours. Mutation detection rate exceeds 95%. Edge-case enumeration covers temporal, type, and domain-specific boundaries. Test suite adequacy is independently verified. Coverage is tracked at the boundary level (not just the rule level), with 100% boundary coverage. The test suite itself is version-controlled alongside the policy, with traceability between rule changes and test changes.
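A plain-Python illustration of the mutation check in requirement 4.7 (real tools such as mutmut automate mutant generation; the rule and mutants here are hand-rolled for clarity):

```python
def make_rule(threshold=750, invert=False):
    # Factory for the rule under test and for deliberately faulty
    # mutants: a shifted threshold or inverted logic.
    def rule(amount):
        result = amount > threshold
        return (not result) if invert else result
    return rule

def suite_passes(rule):
    # Re-runs the rule's test suite and reports pass/fail.
    try:
        assert rule(751) is True
        assert rule(750) is False
        assert rule(400) is False
        return True
    except AssertionError:
        return False

def test_suite_detects_inverted_logic():
    assert suite_passes(make_rule()) is True              # healthy rule
    assert suite_passes(make_rule(invert=True)) is False  # mutant caught

def test_suite_detects_shifted_threshold():
    # The boundary test at 751 catches a threshold moved to 800; a
    # suite that only tested at 400 and 900 would let this mutant
    # survive, which is exactly what the detection rate measures.
    assert suite_passes(make_rule(threshold=800)) is False
```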
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Rule Coverage Completeness
Test 8.2: Interaction Coverage
Test 8.3: Pre-Deployment Gate Enforcement
Test 8.4: Mutation Detection Adequacy
Test 8.5: Boundary-Value Precision
Test 8.6: Test Expectation Integrity
Test 8.7: Edge-Case Coverage
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Supports compliance |
| FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement |
| PRA SS1/23 | Model Risk Management | Supports compliance |
| NIST AI RMF | MANAGE 2.2, MEASURE 2.5 | Supports compliance |
| ISO 42001 | Clause 8.4 (AI System Development), Clause 9.1 (Monitoring) | Supports compliance |
| IEC 61508 | Part 3 (Software Requirements) | Supports compliance |
Article 9 requires that risk management measures be "tested with a view to identifying the most appropriate risk management measures." Policy rules are risk management measures — they define the boundaries of acceptable agent behaviour. Testing those rules is a direct requirement of Article 9. The regulation's emphasis on testing "with a view to identifying the most appropriate measures" implies that testing should be comprehensive enough to reveal deficiencies, mapping to the mutation testing and edge-case requirements at higher maturity levels.
Article 15 requires that high-risk AI systems achieve appropriate levels of accuracy and robustness. For rule-based policy systems, accuracy means the rules produce correct decisions. Robustness means the rules produce correct decisions across the full range of inputs, including edge cases. Rule-test coverage directly supports both requirements.
The PRA's supervisory statement extends model risk management expectations to rule-based systems that inform material decisions. Testing is a core expectation: "Firms should test models rigorously across a range of scenarios." For policy rules, this includes boundary testing, interaction testing, and stress testing under unusual conditions.
IEC 61508 Part 3 specifies software testing requirements for safety-critical systems, including structural coverage, boundary-value testing, and fault injection. For AI agents operating in safety-critical domains, policy rule testing should meet the relevant Safety Integrity Level (SIL) testing requirements.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | All decisions involving the untested or inadequately tested rule — potentially organisation-wide |
Consequence chain: Inadequate rule-test coverage allows policy errors to reach production undetected. The immediate technical failure is a policy rule that produces incorrect outputs for certain inputs. The operational impact depends on the frequency of those inputs: a boundary error at a common threshold may affect thousands of decisions per day; an edge case in leap-year date handling may affect decisions only once every 4 years but create a concentrated burst of failures. The regulatory consequence is a control failure — the organisation deployed a policy control without adequate testing, which is a systems and controls deficiency regardless of whether the control actually failed. In financial services, the FCA has fined firms for inadequate testing of automated controls (typical penalties in the range of £1.5 million to £8 million for systems and controls failures). In healthcare, untested clinical rules can cause patient harm. In safety-critical domains, untested safety rules can create conditions outside the safety envelope. The consequence compounds with the number of untested rules: an organisation with 340 rules and 20% untested has 68 potential undetected errors in production.
Cross-references: AG-270 (Policy Compilation Verification Governance) covers the initial verification that compiled rules match the source policy; AG-271 extends this to ongoing test coverage. AG-272 (Exception Precedence Governance) defines the precedence rules that interaction tests must verify. AG-269 (Policy Version Pinning Governance) provides the version identifier that links test results to specific policy versions. AG-275 (Policy Simulation Sandbox Governance) provides the environment for executing tests against realistic scenarios. AG-134 (Machine-Checkable Policy Semantics) enables automated test generation from formal policy specifications. AG-135 (Policy Precedence and Conflict Arbitration) defines how rule conflicts are resolved, which interaction tests must validate. AG-138 (High-Assurance Invariant Verification) provides formal methods applicable to verifying test completeness.