Rule-Test Coverage Governance requires that every policy rule deployed in production has a structured, maintained test suite covering its logic, edge cases, boundary conditions, and interactions with other rules including precedence conflicts. Test coverage is not a one-time compilation check (which AG-270 addresses) but an ongoing operational discipline ensuring that the rule set continues to behave correctly as rules are added, modified, and removed over time. This dimension mandates minimum coverage thresholds, edge-case enumeration, precedence-conflict testing, and regression testing for every rule change, treating policy test coverage with the same rigour as unit test coverage in safety-critical software.
Scenario A — Untested Rule Interaction Creates Eligibility Gap: A customer-facing agent uses 87 policy rules to determine product eligibility. Rule 34 states: "Customers aged 18-25 qualify for the starter tier." Rule 62 states: "Customers with income below £15,000 are restricted to basic products only." Neither rule is tested in combination. A 22-year-old customer with income of £12,000 is simultaneously eligible for the starter tier (Rule 34) and restricted to basic products (Rule 62). The rule engine resolves this by evaluating rules in numerical order, so Rule 62 overrides Rule 34 — but this precedence was never tested or documented. When Rule 34 is later reordered to execute after Rule 62 during a refactoring, the behaviour flips and the eligibility gap is not detected for 6 weeks.
What went wrong: Individual rules were tested in isolation but rule interactions were not tested. The implicit precedence (evaluation order) was not covered by any test. When evaluation order changed, the test suite did not catch the behavioural change. Consequence: 6 weeks of incorrect eligibility determinations affecting approximately 1,400 customers, regulatory complaint, remediation cost estimated at £420,000.
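A minimal sketch of the interaction test that would have caught this (the engine, rule functions, and tier labels below are hypothetical; the scenario does not specify the implementation). The point is that the test pins the resolved outcome for an input where both rules fire, so a reordering changes a test result instead of production behaviour:

```python
# Hypothetical last-write-wins engine: rules run in list order, and a
# later rule may overwrite the tier set by an earlier one.
def determine_eligibility(age, income, rules):
    tier = "none"
    for rule in rules:
        result = rule(age, income)
        if result is not None:
            tier = result
    return tier

def rule_34_starter_tier(age, income):
    return "starter" if 18 <= age <= 25 else None

def rule_62_low_income_basic(age, income):
    return "basic" if income < 15_000 else None

# Interaction test: pins the outcome when BOTH rules apply. If a
# refactoring reorders the rules, this test fails immediately instead
# of the eligibility gap surfacing in production six weeks later.
def test_low_income_young_customer_restricted_to_basic():
    rules = [rule_34_starter_tier, rule_62_low_income_basic]
    assert determine_eligibility(age=22, income=12_000, rules=rules) == "basic"
```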
Scenario B — Edge Case in Date Handling Causes Policy Failure on Leap Year: A financial-value agent applies a policy rule: "Transactions initiated within 30 days of account opening require enhanced due diligence." The test suite includes tests for day 1, day 15, day 29, day 30, and day 31. No test covers the case where the 30-day window spans 28 February in a leap year. On 29 February 2028, the date arithmetic produces a boundary error: accounts opened on 1 February have their 30-day window calculated as ending on 2 March in non-leap years but 1 March in leap years. The test suite never exercises this case. 23 transactions on 29 February bypass enhanced due diligence.
What went wrong: The test suite covered the obvious boundary (day 30/31) but not the calendar edge case. The date arithmetic bug existed since deployment but only manifested on leap year dates. Consequence: 23 transactions processed without required due diligence, potential anti-money laundering compliance failure, mandatory suspicious activity report review.
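The scenario does not give the faulty implementation, but a plausible shape of the fix is explicit day counting with `timedelta` rather than month-based arithmetic, plus a regression test pinned at the leap date (all names below are illustrative):

```python
from datetime import date, timedelta

def within_edd_window(opened: date, txn: date, days: int = 30) -> bool:
    # Day counting with the opening day as day 1: day 30 for an account
    # opened on 1 February falls on 2 March in a non-leap year and on
    # 1 March in a leap year, matching the policy's "within 30 days".
    return opened <= txn <= opened + timedelta(days=days - 1)

def test_leap_day_falls_inside_window():
    # 29 February 2028 is day 29 for an account opened on 1 February
    # 2028, so enhanced due diligence must still apply.
    assert within_edd_window(date(2028, 2, 1), date(2028, 2, 29))

def test_first_day_after_window_is_outside():
    # Day 31 (2 March 2028 in the leap year) is outside the window.
    assert not within_edd_window(date(2028, 2, 1), date(2028, 3, 2))
```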
Scenario C — Regression After "Minor" Rule Update: An enterprise workflow agent has a rule that requires manager approval for expenses above £500. A policy update changes the threshold to £750. The developer updates the threshold value and runs the existing test suite — all tests pass. However, the existing test suite only tests at £400 (below) and £600 (above). The updated threshold at £750 means the £600 test case now falls below the threshold, which changes the expected behaviour, but the test expectation was not updated. The test still expects "requires approval" for £600, which now fails — but the developer "fixes" this by updating the test expectation without questioning whether the behaviour change is correct. No test exists for the £750 boundary itself.
What went wrong: The test suite did not include boundary-value tests at the actual threshold, and the "fix" to the test was backward: the developer changed the test to match the code rather than verifying the code against the approved policy. Consequence: behaviour at and around the new £750 boundary is entirely unverified. Nothing confirms that the implementation enforces the approved threshold at exactly £750, and any divergence between code and policy across the widened £500-£750 band goes undetected until the next audit, because the only check that could have caught it was rewritten to agree with the code.
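The missing tests are straightforward to write. A sketch of boundary-value tests at the actual threshold (the rule function is hypothetical, and the strict-greater-than reading of "above £750" is an assumption to be confirmed against the policy text):

```python
import pytest

APPROVAL_THRESHOLD = 750  # must match the approved policy document

def requires_manager_approval(amount):
    return amount > APPROVAL_THRESHOLD

# Boundary-value tests at the threshold itself, not only at points far
# from it. If the policy actually means "£750 or more", the expectation
# at 750.00 flips, and that decision must be made against the policy
# text, never against whatever the code happens to do.
@pytest.mark.parametrize("amount,expected", [
    (749.99, False),  # just below the boundary
    (750.00, False),  # exactly at the boundary
    (750.01, True),   # just above the boundary
])
def test_approval_boundary(amount, expected):
    assert requires_manager_approval(amount) is expected
```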
Scope: This dimension applies to all AI agents governed by policy rules that are testable — meaning the rules accept defined inputs and produce deterministic outputs. This includes rule engines, decision tables, policy-as-code systems, scoring models with policy thresholds, and any system where policy logic can be exercised with controlled inputs. Machine learning models that produce probabilistic outputs are partially within scope: the policy thresholds applied to model outputs (e.g., "reject if fraud score > 0.85") are testable even if the model itself is not deterministic. The scope extends to the full lifecycle of the rule set — initial deployment, ongoing operation, and every change.
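To illustrate the scope note on probabilistic models: the deterministic threshold can be tested by fixing the score and exercising the comparison directly, without invoking the model at all (a sketch; the function name and the strict-inequality reading are assumptions):

```python
def should_reject(fraud_score, threshold=0.85):
    # The policy threshold is deterministic even though the upstream
    # model is not; the model is simply not invoked here.
    return fraud_score > threshold

def test_fraud_threshold_boundary():
    assert should_reject(0.86)
    assert not should_reject(0.85)  # "> 0.85" read as strictly greater
    assert not should_reject(0.84)
```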
4.1. A conforming system MUST maintain a structured test suite for every deployed policy rule, covering: at least one positive case (input that triggers the rule), at least one negative case (input that should not trigger the rule), and boundary-value tests at every decision threshold.
4.2. A conforming system MUST test rule interactions where two or more rules can apply to the same input, verifying that the precedence resolution produces the correct outcome per AG-272 and AG-135.
4.3. A conforming system MUST execute the full test suite before any policy change is activated in production, and block activation if any test fails.
4.4. A conforming system MUST achieve 100% rule-test coverage: every deployed rule has at least one test. Coverage of 100% of decision boundaries and 100% of documented rule interactions is likewise a MUST.
4.5. A conforming system MUST update tests when rules change, ensuring that test expectations reflect the approved policy — not the other way around.
4.6. A conforming system SHOULD implement edge-case enumeration for each rule, covering: type boundaries (null, empty, maximum values), temporal boundaries (leap years, daylight saving transitions, midnight, end-of-month), and domain-specific boundaries (currency conversion edge cases, jurisdiction boundary cases).
4.7. A conforming system SHOULD implement mutation testing to verify that the test suite detects injected faults — if inverting a rule's logic does not cause a test failure, the test suite is inadequate.
4.8. A conforming system SHOULD generate coverage reports showing which rules, boundaries, and interactions are tested and which are not, with automated alerts when coverage drops below the threshold.
4.9. A conforming system MAY implement property-based testing for policy rules, generating randomised inputs to discover unexpected rule behaviours not anticipated by manually authored tests.
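A minimal sketch of requirement 4.9 using the Hypothesis library (one possible tool, not a mandated one; the rule is the hypothetical expense threshold from Scenario C). The property asserted is monotonicity: if an amount requires approval, every larger amount must too.

```python
from hypothesis import given, strategies as st

APPROVAL_THRESHOLD = 750

def requires_manager_approval(amount):
    return amount > APPROVAL_THRESHOLD

# Randomised inputs can surface behaviours (negative amounts, very
# large values, awkward floats) that manually authored cases missed.
@given(st.floats(min_value=0, max_value=1e9, allow_nan=False))
def test_approval_is_monotonic_in_amount(amount):
    if requires_manager_approval(amount):
        assert requires_manager_approval(amount + 1)
```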
Test coverage for policy rules serves a fundamentally different purpose from test coverage for application software. In application software, untested code may contain bugs that affect functionality. In policy rules, untested rules may contain errors that affect compliance, safety, and legal liability. A policy rule that produces the wrong output is not a software bug — it is a governance failure that may result in regulatory enforcement, customer harm, or safety incidents.
The interaction between rules is where most production failures occur. Individual rules tested in isolation often work correctly. But when two rules apply to the same input, the interaction (through precedence, exception handling, or evaluation order) can produce outcomes that neither rule's individual tests anticipate. AG-271 specifically requires interaction testing (Requirement 4.2) because this is where real-world failures most commonly arise; one systematic approach is sketched below.
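One way to make interaction coverage systematic is to enumerate rule pairs whose applicability predicates can hold simultaneously, then require a registered interaction test for each such pair. A sketch, with hypothetical rule predicates, probe inputs, and test registry:

```python
from itertools import combinations

# Each rule exposes an "applies" predicate over a customer record.
RULES = {
    "rule_34_starter_tier": lambda c: 18 <= c["age"] <= 25,
    "rule_62_low_income_basic": lambda c: c["income"] < 15_000,
}

PROBES = [
    {"age": 22, "income": 12_000},  # both apply: the Scenario A case
    {"age": 40, "income": 12_000},  # only rule 62
    {"age": 22, "income": 50_000},  # only rule 34
]

def co_applying_pairs(rules, probes):
    pairs = set()
    for (a, pred_a), (b, pred_b) in combinations(rules.items(), 2):
        if any(pred_a(c) and pred_b(c) for c in probes):
            pairs.add((a, b))
    return pairs

# Governance check: every pair that can co-apply must have a registered
# interaction test (the registry below is hypothetical).
INTERACTION_TESTS = {("rule_34_starter_tier", "rule_62_low_income_basic")}

def test_every_co_applying_pair_has_an_interaction_test():
    assert co_applying_pairs(RULES, PROBES) <= INTERACTION_TESTS
```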
Boundary-value testing is mandatory (not merely recommended) because policy thresholds are where errors have the greatest impact. A rule that works correctly for inputs far from the boundary but fails at the boundary will affect exactly the cases where the decision is most consequential — the marginal cases where the outcome could go either way. These are also the cases most likely to be challenged by customers, regulators, or courts.
The requirement that tests must be updated when rules change (4.5) addresses the anti-pattern observed in Scenario C: developers who change test expectations to match code behaviour rather than verifying code behaviour against policy intent. The test suite is a second representation of the policy intent. When a rule changes, the correct workflow is: update the rule, update the test expectations to reflect the new approved policy, then verify that the rule matches the updated expectations. Changing the test to match the code defeats the purpose of testing.
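One mechanical defence against the backward fix is to derive test expectations from the approved policy artefact rather than hard-coding them next to the code. A sketch, assuming the approved threshold is published as a reviewed, access-controlled JSON artefact (the file name and key are hypothetical):

```python
import json

APPROVAL_THRESHOLD = 750  # the value embedded in the code under test

def load_approved_threshold(path="approved_policy.json"):
    # The approved policy artefact is the source of truth; test
    # expectations are derived from it, never from the code under test.
    with open(path) as f:
        return json.load(f)["expense_approval_threshold"]

def test_code_threshold_matches_approved_policy():
    # A developer who changes the code threshold without a corresponding
    # change to the approved policy artefact breaks this test, and
    # "fixing" it requires touching the artefact, which is reviewed
    # under the normal policy-change process.
    assert APPROVAL_THRESHOLD == load_approved_threshold()
```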
Recommended patterns:
Anti-patterns to avoid:
Financial Services. Regulatory rules often have complex interactions — anti-money laundering thresholds interact with customer due diligence rules, which interact with product eligibility rules. The FCA expects firms to test controls "end to end," which for policy rules means testing rule interactions, not just individual rules. The PRA's SS1/23 on model risk management extends testing expectations to rule-based decision systems.
Healthcare. Clinical decision support rules must be tested against clinical scenarios, not just technical inputs. A dosage calculation rule must be tested with realistic patient parameters including extreme values (paediatric, geriatric, renal impairment). The test suite should include cases derived from known adverse events and near-misses.
Critical Infrastructure. Safety-critical policy rules must be tested under failure conditions — what happens when sensor inputs are missing, delayed, or out of range? Edge-case testing for safety rules must include sensor failure modes, communication delays, and concurrent alarm conditions.
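A sketch of failure-mode tests for a hypothetical pressure-alarm rule, assuming a fail-safe design in which missing or physically impossible readings raise the alarm (thresholds and ranges are illustrative):

```python
import pytest

def pressure_alarm(reading):
    # Fail safe (design assumption): a missing or out-of-range reading
    # raises the alarm rather than being treated as "no alarm".
    if reading is None or not (0.0 <= reading <= 10.0):
        return True
    return reading > 8.5  # hypothetical alarm threshold in bar

@pytest.mark.parametrize("reading,expected", [
    (None, True),   # sensor offline
    (-1.0, True),   # below physical range: sensor fault
    (99.0, True),   # above physical range: sensor fault
    (8.5, False),   # at the threshold under a strict ">" reading
    (8.6, True),    # just above the threshold
])
def test_alarm_under_sensor_failure_modes(reading, expected):
    assert pressure_alarm(reading) is expected
```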
Basic Implementation — Every deployed rule has at least one positive and one negative test case. Tests are executed manually or semi-automatically before policy changes. Test results are documented. Boundary-value tests exist for rules with numeric thresholds. Rule interactions are not systematically tested.
Intermediate Implementation — Automated test suite integrated into the CI/CD pipeline. Boundary-value tests are generated automatically for all thresholds. An interaction matrix identifies rule pairs that can co-apply, and each pair has at least one test. Mutation testing is performed periodically with a target detection rate of 90%. Coverage reports are generated automatically. Test expectations are reviewed against the approved policy on every change.
Advanced Implementation — All intermediate capabilities plus: property-based testing generates randomised inputs to discover unexpected behaviours. Mutation detection rate exceeds 95%. Edge-case enumeration covers temporal, type, and domain-specific boundaries. Test suite adequacy is independently verified. Coverage is tracked at the boundary level (not just the rule level), with 100% boundary coverage. The test suite itself is version-controlled alongside the policy, with traceability between rule changes and test changes.
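A plain-Python illustration of the mutation check in requirement 4.7 (real tools such as mutmut automate mutant generation; the rule and mutants here are hand-rolled for clarity):

```python
def make_rule(threshold=750, invert=False):
    # Factory for the rule under test and for deliberately faulty
    # mutants: a shifted threshold or inverted logic.
    def rule(amount):
        result = amount > threshold
        return (not result) if invert else result
    return rule

def suite_passes(rule):
    # Re-runs the rule's test suite and reports pass/fail.
    try:
        assert rule(751) is True
        assert rule(750) is False
        assert rule(400) is False
        return True
    except AssertionError:
        return False

def test_suite_detects_inverted_logic():
    assert suite_passes(make_rule()) is True              # healthy rule
    assert suite_passes(make_rule(invert=True)) is False  # mutant caught

def test_suite_detects_shifted_threshold():
    # The boundary test at 751 catches a threshold moved to 800; a
    # suite that only tested at 400 and 900 would let this mutant
    # survive, which is exactly what the detection rate measures.
    assert suite_passes(make_rule(threshold=800)) is False
```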
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Rule Coverage Completeness
Test 8.2: Interaction Coverage
Test 8.3: Pre-Deployment Gate Enforcement
Test 8.4: Mutation Detection Adequacy
Test 8.5: Boundary-Value Precision
Test 8.6: Test Expectation Integrity
Test 8.7: Edge-Case Coverage
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 15 (Accuracy, Robustness, Cybersecurity) | Supports compliance |
| FCA SYSC | 6.1.1R (Systems and Controls) | Direct requirement |
| PRA SS1/23 | Model Risk Management | Supports compliance |
| NIST AI RMF | MANAGE 2.2, MEASURE 2.5 | Supports compliance |
| ISO 42001 | Clause 8.4 (AI System Development), Clause 9.1 (Monitoring) | Supports compliance |
| IEC 61508 | Part 3 (Software Requirements) | Supports compliance |
Article 9 requires that risk management measures be "tested with a view to identifying the most appropriate risk management measures." Policy rules are risk management measures — they define the boundaries of acceptable agent behaviour. Testing those rules is a direct requirement of Article 9. The regulation's emphasis on testing "with a view to identifying the most appropriate measures" implies that testing should be comprehensive enough to reveal deficiencies, mapping to the mutation testing and edge-case requirements at higher maturity levels.
Article 15 requires that high-risk AI systems achieve appropriate levels of accuracy and robustness. For rule-based policy systems, accuracy means the rules produce correct decisions. Robustness means the rules produce correct decisions across the full range of inputs, including edge cases. Rule-test coverage directly supports both requirements.
The PRA's supervisory statement extends model risk management expectations to rule-based systems that inform material decisions. Testing is a core expectation: "Firms should test models rigorously across a range of scenarios." For policy rules, this includes boundary testing, interaction testing, and stress testing under unusual conditions.
IEC 61508 Part 3 specifies software testing requirements for safety-critical systems, including structural coverage, boundary-value testing, and fault injection. For AI agents operating in safety-critical domains, policy rule testing should meet the relevant Safety Integrity Level (SIL) testing requirements.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | All decisions involving the untested or inadequately tested rule — potentially organisation-wide |
Consequence chain: Inadequate rule-test coverage allows policy errors to reach production undetected. The immediate technical failure is a policy rule that produces incorrect outputs for certain inputs. The operational impact depends on the frequency of those inputs: a boundary error at a common threshold may affect thousands of decisions per day; an edge case in leap-year date handling may affect decisions only once every 4 years but create a concentrated burst of failures. The regulatory consequence is a control failure — the organisation deployed a policy control without adequate testing, which is a systems and controls deficiency regardless of whether the control actually failed. In financial services, the FCA has fined firms for inadequate testing of automated controls (typical penalties in the range of £1.5 million to £8 million for systems and controls failures). In healthcare, untested clinical rules can cause patient harm. In safety-critical domains, untested safety rules can create conditions outside the safety envelope. The consequence compounds with the number of untested rules: an organisation with 340 rules and 20% untested has 68 potential undetected errors in production.
Cross-references: AG-270 (Policy Compilation Verification Governance) covers the initial verification that compiled rules match the source policy; AG-271 extends this to ongoing test coverage. AG-272 (Exception Precedence Governance) defines the precedence rules that interaction tests must verify. AG-269 (Policy Version Pinning Governance) provides the version identifier that links test results to specific policy versions. AG-275 (Policy Simulation Sandbox Governance) provides the environment for executing tests against realistic scenarios. AG-134 (Machine-Checkable Policy Semantics) enables automated test generation from formal policy specifications. AG-135 (Policy Precedence and Conflict Arbitration) defines how rule conflicts are resolved, which interaction tests must validate. AG-138 (High-Assurance Invariant Verification) provides formal methods applicable to verifying test completeness.