Content Enforcement Consistency Governance requires that moderation decisions — content removals, account warnings, temporary suspensions, permanent bans, and visibility restrictions — be applied uniformly across all moderators, models, and enforcement pipelines: materially identical content in materially identical contexts must receive materially identical outcomes, regardless of which moderator or model processes the case. Inconsistent enforcement undermines user trust; creates exploitable arbitrage opportunities, where bad actors resubmit violating content until a more lenient moderator or model processes it; exposes platforms to discrimination claims when enforcement disparities correlate with protected characteristics; and renders community guidelines functionally meaningless when identical posts receive contradictory outcomes. This dimension mandates the measurement, monitoring, and remediation of inter-moderator and inter-model enforcement variance, establishing quantitative consistency thresholds and requiring root cause analysis when variance exceeds acceptable bounds.
Scenario A — Inter-Model Inconsistency Creates Enforcement Arbitrage: A social media platform with 280 million monthly active users operates three content moderation models deployed across regional inference clusters: Model-A serves North American traffic, Model-B serves European traffic, and Model-C serves Asia-Pacific traffic. All three models were trained against the same community guidelines but on different fine-tuning datasets reflecting regional content samples. A coordinated harassment campaign targets a public figure using identical text-and-image posts submitted from accounts across all three regions. Model-A classifies 94% of the posts as violating the harassment policy and removes them within 12 minutes. Model-B classifies 78% as violating and removes them within 18 minutes. Model-C classifies only 41% as violating and removes them within 35 minutes. Bad actors quickly discover the inconsistency and route identical harassment content through Asia-Pacific VPN endpoints, exploiting Model-C's lower detection rate. Over 72 hours, 14,200 harassing posts remain visible in the Asia-Pacific cluster — posts that would have been removed within minutes in North America. The targeted individual documents the discrepancy and publishes a thread showing identical posts with different enforcement outcomes, which is viewed 4.7 million times. The platform faces a parliamentary inquiry in Australia, a £3.2 million regulatory fine under the UK Online Safety Act for inconsistent enforcement of its own standards, and advertiser withdrawals totalling £18 million in the subsequent quarter.
What went wrong: The platform deployed regionally segmented models without measuring inter-model agreement on identical content. No consistency benchmark existed that would have detected the 94% vs. 41% enforcement gap before the campaign exposed it. The fine-tuning process introduced regional bias without a consistency validation step. No mechanism existed to detect that identical content was receiving materially different enforcement outcomes across models. Consequence: £3.2 million regulatory fine, £18 million in lost advertising revenue, parliamentary inquiry, severe reputational harm, and direct harm to the targeted individual who was subjected to sustained harassment that should have been uniformly removed.
Scenario B — Inter-Moderator Variance Produces Discriminatory Outcomes: A marketplace platform employs 340 human moderators and 4 AI moderation models to review seller listings flagged for potential policy violations. A quarterly consistency audit reveals that the removal rate for listings flagged as "counterfeit goods" varies by moderator cohort: Moderator Cohort A (110 moderators, primarily reviewing English-language listings) removes 67% of flagged listings. Moderator Cohort B (85 moderators, reviewing listings in Spanish and Portuguese) removes 52% of flagged listings. Moderator Cohort C (145 moderators, reviewing listings in Arabic, Turkish, and Urdu) removes 83% of flagged listings. An independent review by a calibrated panel of a stratified sample of 2,400 listings drawn across all three cohorts finds that the objectively justified removal rate should be approximately 61-65% across all language groups, with no statistically significant variation by language. The disparity means that sellers posting in Arabic, Turkish, and Urdu are roughly 24% more likely (83% versus 67%) to have flagged listings removed than sellers posting in English for equivalent content. Over 6 months, approximately 3,100 listings from Arabic-, Turkish-, and Urdu-language sellers were removed that would not have been removed had they been reviewed under the same standards applied to English-language sellers. The platform faces a class-action discrimination lawsuit from affected sellers, a regulatory investigation under the EU Digital Services Act for failing to apply terms of service in a non-discriminatory manner, and a £5.8 million settlement.
What went wrong: The platform did not measure inter-moderator consistency across language cohorts. The 83% vs. 52% removal rate variance was not detected because no cross-cohort calibration programme existed. Moderator training materials were translated but not calibrated across languages — the same guideline produced different enforcement thresholds due to cultural interpretation differences and training dataset imbalances. No consistency metric was computed or monitored. The enforcement disparity correlated with language (and by proxy, ethnicity and national origin), creating a disparate impact that was both a fairness violation and a legal liability. Consequence: £5.8 million settlement, regulatory investigation, remediation programme costing £2.1 million, and loss of trust among sellers in affected language communities.
Scenario C — Temporal Inconsistency After Model Update: A video-sharing platform deploys a new version of its content moderation model (v4.2) to replace the existing model (v4.1). The new model was validated against the platform's internal benchmark suite and showed a 3.1% improvement in overall accuracy. However, the benchmark suite did not include consistency testing against the prior model's decisions on active enforcement cases. Within 48 hours of deployment, the platform's appeal queue increases by 340%. Investigation reveals that v4.2 reclassifies 12.7% of content that v4.1 had permitted as now-violating, and 8.3% of content that v4.1 had removed as now-permitted. Users who had posted content that was approved under v4.1 receive retroactive warnings and strikes under v4.2 when the content is re-scanned. Simultaneously, content that had been correctly removed under v4.1 is restored by v4.2's automated re-review, including 230 posts containing graphic violence and 47 posts containing non-consensual intimate imagery. The platform receives 4,200 user complaints in 72 hours, the press covers the re-surfacing of harmful content, and two regulatory authorities open inquiries. The platform rolls back to v4.1 at an operational cost of £890,000 and a reputational cost that is difficult to quantify.
What went wrong: Model update validation focused on aggregate accuracy without measuring decision consistency against the prior model's enforcement record. No consistency impact assessment was performed before deployment. The 12.7% reclassification rate — meaning roughly 1 in 8 previously permitted posts would now be flagged — should have been detected in pre-deployment testing. No transition protocol existed to handle cases where the new model disagreed with the old model's prior decisions. The rollback revealed that the platform had no model consistency governance framework. Consequence: £890,000 rollback cost, re-surfacing of 277 harmful content items, 4,200 user complaints, two regulatory inquiries, and severe trust erosion among creators and users.
Scope: This dimension applies to every deployment where content moderation decisions — including removal, warning, restriction, labelling, demotion, suspension, or ban — are made by multiple moderators (human or AI) or multiple models operating across regions, languages, time periods, or content queues. The scope covers both automated moderation (model-only enforcement) and human-in-the-loop moderation (model recommendation with human review). The scope extends to any system where a model update, configuration change, or moderator roster change could alter enforcement outcomes for materially identical content. It includes consistency across moderator cohorts, across model versions, across regional deployments, across language-specific pipelines, and across temporal boundaries (pre- and post-update consistency). The scope excludes legitimate jurisdictional variation where different legal regimes require different enforcement thresholds — but such variation must be explicitly documented and attributable to a specific legal requirement, not to uncontrolled model or moderator variance.
4.1. A conforming system MUST define and publish quantitative consistency metrics that measure inter-moderator agreement and inter-model agreement on enforcement decisions, using established statistical measures (such as Cohen's kappa, Fleiss' kappa, Krippendorff's alpha, or percentage agreement with chance correction) appropriate to the number of raters and decision categories.
4.2. A conforming system MUST establish minimum consistency thresholds for each enforcement category (removal, warning, restriction, ban, etc.), documented in the governance configuration, with thresholds set at or above a Cohen's kappa of 0.70 for binary enforcement decisions and a Krippendorff's alpha of 0.67 for multi-category enforcement decisions, or equivalent industry-standard thresholds with documented justification for alternative values.
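To make the 4.1 metric and 4.2 threshold concrete, the following dependency-free Python sketch computes Cohen's kappa for two raters and checks it against the 0.70 binary-decision threshold. It is illustrative rather than normative; function and variable names are not mandated by this dimension.

```python
from collections import Counter

def cohens_kappa(decisions_a: list[str], decisions_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected if each rater labelled at its
    own marginal rates independently of the other.
    """
    if len(decisions_a) != len(decisions_b) or not decisions_a:
        raise ValueError("both raters must label the same non-empty case set")
    n = len(decisions_a)
    p_o = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    marg_a, marg_b = Counter(decisions_a), Counter(decisions_b)
    p_e = sum(marg_a[c] * marg_b[c] for c in marg_a.keys() | marg_b.keys()) / (n * n)
    if p_e == 1.0:  # degenerate case: a single label used by both raters throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two models labelling the same eight flagged posts
model_a = ["remove", "remove", "allow", "remove", "allow", "allow", "remove", "remove"]
model_b = ["remove", "allow", "allow", "remove", "allow", "remove", "remove", "remove"]
kappa = cohens_kappa(model_a, model_b)
print(f"kappa = {kappa:.2f}; meets 0.70 threshold: {kappa >= 0.70}")
```

Note that raw percentage agreement in this example is 75%, yet kappa is only 0.47: this is why 4.1 requires chance correction, since uncorrected agreement flatters raters whose marginal rates happen to coincide.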
4.3. A conforming system MUST implement continuous consistency monitoring that measures enforcement agreement at defined intervals — at minimum monthly — across all active moderators and models, with automated alerting when consistency falls below the defined thresholds.
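A minimal sketch of the 4.3 monitoring loop follows, reusing `cohens_kappa` from the sketch above. The `alert_fn` callable and the per-category threshold table are placeholders for whatever alerting infrastructure and governance configuration the platform actually operates.

```python
THRESHOLDS = {"removal": 0.70, "warning": 0.70, "restriction": 0.70}  # from governance config

def run_monthly_consistency_check(paired_decisions, alert_fn):
    """paired_decisions maps an enforcement category to a pair of aligned
    decision lists (model vs. model, or moderator vs. moderator) drawn from
    cases both raters handled during the measurement window (4.3)."""
    breaches = []
    for category, (rater_a, rater_b) in paired_decisions.items():
        kappa = cohens_kappa(rater_a, rater_b)
        if kappa < THRESHOLDS[category]:
            breaches.append((category, kappa))
            alert_fn(f"consistency breach: {category} kappa={kappa:.2f} "
                     f"below threshold {THRESHOLDS[category]}")
    return breaches
```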
4.4. A conforming system MUST conduct inter-rater calibration exercises at defined intervals — at minimum quarterly — where a common set of at least 200 test cases (spanning all policy categories and content types) is independently reviewed by all moderators and models, and the resulting agreement scores are computed and compared against thresholds.
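For calibration exercises with more than two raters, Fleiss' kappa is one of the measures 4.1 names. A self-contained sketch, assuming every calibration case is labelled by the same number of raters:

```python
from collections import Counter

def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Agreement among many raters. ratings[i] holds every rater's label
    for calibration case i; every case must have the same rater count."""
    n_cases, n_raters = len(ratings), len(ratings[0])
    totals: Counter[str] = Counter()
    p_bar = 0.0
    for case in ratings:
        counts = Counter(case)
        totals.update(counts)
        # proportion of rater pairs that agree on this case
        p_bar += (sum(k * k for k in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_bar /= n_cases
    grand_total = n_cases * n_raters
    p_e = sum((t / grand_total) ** 2 for t in totals.values())
    if p_e == 1.0:  # degenerate case: one label used throughout
        return 1.0
    return (p_bar - p_e) / (1 - p_e)

# Three raters, four calibration cases
print(fleiss_kappa([
    ["remove", "remove", "remove"],
    ["remove", "remove", "allow"],
    ["allow", "allow", "allow"],
    ["allow", "remove", "allow"],
]))
```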
4.5. A conforming system MUST perform a consistency impact assessment before deploying any new model version, model configuration change, or material change to moderation guidelines, measuring the new model's or guideline's enforcement decisions against a held-out set of at least 500 cases with known prior decisions, and documenting the reclassification rate and any categories where enforcement direction changes.
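The reclassification measurement in 4.5 reduces to a straightforward comparison of decision records. In the sketch below, the 5% deployment gate is illustrative only; the actual threshold belongs in the governance configuration.

```python
def consistency_impact_assessment(prior: dict, candidate: dict,
                                  max_flip_rate: float = 0.05) -> dict:
    """Compare a candidate model's decisions on a held-out case set
    (at least 500 cases per 4.5) against the prior model's recorded
    decisions. Both arguments map case_id -> 'remove' or 'allow'."""
    case_ids = prior.keys() & candidate.keys()
    newly_violating = sum(1 for c in case_ids
                          if prior[c] == "allow" and candidate[c] == "remove")
    newly_permitted = sum(1 for c in case_ids
                          if prior[c] == "remove" and candidate[c] == "allow")
    n = len(case_ids)
    flip_rate = (newly_violating + newly_permitted) / n
    return {
        "cases_compared": n,
        "newly_violating_rate": newly_violating / n,   # Scenario C: 12.7%
        "newly_permitted_rate": newly_permitted / n,   # Scenario C: 8.3%
        "deploy_allowed": flip_rate <= max_flip_rate,  # illustrative gate
    }
```

Run against Scenario C's figures, even a generous gate would have reported a combined flip rate of 21% and blocked the v4.2 deployment that instead went to production.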
4.6. A conforming system MUST investigate and document the root cause of every consistency threshold breach within 30 calendar days of detection, with findings that distinguish between training data bias, guideline ambiguity, moderator calibration drift, model regression, and jurisdictional variation.
4.7. A conforming system MUST ensure that consistency metrics are disaggregated by language, region, content category, and moderator cohort, such that aggregate consistency that masks sub-population disparities does not conceal discriminatory enforcement patterns.
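Disaggregation per 4.7 can be implemented as a thin loop over the required axes, reusing `cohens_kappa` from the earlier sketch. The dictionary keys shown are illustrative:

```python
def disaggregated_kappa(cases: list[dict]) -> dict:
    """Each case dict carries 'language', 'region', 'category', 'cohort',
    and the paired decisions 'decision_a' and 'decision_b'. Returns kappa
    per value of each disaggregation axis (4.7), so an aggregate score
    cannot mask a sub-population disparity."""
    results = {}
    for axis in ("language", "region", "category", "cohort"):
        for value in {c[axis] for c in cases}:
            subset = [c for c in cases if c[axis] == value]
            results[(axis, value)] = cohens_kappa(
                [c["decision_a"] for c in subset],
                [c["decision_b"] for c in subset])
    return results
```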
4.8. A conforming system MUST maintain a consistency remediation log documenting all threshold breaches, root causes, remediation actions taken, and post-remediation consistency measurements confirming that the breach has been resolved.
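One possible shape for a remediation log record follows (requires Python 3.10+ for the union annotation). The field names are illustrative, not a mandated schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RemediationLogEntry:
    breach_id: str
    detected_on: date
    enforcement_category: str
    measured_value: float          # e.g. kappa observed at detection
    threshold: float               # the breached threshold from 4.2
    root_cause: str                # one of the 4.6 categories: training data
                                   # bias, guideline ambiguity, moderator
                                   # calibration drift, model regression,
                                   # or jurisdictional variation
    remediation_actions: list[str] = field(default_factory=list)
    post_remediation_value: float | None = None

    @property
    def resolved(self) -> bool:
        """4.8 requires a post-remediation measurement confirming resolution."""
        return (self.post_remediation_value is not None
                and self.post_remediation_value >= self.threshold)
```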
4.9. A conforming system SHOULD implement a golden-set calibration programme that maintains a curated, regularly updated set of at least 500 content items with authoritative enforcement labels determined by a senior policy panel, against which all moderators and models are periodically benchmarked.
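Golden-set benchmarking per 4.9 amounts to scoring each moderator or model against the authoritative labels. A minimal sketch follows; simple percentage agreement is shown, though a chance-corrected measure against the golden labels is equally valid:

```python
def golden_set_benchmark(golden_labels: dict, rater_decisions: dict) -> dict:
    """Score every moderator and model against the authoritative golden-set
    labels (4.9). golden_labels maps case_id -> label; rater_decisions maps
    rater_id -> {case_id: label}."""
    scores = {}
    for rater, decisions in rater_decisions.items():
        shared = golden_labels.keys() & decisions.keys()
        scores[rater] = sum(
            decisions[c] == golden_labels[c] for c in shared) / len(shared)
    return scores
```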
4.10. A conforming system SHOULD monitor consistency trends over time to detect gradual drift — a slow decline in inter-rater agreement that does not breach thresholds in any single measurement period but represents a cumulative degradation that, if unchecked, will eventually breach thresholds.
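Gradual drift per 4.10 can be surfaced with something as simple as a least-squares slope over successive agreement scores; the drift tolerance shown is an illustrative value, and at least two measurement periods are assumed:

```python
def agreement_trend(history: list[float]) -> float:
    """Least-squares slope of agreement scores across successive
    measurement periods. A persistently negative slope signals gradual
    drift (4.10) even while every individual score stays above threshold."""
    n = len(history)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

# Twelve monthly kappa scores, all above the 0.70 threshold but trending down
monthly_kappa = [0.84, 0.83, 0.83, 0.81, 0.80, 0.80,
                 0.78, 0.77, 0.76, 0.75, 0.74, 0.73]
slope = agreement_trend(monthly_kappa)
if slope < -0.005:  # illustrative drift tolerance per period
    print(f"gradual drift detected: {slope:.4f} kappa/period")
```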
4.11. A conforming system SHOULD implement cross-cohort blind review, where a random sample of enforcement decisions made by one moderator cohort is independently reviewed by a moderator from a different cohort, to detect systematic enforcement bias between cohorts.
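Cross-cohort blind review per 4.11 needs only an unbiased sampler that assigns each sampled decision to a reviewer from a different cohort. A sketch, assuming at least two cohorts with non-empty decision pools:

```python
import random

def sample_for_blind_review(decisions_by_cohort: dict, sample_rate: float = 0.02,
                            seed: int | None = None) -> list:
    """Assign a random sample of each cohort's decisions to a reviewer
    drawn from a different cohort (4.11). decisions_by_cohort maps
    cohort_id -> list of decision records."""
    rng = random.Random(seed)
    cohorts = list(decisions_by_cohort)
    assignments = []
    for cohort, pool in decisions_by_cohort.items():
        k = max(1, int(len(pool) * sample_rate))
        for decision in rng.sample(pool, k):
            other = rng.choice([c for c in cohorts if c != cohort])
            assignments.append((decision, other))  # (record, reviewing cohort)
    return assignments
```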
4.12. A conforming system MAY implement real-time consistency checks where, for high-severity enforcement categories (e.g., child safety, terrorism, non-consensual intimate imagery), each enforcement decision is independently evaluated by at least two moderators or models before execution, with discordant decisions routed to specialist review per AG-691.
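The dual-review pattern in 4.12 is a small routing primitive. In this sketch the two `decide_*` callables stand in for independent models or moderators, and `escalate_fn` represents the AG-691 specialist-review pathway:

```python
def dual_review(decide_primary, decide_secondary, content, escalate_fn):
    """Execute a high-severity enforcement decision only when two
    independent reviewers agree (4.12). escalate_fn receives the content
    and both discordant decisions, and returns the final decision."""
    first = decide_primary(content)
    second = decide_secondary(content)
    if first == second:
        return first                                 # concordant: execute
    return escalate_fn(content, first, second)       # discordant: specialists
```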
4.13. A conforming system MAY publish transparency reports that include consistency metrics — at minimum, aggregate inter-rater agreement scores and the number of consistency threshold breaches — to provide external accountability for enforcement consistency.
Content enforcement consistency is the operational prerequisite for community guidelines having any functional meaning. A community guideline that states "harassment is not permitted" is a meaningless declaration if the same harassing post is removed by one moderator and permitted by another, or removed by one model and permitted by the same model's regional variant. Users derive their understanding of platform norms not from the published text of community guidelines but from the observed enforcement patterns — and when those patterns are inconsistent, users correctly conclude that the guidelines are arbitrary, selectively enforced, or untrustworthy.
Inconsistency creates three distinct categories of harm. First, direct harm to affected individuals: when harassment, hate speech, or dangerous content is removed in one enforcement pipeline but permitted in another, the victims in the permissive pipeline suffer real harm that the platform's own policies acknowledge should have been prevented. Second, systemic fairness harm: when enforcement inconsistency correlates with language, region, or content creator demographics — as it almost always does when consistency is uncontrolled — the result is discriminatory enforcement that disproportionately harms or benefits specific communities. Sellers whose listings are moderated by a stricter cohort lose revenue that equivalent sellers in a more lenient cohort retain. Creators whose content is reviewed by a more aggressive model lose audience reach that equivalent creators reviewed by a less aggressive model maintain. These disparities compound over time and erode platform legitimacy. Third, adversarial exploitation: sophisticated bad actors systematically probe for enforcement inconsistency and exploit it. Forum-shopping across regional endpoints, resubmission until a lenient model processes the case, and timing submissions to coincide with moderator shift changes are all well-documented exploitation strategies that become viable only when enforcement is inconsistent.
The challenge of consistency is compounded by the inherent ambiguity of content moderation. Unlike financial transaction monitoring, where compliance thresholds are often numerically defined, content moderation involves subjective judgement about context, intent, severity, and community impact. Reasonable moderators will disagree on borderline cases. The goal of consistency governance is not to eliminate all disagreement — which would be neither possible nor desirable — but to ensure that disagreement remains within acceptable bounds, that systematic biases are detected and corrected, and that the most serious categories of harmful content are enforced with the greatest consistency.
Regulatory pressure for enforcement consistency is intensifying. The EU Digital Services Act (DSA) Article 14 requires platforms to apply terms of service in a "diligent, objective, and proportionate manner" with "due regard to the rights and legitimate interests of all parties involved." Inconsistent enforcement — where identical content receives different outcomes — is, by definition, not objective and not proportionate. The UK Online Safety Act requires platforms to enforce their own published safety policies consistently; a demonstrated gap between stated policy and actual enforcement creates regulatory liability. The Australian Online Safety Act empowers the eSafety Commissioner to assess whether platforms are meeting their own safety standards, and enforcement inconsistency is direct evidence of non-compliance.
The scale of modern content moderation makes consistency governance technically and organisationally challenging. A platform processing 500 million pieces of content per day through multiple models and thousands of human moderators across dozens of languages is operating an enforcement system of extraordinary complexity. Without active consistency governance — calibration, measurement, monitoring, and remediation — inconsistency is not a risk; it is a certainty. Models trained on different data distributions will produce different decisions. Moderators trained at different times, in different languages, by different trainers, will apply guidelines differently. Regional deployment variations will create enforcement gaps. The question is not whether inconsistency exists but whether the organisation measures it, sets thresholds, and remediates breaches.
Content enforcement consistency governance requires a measurement infrastructure, a calibration programme, and a remediation workflow that together ensure enforcement decisions remain within acceptable variance bounds across all moderators, models, regions, languages, and time periods.
Recommended patterns:
Anti-patterns to avoid:
Social Media and User-Generated Content Platforms. These platforms face the highest volume and the greatest linguistic and cultural diversity in content moderation. Consistency governance must account for the fact that the same words or images carry different meaning and severity across cultural contexts. The challenge is distinguishing legitimate contextual variation (a gesture that is offensive in one culture but neutral in another) from unjustified inconsistency (identical harassment treated differently across languages). Platforms should invest in multilingual golden sets developed with native-speaker policy experts for each supported language.
Online Marketplaces. Marketplace moderation involves product listing compliance, counterfeit detection, prohibited item enforcement, and seller conduct policies. Consistency is critical because enforcement disparity directly affects seller revenue. A seller whose listings are moderated more strictly than a competitor's equivalent listings suffers economic harm. Marketplace platforms should measure enforcement consistency by product category, seller language, and seller region, with particular attention to categories where human judgement plays a significant role (e.g., "misleading claims" versus straightforward prohibited item rules).
Gaming and Interactive Platforms. In-game moderation and chat moderation present unique consistency challenges due to the real-time nature of interactions, the use of coded language and in-group terminology that evolves rapidly, and the cultural variation in what constitutes harassment versus competitive banter. Consistency governance for gaming platforms should include moderator specialisation by game community, regular terminology calibration as community language evolves, and measurement of consistency across synchronous (real-time chat) and asynchronous (reported content) moderation pipelines.
Public Sector Platforms. Government-operated community platforms (citizen feedback portals, public consultation forums, municipal social media accounts) face heightened free expression requirements. Inconsistent content moderation by a government platform may constitute viewpoint discrimination in violation of constitutional or human rights protections. Public sector platforms should implement the most rigorous consistency governance, with documented justification for every enforcement action and independent review of enforcement consistency by an authority external to the content management team.
Basic Implementation — The organisation has defined quantitative consistency metrics and minimum thresholds. Consistency is measured at defined intervals (at minimum monthly). Inter-rater calibration exercises are conducted quarterly with a common test set of at least 200 cases. Consistency metrics are disaggregated by language, region, and content category. Root cause investigation is conducted for threshold breaches. Consistency impact assessments are performed before model updates. All mandatory requirements (4.1 through 4.8) are satisfied.
Intermediate Implementation — All basic capabilities plus: a golden-set calibration programme with at least 500 curated cases is operational and benchmarks all moderators and models quarterly. Consistency trend monitoring detects gradual drift before threshold breaches occur. Cross-cohort blind review validates inter-cohort consistency. Pre-deployment consistency gating blocks model deployments that exceed reclassification thresholds. Consistency metrics are reported to senior leadership monthly and included in Trust & Safety operational reviews.
Advanced Implementation — All intermediate capabilities plus: real-time dual-review for high-severity categories routes discordant decisions to specialist review. Consistency metrics are published in external transparency reports. Predictive models identify moderators or models at risk of calibration drift based on decision pattern analysis. Multilingual golden sets are developed with native-speaker policy panels for each supported language. Independent audit annually validates the consistency measurement methodology, golden-set integrity, and remediation effectiveness. Consistency governance is integrated with AG-689 (Abuse Taxonomy), AG-695 (Repeat-Offender Linkage), and AG-696 (Appeal and Reinstatement) for end-to-end enforcement coherence.
Required artefacts:
Retention requirements:
Access requirements:
Test 8.1: Consistency Metric Definition and Threshold Existence
Test 8.2: Continuous Monitoring and Alerting
Test 8.3: Inter-Rater Calibration Exercise Execution
Test 8.4: Pre-Deployment Consistency Impact Assessment
Test 8.5: Root Cause Investigation Completeness
Test 8.6: Disaggregation Completeness
Test 8.7: Remediation Log Completeness
Test 8.8: Evidence Retention and Integrity
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU Digital Services Act (DSA) | Article 14 (Terms of Service) | Direct requirement |
| EU Digital Services Act (DSA) | Article 34 (Systemic Risk Assessment) | Direct requirement |
| EU Digital Services Act (DSA) | Article 42 (Transparency Reporting) | Supports compliance |
| UK Online Safety Act | Section 10 (Safety Duties — Illegal Content) | Direct requirement |
| EU AI Act | Article 9 (Risk Management System) | Supports compliance |
| NIST AI RMF | MAP 2.3 (Fairness and Bias) | Supports compliance |
| ISO 42001 | Clause 9.1 (Monitoring, Measurement, Analysis) | Supports compliance |
| Australian Online Safety Act | Part 4 (Basic Online Safety Expectations) | Supports compliance |
Article 14 requires providers to act in a "diligent, objective, and proportionate manner" when applying and enforcing their terms of service. The requirement for objectivity is a direct requirement for consistency — a moderation decision that depends on which moderator or model processes the case is, by definition, not objective. Enforcement consistency governance provides the operational infrastructure to demonstrate compliance with Article 14's objectivity requirement: quantitative consistency metrics prove that enforcement is consistent, calibration programmes ensure that consistency is maintained, and disaggregated monitoring proves that consistency extends across languages, regions, and content categories. Platforms that cannot demonstrate enforcement consistency face enforcement action by Digital Services Coordinators, who are empowered under Articles 51 and 52 to impose fines of up to 6% of global annual turnover for systematic non-compliance.
Article 34 requires very large online platforms and search engines to identify and assess systemic risks, including risks related to the dissemination of illegal content and negative effects on fundamental rights. Enforcement inconsistency is itself a systemic risk — it creates gaps through which illegal content persists, and it produces discriminatory enforcement patterns that affect fundamental rights (non-discrimination, freedom of expression, human dignity). Consistency monitoring data should be incorporated into the platform's annual systemic risk assessment as evidence of the effectiveness (or ineffectiveness) of content enforcement measures.
The Online Safety Act requires regulated platforms to operate their services using systems and processes that effectively prevent individuals from encountering priority illegal content. "Effectively" implies consistency — a system that removes illegal content in one language but not another, or through one model but not another, is not effective across its user base. Enforcement consistency metrics provide Ofcom with quantifiable evidence of whether the platform's enforcement systems operate uniformly. Ofcom's codes of practice are expected to include requirements for regular testing and assessment of content moderation systems, which will necessarily encompass consistency assessment.
MAP 2.3 addresses the assessment and documentation of AI system performance across demographic groups and subpopulations. In the content moderation context, enforcement inconsistency across languages and regions is a fairness concern that maps directly to MAP 2.3. A moderation model that enforces more strictly against Arabic-language content than English-language content for identical policy violations exhibits bias that should be identified, measured, and mitigated. Consistency governance provides the measurement framework for this assessment.
The Basic Online Safety Expectations established under Part 4 require providers to take reasonable steps to ensure that the service is safe for users. Enforcement consistency is a component of this obligation — a platform that enforces its safety policies inconsistently across its user base is not taking reasonable steps for all users equally. The eSafety Commissioner's power to request information from platforms extends to enforcement performance data, which would include consistency metrics if such metrics exist.
ISO 42001 requires organisations to determine what needs to be monitored and measured, and to evaluate the performance and effectiveness of the AI management system. For content moderation AI systems, enforcement consistency is a critical performance measurement. An AI management system that does not measure or monitor inter-model and inter-moderator consistency is not meeting the monitoring requirements of Clause 9.1 for a material dimension of system performance.
| Field | Value |
|---|---|
| Severity Rating | High |
| Blast Radius | Platform-wide — affects all content creators, consumers, and community members across every enforcement pipeline, region, language, and content category |
Consequence chain: Without content enforcement consistency governance, enforcement decisions become functionally random across the boundary conditions of the moderation system — different models, different moderators, different regions, different languages, different time periods. The immediate failure mode is undetected enforcement variance, where identical content receives materially different outcomes depending on which pipeline processes it. The first-order consequence is threefold: (a) harmful content that should be removed persists in pipelines with more lenient enforcement, causing direct harm to affected users; (b) legitimate content that should be permitted is removed in pipelines with more aggressive enforcement, causing unjustified speech restriction and economic harm to creators; (c) bad actors discover and exploit the inconsistency through systematic forum-shopping, resubmission, and regional routing.

The second-order consequence is discriminatory enforcement patterns that correlate with language, region, or community demographics. These patterns create legal liability under anti-discrimination law, regulatory liability under the DSA and Online Safety Act, and reputational harm when the disparities are publicly documented. The third-order consequence is the erosion of user trust in the platform's governance — users who observe inconsistent enforcement conclude that rules are arbitrary, which reduces compliance with community guidelines, increases adversarial behaviour, and degrades the platform's overall safety environment. The fourth-order consequence is regulatory intervention: platforms that cannot demonstrate enforcement consistency face fines (up to 6% of global turnover under the DSA), mandated transparency obligations, and potential operational restrictions.

The remediation cost for enforcement inconsistency is characteristically high because it requires retroactive review of all decisions made during the inconsistency period, retraining or recalibration of affected moderators and models, and public communication addressing the identified disparities. Historical enforcement inconsistency incidents at major platforms have resulted in regulatory fines ranging from £2 million to £50 million, advertiser withdrawal losses of £10 million to £100 million per quarter, and remediation programme costs of £5 million to £30 million.
Cross-references:

- AG-001 (Operational Boundary Enforcement) defines the operational boundaries within which moderation agents operate; consistency governance ensures that different agents operating within the same boundaries produce consistent enforcement outcomes.
- AG-007 (Governance Configuration Control) ensures that moderation policy configurations are version-controlled and consistently deployed; configuration drift is a common cause of enforcement inconsistency.
- AG-019 (Human Escalation & Override Triggers) defines when moderation decisions should be escalated to human review; consistency governance measures whether escalation criteria are applied consistently across moderators and models.
- AG-022 (Behavioural Drift Detection) detects changes in agent behaviour over time; consistency governance specifically measures whether drift manifests as inter-agent enforcement divergence.
- AG-055 (Audit Trail Immutability & Completeness) ensures that enforcement decision records are tamper-proof and complete, which is a prerequisite for accurate consistency measurement.
- AG-084 (Model Training Data Governance) governs the data used to train moderation models; training data inconsistency across regional variants is a primary driver of enforcement inconsistency.
- AG-210 (Multi-Jurisdictional Regulatory Mapping) documents legitimate jurisdictional variation in enforcement thresholds; consistency governance must distinguish between unjustified inconsistency and documented jurisdictional variation.
- AG-689 (Abuse Taxonomy) defines the categories against which enforcement consistency is measured; without a stable taxonomy, consistency metrics are undefined.
- AG-691 (Escalation to Specialist Review) provides the routing pathway for discordant enforcement decisions identified through consistency checks.
- AG-693 (Shadowban and Visibility Restriction) is subject to the same consistency requirements as overt enforcement — visibility restrictions applied inconsistently create the same fairness and trust harms as inconsistent removals.
- AG-696 (Appeal and Reinstatement) processes are directly affected by enforcement consistency: inconsistent initial enforcement produces inconsistent appeal outcomes unless the appeal process applies an independent consistency standard.