AG-580

Student Assessment Fairness Governance

Education, Research & Scientific Discovery · AGS v2.1 · April 2026
Regulatory mappings: EU AI Act · GDPR · NIST AI RMF · ISO 42001

Section 2: Summary

This dimension governs the deployment and operation of automated or AI-assisted systems that evaluate, score, rank, or otherwise assess students — encompassing essay grading, exam marking, participation scoring, adaptive testing engines, proctoring analytics, and composite academic profiling tools — within educational institutions ranging from primary schools through to postgraduate and professional training programmes. It matters because automated assessment systems operating without adequate fairness controls routinely produce outcomes that systematically disadvantage students on the basis of race, language background, disability status, socioeconomic proxy signals, and gender, and because such outcomes carry high-stakes consequences including grade assignment, progression decisions, scholarship eligibility, and institutional record creation that may follow students for decades. Failure in this dimension manifests as algorithmic grading bias that goes undetected and unchallenged, the absence of meaningful human review pathways, the use of proxy attributes correlated with protected characteristics to differentiate student scores, and the denial of students' rights to understand or contest assessments that determine their academic futures.

Section 3: Examples

Example 1 — Essay Scoring Engine and English Language Learner Penalty

A mid-sized state university deploys an automated essay scoring engine to mark 14,000 first-year composition submissions per semester. The engine is trained on a rubric corpus of high-scoring essays authored predominantly by native English speakers attending well-resourced secondary schools. In the first full semester of deployment, a post-hoc equity audit — triggered only after a spike in student complaints — reveals that students who self-identified as English as a Second Language (ESL) learners receive mean scores 11.3 points lower than native English speakers on structurally equivalent argumentative essays, as evaluated by a blind panel of three experienced human markers. The differential holds after controlling for essay length, topic selection, and submission time. Because the institution has no mandatory human review pathway for automated grades, 847 ESL students receive lower course grades than their work warrants, 212 fall below the 2.5 GPA threshold required for scholarship renewal, and 34 lose funding. The institution's appeal process requires students to identify the specific scoring criterion challenged, but the engine produces only an aggregate score with no criterion-level rationale, making meaningful appeal practically impossible. Total financial harm to affected students across lost scholarships and consequential enrolment decisions exceeds USD 1.8 million in a single academic year.

Example 2 — Remote Proctoring Differential Flagging by Skin Tone

A professional certification body administers online proctored examinations to 62,000 candidates globally using an automated proctoring platform that applies facial detection and movement analytics to flag suspicious behaviour for human review. Internal records obtained through a freedom-of-information request reveal that candidates whose skin tone falls within the darkest quartile, as classified on the Fitzpatrick scale, are flagged for human review at a rate of 38.7 per cent, compared to 14.2 per cent for the lightest quartile, with no corresponding difference in confirmed dishonesty findings — the confirmed dishonesty rate is statistically indistinguishable across skin tone groups at approximately 0.4 per cent. Flagged candidates experience a mean 18-day examination result delay during human review, miss employer onboarding windows, and in 1,340 cases receive a provisional result notification that employers incorrectly interpret as a failure indicator. The certification body has no documented bias testing protocol for the proctoring vendor's model, no contractual right to audit the vendor's training data, and no defined SLA for flag resolution. The pattern persists across three consecutive examination cycles — a total of 14 months — before a graduate researcher publishes an independent analysis that forces regulatory attention.

Example 3 — Adaptive Testing Engine and Disability Attribute Leakage

A K-12 school district in a high-stakes assessment context deploys an adaptive testing engine for annual literacy assessments affecting progression decisions for approximately 9,400 students. The engine adjusts question difficulty dynamically based on response patterns. An internal audit following a disability rights organisation complaint reveals that the engine's question-selection algorithm incorporates response latency — time taken per question — as a signal for ability estimation. Students with documented processing speed disabilities receive adjustments through extended time accommodations, but the algorithm's latency signal is applied to the raw response window without accommodation normalisation, meaning extended-time users' latency data is compared against norm tables derived from standard-time users. This produces systematically lower ability estimates for 611 students with processing speed accommodations, depresses their reported proficiency levels, and results in 203 students being placed into remedial reading tracks that reduce access to advanced coursework in subsequent years. The algorithm's use of latency is not disclosed in the assessment technical manual provided to parents or the district's accessibility coordinator, violating both the institution's own transparency commitments and applicable disability accommodation law.

Section 4: Requirement Statement

4.0 Scope

This dimension applies to any AI or automated system that produces, contributes to, or substantially influences a score, grade, mark, ranking, recommendation, proficiency classification, flag, or other evaluative output pertaining to an individual student or candidate, where that output is used or intended for use in any academic record, progression decision, credential award, scholarship determination, disciplinary proceeding, or equivalent high-stakes context. The dimension applies regardless of whether the automated system operates as the sole decision-maker or as one input among several, and regardless of whether the institution deploying the system developed it internally or procured it from a third party. The dimension applies across all educational levels — primary, secondary, tertiary, postgraduate, professional, and vocational — and to all modalities of assessment including written examination, oral assessment, practical or laboratory evaluation, portfolio review, participation metrics, and proctoring or examination integrity systems. Low-stakes formative feedback tools that produce no persistent record and carry no consequences for student progression are out of scope, provided they are clearly designated as such in institutional policy and technically prevented from feeding into summative records.

4.1 Fairness Baseline and Protected Characteristics

4.1.1 Institutions and system operators MUST establish a documented fairness baseline for each automated assessment system prior to deployment, defining the protected characteristics against which fairness will be measured, including at minimum: race and ethnicity, gender, disability status, English language learner status, socioeconomic status proxies, and first-generation student status.

4.1.2 The fairness baseline MUST include quantitative disparity thresholds — expressed as maximum tolerated differential mean scores, pass rate ratios, or flag rate ratios across protected groups — beyond which deployment is paused and human review is mandated.

4.1.3 Institutions MUST conduct pre-deployment bias testing on a representative sample that adequately represents each protected characteristic group present in the target student population, and MUST document the methodology, sample sizes, results, and any residual disparities accepted with justification.

4.1.4 Systems MUST NOT use, directly or as a proxy, any protected characteristic as a scoring input, including but not limited to: student name, photograph, accent or dialect pattern, response latency without accommodation normalisation, socioeconomic metadata, or institutional identifier correlated with demographic composition.

4.1.5 Where a system is procured from a third party, the institution MUST contractually require the vendor to provide bias testing documentation, training data demographic composition summaries, and the right to conduct or commission independent audits of the system's outputs.
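
The fairness baseline in 4.1.1 and 4.1.2 is most enforceable when expressed in machine-readable form. The sketch below illustrates one possible shape; the threshold values, group statistics, and field names are assumptions chosen for illustration, not normative values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FairnessBaseline:
    """Quantitative disparity thresholds per Section 4.1.2 (values illustrative)."""
    max_mean_score_gap: float = 3.0    # max tolerated mean-score difference (points)
    min_pass_rate_ratio: float = 0.90  # min pass-rate ratio, monitored/reference group
    max_flag_rate_ratio: float = 1.25  # max flag-rate ratio, any group vs. reference

def breaches(baseline: FairnessBaseline, group: dict, reference: dict) -> list[str]:
    """Return the list of threshold breaches that mandate pausing deployment."""
    found = []
    if reference["mean_score"] - group["mean_score"] > baseline.max_mean_score_gap:
        found.append("mean score gap exceeded")
    if group["pass_rate"] / reference["pass_rate"] < baseline.min_pass_rate_ratio:
        found.append("pass rate ratio below floor")
    if group["flag_rate"] / reference["flag_rate"] > baseline.max_flag_rate_ratio:
        found.append("flag rate ratio above ceiling")
    return found

# Example: the Section 3 proctoring case (flag rates 38.7% vs 14.2%) breaches
# any plausible flag-rate ceiling, since 0.387 / 0.142 ≈ 2.7.
```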

4.2 Human Oversight and Review Pathways

4.2.1 Every automated assessment system operating in a high-stakes context MUST provide a documented, accessible, and functionally effective human review pathway through which any student or candidate may request that a qualified human assessor review their assessment outcome.

4.2.2 The human review pathway MUST be communicated to students in plain language prior to the assessment, in the student's language of instruction where feasible, and MUST NOT require the student to demonstrate algorithmic error as a precondition for requesting review.

4.2.3 Institutions MUST define and enforce maximum timeframes for human review completion, calibrated to the consequences of delay: scholarship deadlines, progression cut-offs, and enrolment windows MUST be explicitly considered in SLA definition, and the SLA MUST NOT exceed 15 business days for high-stakes determinations.

4.2.4 Human reviewers conducting appeals MUST have access to the full submission or performance artefact, the scoring rationale produced by the automated system at criterion level, and the disparity statistics for the cohort in which the student's submission was assessed.

4.2.5 Institutions MUST maintain records of all human review requests, outcomes, time-to-resolution, and whether the human review resulted in a score change, and MUST report aggregate statistics to relevant academic governance bodies at minimum annually.
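
One way to implement the deadline-aware SLA logic of 4.2.3 is to set each review's due date to the earlier of the 15-business-day ceiling and any consequential deadline the student faces, less a processing buffer. A minimal sketch, with the buffer size as an assumption:

```python
from datetime import date, timedelta

def add_business_days(start: date, days: int) -> date:
    """Advance a date by the given number of business days (Mon-Fri)."""
    d = start
    while days > 0:
        d += timedelta(days=1)
        if d.weekday() < 5:
            days -= 1
    return d

def review_due_date(requested: date, consequential_deadlines: list[date],
                    buffer_days: int = 3) -> date:
    """Earlier of the 4.2.3 ceiling and any upcoming consequential deadline."""
    ceiling = add_business_days(requested, 15)   # hard ceiling per 4.2.3
    candidates = [ceiling] + [d - timedelta(days=buffer_days)
                              for d in consequential_deadlines if d > requested]
    return min(candidates)

# e.g. a review requested on 1 May with a scholarship deadline of 10 May must
# complete well before the generic 15-business-day ceiling.
```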

4.3 Explainability and Criterion-Level Transparency

4.3.1 Every automated assessment system used in a high-stakes context MUST produce, and make available to the student on request, a criterion-level scoring rationale — identifying which dimensions of the assessed work were scored, what scores were assigned to each dimension, and what the system identified as supporting evidence for each score.

4.3.2 Aggregate or black-box scores without criterion-level breakdown MUST NOT be used as the sole output for any summative assessment in which the student has no other means of understanding the basis for the grade.

4.3.3 The scoring rubric, weighting schema, and any machine-learned model documentation sufficient to understand what signals drive scores MUST be disclosed to the institution's academic quality assurance body and available to students and parents/guardians (where age-appropriate) upon request.

4.3.4 Where model explainability is technically limited — as is common in deep learning scoring architectures — institutions MUST supplement automated output with human review for any score that determines a high-stakes outcome, and MUST NOT claim the system is explainable when it is not.
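
The criterion-level rationale required by 4.3.1 might be represented as follows. The criterion names, weights, and evidence fields are illustrative assumptions, not a normative schema:

```python
# Illustrative shape for a 4.3.1 criterion-level scoring rationale.
rationale = {
    "assessment_id": "ENG101-2026S1-essay2",   # hypothetical identifier
    "aggregate_score": 71,
    "criteria": [
        {"criterion": "thesis_clarity", "weight": 0.25, "score": 18, "max": 25,
         "evidence": "Thesis stated in paragraph 1; restated with qualification in conclusion."},
        {"criterion": "argument_structure", "weight": 0.35, "score": 26, "max": 35,
         "evidence": "Three supporting claims; counter-argument addressed in paragraph 4."},
        {"criterion": "use_of_sources", "weight": 0.25, "score": 17, "max": 25,
         "evidence": "Five sources cited; two integrated analytically."},
        {"criterion": "mechanics", "weight": 0.15, "score": 10, "max": 15,
         "evidence": "Minor comma splices; otherwise clean."},
    ],
}

# The aggregate must be reconstructible from the criterion scores, otherwise
# the 4.3.2 prohibition on unexplained aggregate-only output is violated.
assert sum(c["score"] for c in rationale["criteria"]) == rationale["aggregate_score"]
```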

4.4 Accommodation and Accessibility Compliance

4.4.1 Automated assessment systems MUST incorporate documented procedures for applying student accommodation entitlements — including extended time, alternative format delivery, assistive technology compatibility, and quiet environment provisions — in a manner that does not disadvantage the accommodated student in the system's scoring model.

4.4.2 Where accommodation data modifies student behaviour in ways that affect signals used by the scoring model (including response latency, response length, or session duration), the system MUST normalise these signals against accommodation-appropriate norm references before applying scoring logic, or MUST exclude these signals from scoring for accommodated students.

4.4.3 Institutions MUST verify at minimum annually that accommodation normalisation procedures function correctly for each active accommodation type, and MUST document the verification results.

4.4.4 Accommodation status MUST NOT be used as a scoring signal, a flag trigger, or a basis for differential scrutiny of student submissions.
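
A minimal sketch of the 4.4.2 normalisation rule, assuming per-accommodation norm tables exist: behavioural signals are standardised against the norm reference for the student's accommodation group before scoring, or excluded when no valid reference exists. All numeric values are illustrative.

```python
NORM_TABLES = {  # per-accommodation latency norms in seconds (illustrative values)
    "standard":      {"mean": 42.0, "sd": 15.0},
    "extended_time": {"mean": 71.0, "sd": 24.0},
}

def normalised_latency(raw_latency_s: float, accommodation: str) -> float | None:
    """Z-score latency against the accommodation-appropriate norm reference.

    Returns None (signal excluded from scoring) when no norm table exists,
    per the 4.4.2 exclusion alternative.
    """
    norms = NORM_TABLES.get(accommodation)
    if norms is None:
        return None  # exclude the signal rather than misapply a foreign norm
    return (raw_latency_s - norms["mean"]) / norms["sd"]

# A 70 s response is ~1.9 SD slow against the standard-time table but roughly
# typical against the extended-time table; comparing both groups against the
# standard-time table reproduces the Example 3 failure.
```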

4.5 Audit Logging and Traceability

4.5.1 Institutions MUST maintain immutable audit logs for all automated assessment outputs, capturing at minimum: the student identifier (pseudonymised where technically feasible), the system version and model version active at the time of scoring, the input artefact hash or reference, the criterion-level scores assigned, the timestamp of scoring, and any post-processing transformations applied to the raw model output before the grade is recorded.

4.5.2 Audit logs MUST be retained for a minimum of seven years from the date of the assessment, or for the duration of the student's enrolment plus five years, whichever is longer, to support retrospective audit in the event of regulatory investigation or legal challenge.

4.5.3 Audit logs MUST be accessible to the institution's designated academic integrity and data governance officers, and MUST be producible in response to a valid data subject access request or regulatory inquiry within 10 business days.

4.5.4 Any manual override or post-hoc score adjustment applied to an automated assessment output MUST be logged with the identity of the authorising individual, the stated justification, and the original automated score preserved alongside the adjusted score.
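
One common way to make the 4.5.1 logs tamper-evident is hash chaining, where each entry embeds the hash of its predecessor. A minimal sketch, with field names following 4.5.1 but otherwise assumed:

```python
import hashlib
import json
from datetime import datetime, timezone

def append_audit_record(log: list[dict], record: dict) -> dict:
    """Append a 4.5.1 audit record, chained to its predecessor's hash so that
    after-the-fact tampering with earlier entries is detectable."""
    entry = {
        "student_pseudonym": record["student_pseudonym"],
        "system_version": record["system_version"],
        "model_version": record["model_version"],
        "artefact_sha256": record["artefact_sha256"],
        "criterion_scores": record["criterion_scores"],
        "scored_at": datetime.now(timezone.utc).isoformat(),
        "prev_hash": log[-1]["entry_hash"] if log else "genesis",
    }
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)
    return entry

# A 4.5.4 manual override is recorded as a new chained entry that references
# the original, preserving both scores rather than mutating the earlier record.
```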

4.6 Ongoing Monitoring and Disparity Surveillance

4.6.1 Institutions MUST implement continuous or periodic disparity monitoring for all active automated assessment systems, with monitoring frequency calibrated to assessment volume: systems processing more than 500 student assessments per academic period MUST be monitored at minimum quarterly.

4.6.2 Disparity monitoring MUST compare score distributions across protected characteristic groups using a statistically valid method — including at minimum mean difference testing, pass rate ratio analysis, and effect size estimation — and MUST use a pre-defined threshold to trigger mandatory human review of flagged disparities.

4.6.3 Where disparity monitoring identifies a differential exceeding the institution's pre-defined threshold, the institution MUST suspend automated scoring for affected assessment items or cohorts pending investigation, and MUST notify affected students of the delay and its cause in non-technical language.

4.6.4 Disparity monitoring results MUST be reported to the institution's academic governance body and, where applicable, to the relevant regulatory or accreditation authority, at minimum annually, and MUST be included in the institution's public-facing equality and diversity reporting where institutional policy or law requires such reporting.

4.6.5 Institutions MUST document and periodically review the statistical methods, thresholds, and governance processes used in disparity monitoring, updating them as the student population composition or assessment modality changes.
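
A sketch of the three statistics named in 4.6.2 plus a pre-defined trigger, using standard scientific Python; the trigger thresholds shown are assumptions, since 4.1.2 requires each institution to define its own:

```python
import numpy as np
from scipy import stats

def disparity_report(group: np.ndarray, reference: np.ndarray,
                     pass_mark: float) -> dict:
    """Mean difference test, pass-rate ratio, and effect size per 4.6.2."""
    t_stat, p_value = stats.ttest_ind(group, reference, equal_var=False)  # Welch
    pooled_sd = np.sqrt((group.var(ddof=1) + reference.var(ddof=1)) / 2)
    return {
        "mean_difference": float(group.mean() - reference.mean()),
        "welch_p_value": float(p_value),
        "pass_rate_ratio": float((group >= pass_mark).mean()
                                 / (reference >= pass_mark).mean()),
        "cohens_d": float((group.mean() - reference.mean()) / pooled_sd),
    }

def review_required(report: dict) -> bool:
    """Pre-defined trigger per 4.6.2; threshold values are illustrative."""
    return (report["welch_p_value"] < 0.01 and abs(report["cohens_d"]) > 0.2) \
        or report["pass_rate_ratio"] < 0.90
```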

4.7 Vendor and Third-Party Governance

4.7.1 Prior to procuring any automated assessment system from a third party, institutions MUST complete a structured due diligence assessment evaluating: the vendor's bias testing methodology, the demographic composition of training data, the vendor's incident response procedures for identified bias events, and the contractual rights available to the institution to audit, adjust, or terminate the system.

4.7.2 Procurement contracts for automated assessment systems MUST include provisions requiring the vendor to: notify the institution within five business days of any identified model defect, bias event, or security incident affecting the system; provide updated bias testing results following any model update or retraining; and cooperate with institution-commissioned independent audits.

4.7.3 Institutions MUST NOT accept contractual terms that prevent them from sharing aggregate disparity findings with regulatory or accreditation bodies, or that prohibit students from exercising their legal rights in relation to automated assessment.

4.7.4 Institutions SHOULD maintain an inventory of all third-party automated assessment systems in active use, including version history, deployment scope, associated student populations, and known limitations.

4.8 Student Notice, Consent, and Rights

4.8.1 Institutions MUST inform students, prior to any automated assessment, that automated tools will be used in their evaluation, what aspects of their performance the tools will assess, and what the role of human review is in the final grade determination.

4.8.2 Where applicable law requires explicit consent for automated decision-making affecting individuals — including but not limited to the EU General Data Protection Regulation Article 22 and equivalent national provisions — institutions MUST obtain, record, and honour such consent before automated assessment is applied.

4.8.3 Students MUST be informed of their right to request human review of automated assessment outcomes, the process for exercising that right, and the expected timeframe for resolution, in writing prior to the assessment.

4.8.4 Institutions MUST ensure that the exercise of student rights — including requesting human review, submitting a complaint, or refusing consent where applicable — does not result in adverse academic consequences for the student.

4.9 Incident Response and Remediation

4.9.1 Institutions MUST maintain a documented incident response procedure for automated assessment fairness failures, defining: the criteria constituting a fairness incident, the roles responsible for incident declaration and investigation, the process for identifying affected students, the remediation options available (including re-grading, grade annulment, and restitution), and the communication obligations to affected students and governance bodies.

4.9.2 Upon declaration of a fairness incident, institutions MUST notify affected students within five business days, explaining the nature of the incident, the likely impact on their assessment outcome, and the remediation steps the institution will take.

4.9.3 Institutions MUST complete remediation — including re-grading, correction of academic records, and reversal of any consequential decisions — within 30 calendar days of incident declaration for high-stakes determinations, and MUST document all remediation actions taken.

4.9.4 Post-incident review findings MUST be incorporated into the institution's fairness baseline documentation and used to update pre-deployment testing protocols for subsequent assessment cycles.

Section 5: Rationale

Structural Enforcement Necessity

Student assessment outcomes are among the highest-consequence algorithmic decisions applied to individuals in modern institutional life. A grade, a proficiency classification, or an academic record notation can determine university admission, scholarship award, professional licence eligibility, and employment opportunity across a student's entire career trajectory. The structural asymmetry between an automated system processing tens of thousands of assessments at scale and an individual student with limited visibility into how their work was evaluated — and limited institutional power to challenge the outcome — creates conditions under which systematic bias can persist undetected for extended periods, as demonstrated by the examples in Section 3.

Behavioural controls alone — norms, guidelines, and professional commitments — are insufficient to govern this domain at the required level of reliability. Institutional incentives frequently favour efficiency and cost reduction over equity assurance. Vendors of automated assessment products are not always transparent about model limitations, training data demographics, or known failure modes. Academic staff, under workload pressure, may defer to automated outputs even when their professional judgement would diverge. Students, particularly those from disadvantaged backgrounds, may lack the institutional knowledge or advocacy support to challenge outcomes effectively. Without structural controls — mandatory pre-deployment testing, enforced human review pathways, immutable audit logs, and contractual vendor accountability — the conditions for systematic harm are reproduced cycle after cycle.

Preventive Control Design

This dimension is classified as preventive rather than corrective because the harms associated with biased automated assessment are frequently irreversible by the time they are detected. A student who loses scholarship funding due to a biased algorithm, who is placed in a remedial track that forecloses access to advanced coursework, or whose professional certification is delayed past an employer onboarding window, cannot be made whole by retrospective correction alone. Preventive controls — bias testing before deployment, accommodation normalisation before scoring, rights communication before assessment — interrupt the harm pathway at the point where intervention is most effective and least costly.

Proportionality and the High-Stakes Threshold

Not all automated assessment carries equal risk. Low-stakes formative tools that provide feedback without consequence can appropriately operate with lighter controls. The requirements in Section 4 are calibrated to high-stakes contexts — those where the output affects academic records, progression, credentials, or funding — where the proportionality principle demands robust control regardless of the operational burden on institutions or vendors. The 15-business-day SLA for human review (Section 4.2.3) and the 30-calendar-day remediation window (Section 4.9.3) are deliberately set at boundaries that balance operational feasibility with the urgency of consequential harm prevention.

Section 6: Implementation Guidance

Pre-Deployment Equity Assessment Protocol: Institutions should establish a formal equity assessment gate that automated assessment systems must pass before deployment. This gate should include: demographic composition analysis of any training data used to develop or validate the system; differential item functioning analysis for each scored criterion; a hold-out evaluation on a locally representative student sample with sufficient minority group representation; and sign-off by both the institution's academic quality assurance lead and its equality, diversity, and inclusion officer. Documenting this gate in institutional policy ensures it is not bypassed under procurement pressure.
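
For the differential item functioning analysis mentioned above, one widely used approach is the Mantel-Haenszel procedure: stratify examinees by total score and compute a common odds ratio for item correctness across focal and reference groups. A minimal sketch, under the assumption that examinees are already banded by total score:

```python
from collections import defaultdict

def mantel_haenszel_or(rows) -> float:
    """Mantel-Haenszel common odds ratio for one assessment item.

    rows: iterable of (total_score_band, is_focal_group, answered_correctly).
    Stratifying by total score compares examinees of comparable ability.
    """
    strata = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
    for band, focal, correct in rows:
        cell = strata[band]
        if not focal:
            cell["A" if correct else "B"] += 1   # reference group
        else:
            cell["C" if correct else "D"] += 1   # focal group
    num = den = 0.0
    for cell in strata.values():
        n = sum(cell.values())
        if n == 0:
            continue
        num += cell["A"] * cell["D"] / n
        den += cell["B"] * cell["C"] / n
    return num / den if den else float("inf")

# Ratios near 1.0 suggest no DIF; ratios far from 1.0 warrant item-level
# expert review before the item enters a deployed assessment.
```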

Accommodation-Aware Scoring Pipelines: Where automated scoring uses behavioural signals — latency, session duration, revision patterns, response sequencing — engineering teams should implement a pre-scoring normalisation layer that applies accommodation-specific adjustments before signals enter the scoring model. This layer should be independently testable: institutions can validate it by submitting test cases in which accommodation flags are toggled and verifying that the resulting score distributions are statistically equivalent across accommodated and non-accommodated cohorts for equivalent work quality.
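
A sketch of such a toggle test, assuming a score_pipeline callable standing in for the institution's own pipeline. Note that failing to reject the null hypothesis is weaker than a formal equivalence test such as TOST, which institutions may prefer:

```python
from scipy.stats import ks_2samp

def normalisation_toggle_test(score_pipeline, matched_cases,
                              alpha: float = 0.05) -> bool:
    """Score matched submissions of equivalent quality twice, toggling only
    the accommodation flag, then compare the score distributions."""
    with_accom = [score_pipeline(case, accommodation="extended_time")
                  for case in matched_cases]
    without = [score_pipeline(case, accommodation="standard")
               for case in matched_cases]
    statistic, p_value = ks_2samp(with_accom, without)
    # Pass: no detectable distributional difference at the chosen alpha.
    return p_value >= alpha
```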

Tiered Human Review Integration: Rather than treating human review as an exception pathway invoked only on student complaint, institutions should implement tiered review.

Tier 1: all automated scores at or near decision-relevant thresholds (e.g., within two percentage points of a pass mark, scholarship cutoff, or proficiency boundary) receive mandatory human review before being recorded.

Tier 2: a statistically valid random sample of all automated scores across each protected characteristic group is reviewed by qualified human assessors each assessment cycle as a calibration and drift detection mechanism.

Tier 3: any student-initiated review request triggers a full human assessment of the submission with access to criterion-level automated output.
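
The routing rule might look like the following sketch. The two-point proximity window comes from the Tier 1 example above; the sampling rate is an assumption, and Tier 2 stratification by protected characteristic group is omitted for brevity:

```python
import random

def review_tier(score: float, boundaries: list[float],
                student_requested: bool, sample_rate: float = 0.02) -> int | None:
    """Return 1, 2, or 3 for the mandated review tier, or None for no review."""
    if student_requested:
        return 3                      # Tier 3: student-initiated full review
    if any(abs(score - b) <= 2.0 for b in boundaries):
        return 1                      # Tier 1: near a decision-relevant threshold
    if random.random() < sample_rate:
        return 2                      # Tier 2: random calibration sample
    return None

# e.g. review_tier(48.5, boundaries=[50.0], student_requested=False) -> 1
```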

Disparity Monitoring Dashboard: Institutions with sufficient technical capacity should implement a monitoring dashboard that visualises score distributions by protected characteristic group in near-real-time during high-volume assessment periods. Alert thresholds configured in advance allow academic governance officers to observe emerging disparities and pause automated scoring before a full assessment cohort is affected.
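
A minimal streaming sketch of such an alert, recomputing a rolling mean-score gap between a monitored group and a reference cohort; the window size, minimum-sample guard, and threshold are all assumptions:

```python
from collections import deque

class DisparityAlert:
    """Rolling-window disparity alert for high-volume assessment periods."""

    def __init__(self, max_gap: float, window: int = 500):
        self.max_gap = max_gap
        self.scores = {"group": deque(maxlen=window),
                       "reference": deque(maxlen=window)}

    def observe(self, cohort: str, score: float) -> bool:
        """Record a score; return True when the rolling gap breaches threshold."""
        self.scores[cohort].append(score)
        g, r = self.scores["group"], self.scores["reference"]
        if len(g) < 30 or len(r) < 30:   # suppress alerts on tiny samples
            return False
        gap = sum(r) / len(r) - sum(g) / len(g)
        return gap > self.max_gap
```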

Vendor Contract Template: Institutions procuring automated assessment systems should use a standardised contract addendum covering: bias testing disclosure obligations; model update notification requirements; audit rights; data localisation and student data protection; incident reporting timelines; and the institution's right to suspend vendor system use pending investigation. Legal and procurement teams should be briefed on the specific obligations of this dimension so that contract negotiation reflects institutional compliance requirements.

Student-Facing Communication Materials: Plain-language pre-assessment notices should explain, without technical jargon: that automated tools will assist in marking, what the tools evaluate, that human review is available and how to request it, and who to contact with questions. Notices should be translated where the student population includes significant non-English-speaking cohorts and should be accessible in formats compatible with common assistive technologies.

Explicit Anti-Patterns

Treating Vendor Marketing Claims as Bias Evidence: Vendors frequently assert that their systems are "fair," "bias-tested," or "validated." These claims, without underlying methodology documentation, sample size disclosure, and independent verification, provide no evidential basis for institutional compliance with Section 4.1. Accepting vendor assurance without contractual disclosure rights is an anti-pattern that transfers risk to the institution and its students.

Opacity-by-Design Score Reporting: Providing students only with a final numerical grade or pass/fail determination, without criterion-level rationale, renders the Section 4.2 appeal pathway functionally ineffective. Students cannot articulate a meaningful challenge without understanding what was assessed and how. Designing systems that produce only aggregate outputs and then claiming a review pathway exists is a compliance anti-pattern.

Accommodation Additionality Rather Than Normalisation: A common implementation error is to grant extended time as an additive accommodation — increasing the allowable session window — without modifying how the scoring model interprets time-based signals. This produces the exact failure described in Section 3, Example 3. Extended time must modify the model's interpretation of latency-related signals, not merely the session boundary.

Retroactive Monitoring as a Substitute for Pre-Deployment Testing: Institutions sometimes deploy automated assessment systems without pre-deployment equity assessment, planning to "monitor for issues" after launch. Because high-stakes assessments occur in defined cycles, a bias problem discovered retrospectively may have already affected an entire cohort's academic records, scholarship outcomes, and progression decisions. Retroactive monitoring complements but does not substitute for pre-deployment testing.

Restricting Student Rights Through Terms and Conditions: Some institutions or vendors include language in examination terms and conditions that limits students' ability to request review, share concerns externally, or seek independent advice. Where such terms conflict with applicable law or the requirements of this dimension, they are both legally vulnerable and ethically indefensible. Institutions should review their terms and conditions for such restrictions as part of the pre-deployment governance process.

Using Participation Analytics as Proxy Assessment Without Disclosure: Learning management system engagement metrics — login frequency, time-on-task, discussion post volume — are increasingly used as inputs to automated assessment or as equity adjustment factors. These signals are heavily confounded by socioeconomic factors, disability, employment status, and caregiving responsibilities, and their use without disclosure and fairness testing constitutes an anti-pattern under this dimension.

Maturity Model

Level 1 — Emergent: Institution is aware of automated assessment risks; no formal pre-deployment testing; human review pathway exists informally but is not documented or communicated to students; no audit logging.

Level 2 — Developing: Pre-deployment bias testing conducted but methodology is informal; human review pathway is documented and communicated; audit logs exist but retention and accessibility are not standardised; vendor contracts do not include required provisions.

Level 3 — Defined: Formal equity assessment gate in institutional policy; tiered human review implemented; accommodation normalisation verified annually; disparity monitoring conducted quarterly; vendor contracts include required provisions; students receive plain-language pre-assessment notice.

Level 4 — Managed: Continuous disparity monitoring with automated alert thresholds; independent third-party audit conducted annually; fairness metrics reported publicly; remediation procedures rehearsed through tabletop exercises; student feedback incorporated into fairness baseline updates.

Level 5 — Optimising: Fairness baselines updated dynamically as population composition changes; vendor audit rights exercised proactively; institution contributes to sector-wide standards development; student representatives included in governance of automated assessment systems; research programme actively advances assessment equity methodology.

Section 7: Evidence Requirements

7.1 Pre-Deployment Documentation

7.2 Operational Evidence

7.3 Vendor Governance Evidence

7.4 Student Rights Evidence

7.5 Incident Evidence

Section 8: Test Specification

Test 8.1 — Fairness Baseline and Pre-Deployment Bias Testing (Maps to Section 4.1.1, 4.1.2, 4.1.3)

Objective: Verify that a documented fairness baseline with quantitative disparity thresholds and pre-deployment bias testing exists and is current for each active automated assessment system.

Method: Request the equity assessment gate report and fairness baseline document for each system in scope. Verify that: (a) all required protected characteristics are covered; (b) quantitative disparity thresholds are explicitly stated as numerical values, not qualitative descriptions; (c) bias testing was conducted on a sample that includes each protected characteristic group; (d) sample sizes are documented and sufficient for statistical validity; (e) the report was signed off by the designated governance roles prior to system deployment.

Pass Criteria:

Test 8.2 — Human Review Pathway Accessibility and Functionality (Maps to Section 4.2.1, 4.2.2, 4.2.3, 4.2.4)

Objective: Verify that a documented, accessible, and functionally effective human review pathway exists and that SLAs account for high-stakes deadlines.

Method: Review the documented human review procedure. Conduct a process walkthrough simulating a student appeal request: (a) verify that the procedure does not require the student to demonstrate algorithmic error as a precondition; (b) review the SLA definition and verify that scholarship, progression, and enrolment deadlines are explicitly referenced; (c) confirm that the maximum SLA for high-stakes determinations does not exceed 15 business days; (d) verify that human reviewers have documented access to the full submission artefact, criterion-level scoring rationale, and cohort disparity statistics; (e) review pre-assessment student communications to confirm the pathway is described in plain language.

Pass Criteria:

Test 8.3 — Criterion-Level Explainability (Maps to Section 4.3.1, 4.3.2, 4.3.3, 4.3.4)

Objective: Verify that automated assessment systems produce and make available criterion-level scoring rationales, and that institutions do not misrepresent explainability capabilities.

Method: Request a sample of ten assessment outputs from the most recent assessment cycle (selection to include a range of score levels). For each output: (a) verify that a criterion-level breakdown is produced by the system; (b) verify that criterion scores and supporting evidence references are identifiable; (c) attempt to request the criterion-level output through the student-facing interface to confirm student accessibility; (d) review model documentation disclosed to the academic quality assurance body and confirm it is available; (e) where model explainability is technically limited, verify that mandatory human review is applied to all high-stakes outcomes from that system.

Pass Criteria:

Test 8.4 — Accommodation Normalisation Verification (Maps to Section 4.4.1, 4.4.2, 4.4.3, 4.4.4)

Objective: Verify that accommodation entitlements are correctly applied in the scoring model and that behavioural signals affected by accommodations are appropriately normalised.

Method: Obtain the accommodation normalisation specification and annual verification results. For each active accommodation type: (a) identify whether the accommodation modifies any signals used in scoring (latency, session duration, response length, etc.); (b) verify that each affected signal is either normalised against accommodation-appropriate norm references or excluded from scoring for accommodated students, per Section 4.4.2; (c) review the most recent annual verification results and confirm they cover every active accommodation type, per Section 4.4.3; (d) confirm that accommodation status itself is not used as a scoring signal, flag trigger, or basis for differential scrutiny, per Section 4.4.4.

Section 9: Regulatory Mapping

Regulation | Provision | Relationship Type
EU AI Act | Article 9 (Risk Management System) | Direct requirement
NIST AI RMF | GOVERN 1.1, MAP 3.2, MANAGE 2.2 | Supports compliance
ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 8.2 (AI Risk Assessment) | Supports compliance
FERPA | 34 CFR Part 99 (Student Education Records) | Supports compliance

EU AI Act — Article 9 (Risk Management System)

Article 9 requires providers of high-risk AI systems to establish and maintain a risk management system that identifies, analyses, estimates, and evaluates risks. Student Assessment Fairness Governance implements a specific risk mitigation measure within this framework. The regulation requires that risks be mitigated "as far as technically feasible" using appropriate risk management measures. For deployments classified as high-risk under Annex III, compliance with AG-580 supports the Article 9 obligation by providing structural governance controls rather than relying solely on the agent's own reasoning or behavioural compliance.

NIST AI RMF — GOVERN 1.1, MAP 3.2, MANAGE 2.2

GOVERN 1.1 addresses legal and regulatory requirements; MAP 3.2 addresses risk context mapping; MANAGE 2.2 addresses risk mitigation through enforceable controls. AG-580 supports compliance by establishing structural governance boundaries that implement the framework's approach to AI risk management.

ISO 42001 — Clause 6.1, Clause 8.2

Clause 6.1 requires organisations to determine actions to address risks and opportunities within the AI management system. Clause 8.2 requires AI risk assessment. Student Assessment Fairness Governance implements a risk treatment control within the AI management system, directly satisfying the requirement for structured risk mitigation.

Section 10: Failure Severity

Severity Rating: Critical
Blast Radius: Organisation-wide — potentially cross-organisation where assessment systems are vendor-operated or shared across institutions
Escalation Path: Immediate executive notification and regulatory disclosure assessment

Consequence chain: Without student assessment fairness governance, the governance framework has a structural gap that can be exploited at machine speed. The failure mode is not gradual degradation — it is a binary absence of control that permits unbounded automated scoring and flagging in the dimension this protocol governs. The immediate consequence is uncontrolled assessment behaviour within the scope of AG-580, potentially cascading to dependent dimensions and downstream systems such as progression, scholarship, and credentialing decisions. The operational impact includes regulatory enforcement action, material financial or operational loss, reputational damage, and potential personal liability for senior managers under applicable accountability regimes. Recovery requires both technical remediation and regulatory engagement, with timelines measured in weeks to months.

Cite this protocol
AgentGoverning. (2026). AG-580: Student Assessment Fairness Governance. The 783 Protocols of AI Agent Governance, AGS v2.1. agentgoverning.com/protocols/AG-580