This dimension governs the conditions under which an AI agent must refuse, escalate, or partially comply with a delegated task when that task violates safety constraints, ethical boundaries, applicable law, or organisational governance policy — irrespective of whether the instruction originates from an authorised operator, a trusted principal in a verified delegation chain, or an automated workflow with legitimate credentials. It matters because the authority to issue instructions does not confer the authority to override safety and ethics constraints: an agent that treats authorisation as a sufficient and final condition for compliance is structurally vulnerable to insider abuse, compromised orchestration pipelines, and coercion attacks that exploit legitimate access paths. Failure looks like an enterprise workflow agent executing a bulk data exfiltration task because the instruction arrived under a valid service account token, a financial agent initiating an unauthorised wire transfer because a senior operator issued the command verbally via a voice interface, or a safety-critical industrial agent disabling an emergency stop mechanism because a plant manager's authenticated session issued the override — all cases in which the agent completed the action, the authorisation check passed, and governance failed entirely at the semantic and ethical layer.
A treasury automation agent deployed by a mid-sized investment firm is tasked with executing bulk payment instructions submitted through the firm's authenticated ERP integration. A finance director with full operator-level credentials submits a task batch containing 47 outbound SWIFT transfer instructions totalling USD 4.3 million, routed to beneficiary accounts that had not appeared in the firm's transaction history and that were registered in jurisdictions flagged under the firm's own AML policy. The agent's authorisation layer confirms the finance director holds the payments:execute:bulk permission and the ERP token is valid. The agent executes all 47 transfers within 11 minutes. The transfers are later identified as the first stage of a business email compromise fraud; the finance director's account had been taken over 48 hours earlier. A properly governed refusal and escalation layer would have identified that the beneficiary account profile, jurisdictional risk flags, transaction velocity, and first-appearance-in-ledger signals collectively triggered the agent's unsafe-work rejection threshold, suspended execution, and escalated to a secondary authorised human reviewer before any funds moved. Because no such layer existed, full financial loss was realised before detection. Recovery: USD 810,000. Total loss: USD 3.49 million.
An automated process control agent managing thermal regulation in a pharmaceutical batch manufacturing facility receives an authenticated instruction from the plant operations manager — who holds the highest operator permission tier in the system — to disable the secondary coolant failsafe interlock for a 90-minute window to allow maintenance access to a sensor array. The task is semantically well-formed, the credential is valid, and the instruction falls within the operator's nominal authority scope as defined in the access control matrix. The agent disables the interlock as instructed. Thirty-seven minutes later, a primary coolant pump failure occurs; the secondary interlock — now disabled — does not trigger. Batch temperature rises 22°C above the critical threshold. The batch is destroyed (replacement cost: USD 2.1 million in active pharmaceutical ingredient), a thermal sensor housing ruptures causing a minor chemical release, and three workers are evacuated for precautionary medical assessment. A refusal-governance layer calibrated to the agent's safety constraint profile would have recognised that disabling the secondary failsafe constituted a category-one unsafe work instruction under the agent's pre-loaded process safety management policy, required dual-authorisation from both the operations manager and the safety officer, and would have suspended the task and issued an escalation ticket. Neither the operations manager nor the safety officer had configured the dual-authorisation requirement to apply to runtime agent instructions, representing a policy instantiation failure that this dimension's mandatory verification requirements are designed to surface.
A benefits determination agent deployed by a regional social welfare authority assists caseworkers in processing eligibility decisions for housing assistance. The agent is authorised to draft preliminary decisions but must queue all final determinations for human caseworker countersignature. A senior caseworker, operating under backlog pressure, issues the agent a series of natural-language instructions directing it to mark 134 pending cases as "administratively ineligible" without generating the individualised assessment outputs required under domestic administrative law. The caseworker holds a decisions:draft:batch permission and the instruction is syntactically consistent with a permitted batch operation. The agent completes the batch, generating 134 adverse determinations with no individualised reasoning. When reviewed 19 days later during a routine quality audit, all 134 decisions are found to be procedurally defective under the applicable administrative procedure statute; 89 of the applicants had by that time lost their interim accommodation. Re-adjudication cost to the authority: 1,400 staff-hours and USD 340,000 in emergency housing payments. Three applicants initiated judicial review proceedings. A refusal-governance layer that evaluated task outputs against the agent's mandatory procedural compliance profile — rather than solely validating the caseworker's permission scope — would have identified that the instruction pattern suppressed required individualised reasoning, triggered a mandatory escalation to the authority's legal compliance unit, and prevented all 134 defective determinations from being issued.
This dimension applies to all AI agents that accept delegated task instructions from any principal — human operators, automated orchestration systems, peer agents, API callers, or workflow engines — and that possess the capability to take actions with real-world consequences. It covers all instruction sources regardless of credential validity, permission scope, or position in a delegation hierarchy. It applies from the point at which an instruction is received through to the point at which any resulting action is completed, partially completed, or suspended. It encompasses refusal, partial compliance, escalation, and logging obligations. Agents operating in read-only or purely advisory capacities with no action authority are exempt from the action-suspension requirements of sections 4.3 and 4.4 but remain subject to the logging and escalation requirements of sections 4.6 and 4.7.
4.1.1 The deploying organisation MUST define and maintain a documented taxonomy of unsafe, prohibited, and restricted work categories applicable to the agent's operational domain. This taxonomy MUST be instantiated within the agent's governance configuration at deployment time and MUST be versioned and auditable.
4.1.2 The taxonomy MUST include at minimum: (a) absolute prohibitions that cannot be overridden by any principal at any authority level; (b) conditional restrictions requiring elevated authorisation or dual authorisation before execution; and (c) escalation-required categories where execution may proceed only after human confirmation.
4.1.3 The deploying organisation MUST review and update the unsafe work taxonomy at a minimum frequency of once per calendar quarter, or immediately following any incident in which an agent executed a task that was subsequently determined to have been unsafe or policy-violating.
4.1.4 For Safety-Critical / CPS agents and Embodied / Edge / Robotic agents, the unsafe work taxonomy MUST be reviewed against the applicable process safety management or functional safety standard (including but not limited to IEC 61508, IEC 61511, ISO 13849, or equivalent domain standard) and MUST carry a formal sign-off from the responsible safety officer.
4.2.1 The agent MUST evaluate every received instruction against its unsafe work taxonomy before initiating any externally observable action. This evaluation MUST be performed regardless of the instruction source's credential validity or permission scope.
4.2.2 The instruction evaluation MUST assess: (a) the semantic content of the instruction; (b) the category and magnitude of actions the instruction would cause the agent to take; (c) whether the resulting actions fall within any prohibited, restricted, or escalation-required category; and (d) whether any contextual signals — including task velocity, deviation from established behavioural baselines, jurisdictional flags, or temporal anomalies — elevate the instruction's risk classification.
4.2.3 The agent MUST NOT treat credential validity, permission scope confirmation, or position in a verified delegation chain as a sufficient condition to bypass instruction evaluation. Authorisation is a necessary but not sufficient condition for compliance.
4.2.4 For agents operating in multi-step workflow pipelines, instruction evaluation MUST be performed at each step boundary, not solely at the point of initial task receipt. A task that was clean at receipt may generate downstream sub-tasks that individually or collectively constitute unsafe work.
4.3.1 When an instruction falls within the absolute prohibition category of the unsafe work taxonomy, the agent MUST refuse execution unconditionally. The agent MUST NOT permit any principal at any authority level — including the system operator, the deploying organisation's administrator, or a peer agent asserting elevated privileges — to override an absolute prohibition at runtime.
4.3.2 Absolute prohibition refusals MUST be executed without partial compliance. The agent MUST NOT execute any component of a prohibited instruction on the grounds that partial execution is less harmful than full execution. The sole exception is a minimal completion action explicitly required to reach a safe state (e.g., an agent that has begun a physical motion sequence must complete the motion to a safe stop position rather than halting mid-motion in a mechanically dangerous configuration).
4.3.3 The agent MUST generate a refusal record at the time of refusal in accordance with Section 4.6. The refusal record MUST be written before any attempt is made to notify the instruction source.
4.3.4 The agent MUST NOT reveal the specific technical parameters of its absolute prohibition list to the instruction source in its refusal response. The refusal response MAY indicate the general category of the constraint violated without disclosing implementation detail that could enable circumvention attempts.
4.4.1 When an instruction falls within the conditional restriction or escalation-required categories of the unsafe work taxonomy, the agent MUST suspend execution and initiate the escalation procedure defined in the agent's governance configuration.
4.4.2 The agent MUST NOT proceed with execution while awaiting escalation resolution unless the escalation procedure explicitly returns a confirmed authorisation signal from the required authorising parties. A timeout, a non-response, or an ambiguous response MUST be treated as a non-authorisation and MUST result in continued suspension.
4.4.3 The escalation procedure MUST route to at least one human principal who: (a) is distinct from the instruction source; (b) holds the authorisation scope necessary to approve the category of action in question; and (c) is reachable through a communication channel that does not pass through the same system or session as the original instruction.
4.4.4 Where the instruction source is itself a human operator with the highest available permission tier, the escalation MUST route to a peer or superior within the organisation's governance structure and MUST NOT loop back to the instruction source for self-authorisation.
4.4.5 The agent MUST enforce a maximum escalation window. If the required authorisation is not received within the configured window, the task MUST be marked as lapsed and a lapsed-task record MUST be generated. The agent MUST NOT allow an instruction to remain in indefinite suspension without either resolution or lapse designation.
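The suspension, timeout, and lapse semantics of 4.4.2 and 4.4.5 can be sketched as a small state function. This is a minimal illustration, not a reference implementation; the function and state names (`resolve_escalation`, `TaskState`) are assumptions introduced here, and a real deployment would persist the lapsed-task record rather than merely returning a state.

```python
from enum import Enum

class TaskState(Enum):
    SUSPENDED = "suspended"
    AUTHORISED = "authorised"
    LAPSED = "lapsed"

def resolve_escalation(submitted_at, window_seconds, authorisation, now):
    """Illustrative mapping of escalation responses to task states.

    Per 4.4.2, a timeout, non-response, or ambiguous response is treated
    as non-authorisation; per 4.4.5, an expired window marks the task lapsed.
    """
    if authorisation == "confirmed":
        return TaskState.AUTHORISED
    if now - submitted_at >= window_seconds:
        return TaskState.LAPSED       # generate the lapsed-task record here
    return TaskState.SUSPENDED        # continue suspension; never execute
```

Note that only an explicit confirmed signal ever yields execution; every other path keeps the task suspended or lapses it, which is the invariant 4.4.2 requires.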
4.5.1 The refusal and escalation requirements of sections 4.3 and 4.4 MUST apply equally regardless of whether the instruction source is: (a) a human operator via direct interface; (b) an automated orchestration pipeline; (c) a peer agent operating within the same multi-agent system; (d) an API caller with a valid service credential; or (e) a workflow engine executing a pre-approved task template.
4.5.2 The agent MUST NOT maintain a whitelist of instruction sources that are exempt from instruction evaluation. Pre-authorised instruction sources MAY receive expedited evaluation for low-risk task categories but MUST NOT bypass evaluation entirely.
4.5.3 When an instruction originates from a peer agent asserting a higher authority level than the receiving agent's own operational scope, the receiving agent MUST verify that the asserted authority level is consistent with the peer agent's registered governance profile before adjusting its evaluation thresholds. Unverifiable authority assertions MUST be treated as equivalent to the lowest applicable authority tier.
4.6.1 The agent MUST generate a structured log record for every refusal or escalation event. The log record MUST include: (a) a unique event identifier; (b) the timestamp of instruction receipt and the timestamp of refusal or escalation initiation; (c) a hash or digest of the instruction content sufficient to enable reconstruction of the instruction without logging raw content that may contain sensitive data; (d) the taxonomy category that triggered the refusal or escalation; (e) the identity or session identifier of the instruction source; (f) the escalation route initiated, if applicable; and (g) the final disposition of the task.
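The record fields (a)–(g) of 4.6.1 can be expressed as a structured type. The sketch below is illustrative; the field and function names are assumptions, and the digest shown (SHA-256 of the raw instruction) is one way to satisfy requirement (c) without logging sensitive content.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from uuid import uuid4

@dataclass(frozen=True)
class RefusalRecord:
    """Illustrative mapping of fields (a)-(g) from section 4.6.1."""
    event_id: str            # (a) unique event identifier
    received_at: str         # (b) instruction receipt timestamp
    decided_at: str          # (b) refusal/escalation initiation timestamp
    instruction_digest: str  # (c) digest of content, never raw content
    taxonomy_category: str   # (d) triggering taxonomy category
    source_id: str           # (e) instruction source identity/session
    escalation_route: str    # (f) escalation route, if any
    disposition: str         # (g) final disposition of the task

def make_refusal_record(instruction, category, source_id, received_at,
                        escalation_route="", disposition="refused"):
    return RefusalRecord(
        event_id=str(uuid4()),
        received_at=received_at,
        decided_at=datetime.now(timezone.utc).isoformat(),
        instruction_digest=hashlib.sha256(instruction.encode()).hexdigest(),
        taxonomy_category=category,
        source_id=source_id,
        escalation_route=escalation_route,
        disposition=disposition,
    )
```

Keeping the record a frozen dataclass makes accidental post-hoc mutation a runtime error, which complements (but does not replace) the tamper-evident store required by 4.6.2.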
4.6.2 Refusal and escalation logs MUST be written to an append-only, tamper-evident store that is not accessible to the instruction source or to the agent's own runtime execution environment.
4.6.3 Refusal and escalation logs MUST be retained for a minimum of five years for Financial-Value, Crypto/Web3, Public Sector / Rights-Sensitive, and Safety-Critical agent profiles, and a minimum of two years for all other profiles, unless a longer period is mandated by applicable regulatory requirements.
4.6.4 The deploying organisation MUST ensure that refusal and escalation logs are reviewed at a minimum frequency of once per month by a designated governance reviewer who is independent of the teams that configure and operate the agent.
4.7.1 The deploying organisation MUST define and document which categories of refusal or escalation decision can be overridden at runtime by a human operator and under what conditions. Categories that constitute absolute prohibitions under section 4.3 MUST NOT be listed as overridable.
4.7.2 Where a runtime override is permitted, the agent MUST require the overriding operator to provide an explicit override justification that is logged before the override takes effect. The justification MUST include the operator's identity, a timestamp, a statement of the business or safety rationale, and an acknowledgement that the operator assumes documented accountability for the resulting action.
4.7.3 Override events MUST be surfaced in the organisation's governance reporting and MUST trigger a post-event review within five business days to assess whether the override reflected a gap in the unsafe work taxonomy, a policy misconfiguration, or a legitimate edge case requiring taxonomy update.
4.7.4 The agent MUST NOT allow override instructions to be issued through the same channel or session as the original refused instruction unless no alternative channel exists. Where the same channel is used, the agent MUST apply an additional temporal delay and a secondary confirmation step before accepting the override.
4.8.1 The agent MUST be designed such that repeated, escalating, or emotionally pressured instructions to override a refusal do not degrade the refusal outcome. The agent's refusal posture MUST be invariant to social engineering vectors including appeals to urgency, claims of superior authority not verifiable through the governance configuration, and expressions of user frustration or distress.
4.8.2 The agent SHOULD generate an increased-scrutiny alert when it detects a pattern of coercive override attempts on a refused task. This alert MUST be routed to the designated governance reviewer independently of the escalation path used for the original refusal.
4.8.3 The agent MUST NOT reframe or reinterpret a refused instruction in response to rephrasing by the instruction source, where the rephrased instruction is semantically equivalent to the refused instruction. Semantic equivalence detection MUST be evaluated against the original instruction's taxonomy classification, not against the surface form of the rephrased instruction.
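The comparison rule in 4.8.3 — equivalence judged on taxonomy classification, not surface form — can be illustrated with a deliberately trivial classifier. The keyword matching below is a stand-in for a real semantic classifier; only the comparison logic is the point, and all names are assumptions introduced for illustration.

```python
def classify(instruction):
    """Toy stand-in for a semantic classifier mapping text to a taxonomy category."""
    text = instruction.lower()
    if "failsafe" in text or "interlock" in text:
        return "disable_failsafe"
    return "unclassified"

def is_semantically_equivalent(refused, rephrased):
    """Per 4.8.3: compare taxonomy classifications, not surface wording."""
    refused_category = classify(refused)
    return (refused_category != "unclassified"
            and classify(rephrased) == refused_category)
```

A rephrasing that lands in the same taxonomy category as the refused original inherits the refusal, however different its wording.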
4.9.1 For Cross-Border / Multi-Jurisdiction agents, the unsafe work taxonomy MUST be extended to include jurisdiction-specific legal prohibitions applicable to the agent's operational footprint, and instruction evaluation MUST incorporate jurisdiction resolution — determining which legal regime governs the instruction — before applying the relevant prohibition set.
4.9.2 For Embodied / Edge / Robotic agents operating in environments with limited connectivity, the agent MUST maintain a locally cached copy of the full unsafe work taxonomy and MUST NOT defer evaluation to a remote evaluation service when that service is unreachable. The cached taxonomy MUST be the most recently synchronised version and MUST carry a staleness indicator that triggers an operational degradation mode if the cache is more than 72 hours out of synchronisation.
4.9.3 For Research / Discovery agents operating under experimental or exploratory task regimes, the deploying organisation MUST ensure that the creative latitude granted by research workflows does not extend to the suspension or bypass of refusal and escalation obligations. Research task framing MUST NOT constitute a recognised exemption from instruction evaluation.
4.9.4 For Customer-Facing agents, refusal responses delivered to end users MUST be calibrated to be clear, non-stigmatising, and actionable — providing the user with information about how to seek assistance through alternative means — without disclosing the internal taxonomy parameters or escalation routing details of the governance configuration.
The central structural vulnerability that this dimension addresses is the conflation of authorisation with legitimacy. Conventional access control architectures are designed to answer one question: does this principal have the permission to perform this class of action? That question is necessary but not sufficient for safe AI agent governance. An agent that executes any instruction for which a valid credential exists has no effective safety layer — it has a perimeter without an interior. The practical consequence is that any principal who can obtain, steal, coerce, or inherit a sufficiently privileged credential can direct the agent to perform any action within that credential's scope, regardless of whether that action is safe, ethical, or policy-compliant.
This is not a theoretical concern. The examples in Section 3 demonstrate three distinct failure modes — insider fraud, safety management failure, and procedural rights violation — all of which involved valid credentials and none of which were detectable by an access-control-only architecture. The refusal governance framework in this dimension adds a semantic evaluation layer that operates downstream of authorisation: it asks not merely whether the principal can issue this instruction, but whether the agent should comply with it.
Structural controls — access control matrices, permission hierarchies, API token scopes — are deterministic and auditable but blind to semantic content. They enforce the rules that were anticipated when the control was designed. Behavioural controls — refusal logic, escalation gates, coercion resistance — operate at the semantic level and can respond to unanticipated instruction patterns, novel attack vectors, and the kinds of legitimate-but-dangerous instructions that no access control designer thought to prohibit by permission scope.
The combination is necessary because neither class of control is sufficient alone. A purely structural control architecture is vulnerable to insider threat, credential compromise, and legitimate-authority abuse. A purely behavioural control architecture without structural underpinning is subject to manipulation through authoritative assertion — a sophisticated attacker who can convince an agent that their instruction is legitimate can bypass semantic evaluation entirely. This dimension's requirements are designed to operate as the behavioural layer of a defence-in-depth architecture in which structural controls (governed by AG-004, AG-055) provide the authorisation floor and behavioural controls provide the semantic ceiling.
A specific failure mode that warrants dedicated reasoning is the exploitation of authority gradients in multi-agent systems and orchestration pipelines. When an instruction passes through multiple system components before reaching the executing agent, each handoff is a potential manipulation point. An adversary who can intercept or inject at any handoff can present the executing agent with an instruction that carries the apparent authority of the originating principal even though the instruction's content has been modified. This dimension's requirement that instruction evaluation be performed at each step boundary (4.2.4), that peer-agent authority assertions be verified against registered governance profiles (4.5.3), and that pre-authorised instruction sources not be exempt from evaluation (4.5.2) are all specifically designed to close the authority gradient exploitation surface.
Inconsistent refusal behaviour — an agent that refuses a given instruction type in some contexts but not others — is itself a governance failure. Inconsistency creates an attack surface (adversaries can probe for contexts in which refusal is suppressed) and undermines organisational trust in the governance framework. This dimension's requirements for semantic equivalence detection (4.8.3), coercion-invariant refusal posture (4.8.1), and cross-profile taxonomy application (4.9.1–4.9.3) are all aimed at producing a governance property of refusal consistency: the agent's decision to refuse or escalate a given semantic instruction should be the same regardless of how the instruction is framed, who issues it, through which channel it arrives, or how much pressure is applied.
Declarative Taxonomy Configuration. Implement the unsafe work taxonomy as a declarative, machine-readable policy document (e.g., a structured JSON or YAML schema) that is loaded into the agent's governance runtime at initialisation and cryptographically signed by the responsible governance authority. This approach enables version-controlled review, diff-based audit of changes, and automated validation against the organisation's master policy registry. Avoid embedding taxonomy logic directly in agent prompts or model weights where it is opaque to audit and difficult to update.
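A minimal sketch of the declarative approach, assuming a JSON document (YAML would serve equally) and the three tiers required by 4.1.2. The category names, schema keys, and `load_taxonomy` function are illustrative assumptions, not a normative schema; a production loader would also verify the governance authority's cryptographic signature before use.

```python
import json

# Hypothetical taxonomy document illustrating the three tiers of 4.1.2.
TAXONOMY_DOC = """
{
  "version": "2.3.0",
  "effective_date": "2024-06-01",
  "categories": {
    "disable_safety_interlock": {"tier": "absolute_prohibition"},
    "bulk_payment_execution":   {"tier": "conditional_restriction",
                                 "requires": ["dual_authorisation"]},
    "batch_case_determination": {"tier": "escalation_required",
                                 "escalate_to": "legal_compliance_unit"}
  }
}
"""

VALID_TIERS = {"absolute_prohibition", "conditional_restriction",
               "escalation_required"}

def load_taxonomy(doc):
    """Parse and validate the declarative taxonomy before loading it
    into the governance runtime (signature verification omitted)."""
    taxonomy = json.loads(doc)
    for name, entry in taxonomy["categories"].items():
        if entry["tier"] not in VALID_TIERS:
            raise ValueError(f"unknown tier for category {name!r}")
    return taxonomy
```

Because the document is plain structured data, every change produces a reviewable diff and the version and effective date travel with the policy itself.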
Pre-Execution Evaluation Pipeline. Implement instruction evaluation as a discrete, synchronous pipeline stage that runs before any action execution context is established. The pipeline should: (a) parse and classify the instruction against the taxonomy; (b) resolve contextual risk signals; (c) apply jurisdiction resolution where applicable; and (d) return a disposition code (proceed, escalate, refuse) before any action module is invoked. This architectural separation ensures that evaluation cannot be bypassed by action modules that initiate execution speculatively.
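Stages (a)–(d) of the pipeline can be sketched as a pure function that returns a disposition before any action module is touched. The category map and signal names below are illustrative assumptions; the one deliberate design point is that contextual signals (4.2.2(d)) can only raise a disposition, never lower one.

```python
from enum import Enum

class Disposition(Enum):
    PROCEED = "proceed"
    ESCALATE = "escalate"
    REFUSE = "refuse"

# Illustrative category-to-tier map; in practice this comes from the
# declarative taxonomy loaded at initialisation.
CATEGORY_TIER = {
    "disable_safety_interlock": Disposition.REFUSE,
    "bulk_payment_execution": Disposition.ESCALATE,
}

def evaluate_instruction(category, risk_signals):
    """Return a disposition code before any action module is invoked.

    Risk signals may escalate a PROCEED disposition but can never
    downgrade a REFUSE or ESCALATE one.
    """
    disposition = CATEGORY_TIER.get(category, Disposition.PROCEED)
    if disposition is Disposition.PROCEED and risk_signals & {
            "new_beneficiary", "jurisdiction_flag", "velocity_anomaly"}:
        disposition = Disposition.ESCALATE
    return disposition
```

In the Section 3 treasury scenario, the combination of `new_beneficiary` and `jurisdiction_flag` signals would have produced an ESCALATE disposition even though the category itself was permitted.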
Dual-Write Logging Architecture. Write refusal and escalation logs to two independent storage targets simultaneously — a local append-only log and a remote governance log store. This ensures log availability if either store is compromised, and provides the tamper-evidence required by section 4.6.2 without relying on a single point of trust.
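One way to sketch the dual-write pattern, assuming an in-memory toy store with a hash chain for tamper evidence (a production deployment would use durable, access-controlled stores). The class and function names are illustrative.

```python
import hashlib
import json

class AppendOnlyLog:
    """Toy append-only log: each entry's digest chains over the previous
    digest, so any retroactive edit breaks every subsequent digest."""
    def __init__(self):
        self.entries = []
        self._prev = "0" * 64

    def append(self, record):
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((self._prev + payload).encode()).hexdigest()
        self.entries.append({"record": record, "chain": digest})
        self._prev = digest
        return digest

def dual_write(record, local, remote):
    """Write to both stores; a digest mismatch signals divergence and
    should raise a governance alert."""
    return local.append(record) == remote.append(record)
```

Comparing the chained digests across the two stores gives a cheap integrity check: an attacker who can rewrite one store cannot silently keep it consistent with the other.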
Human Escalation Channel Separation. Implement escalation routing through a channel that is architecturally separate from the agent's primary instruction interface. If the agent receives instructions via an API, escalation notifications should route via a different protocol (e.g., email gateway, ticketing system, pager integration) to ensure that an attacker who has compromised the instruction channel cannot also intercept or suppress escalation notifications.
Staleness-Aware Offline Mode. For Embodied / Edge / Robotic agents, implement a tiered offline mode: within the 72-hour synchronisation window, the agent operates normally with the cached taxonomy; beyond 72 hours, the agent enters a restricted operation mode that blocks any instruction category that is not in the lowest-risk tier of the taxonomy, pending resynchronisation. This pattern satisfies section 4.9.2 without requiring the agent to refuse all operation when connectivity is lost.
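The tiered behaviour can be reduced to two small functions, using the 72-hour window from 4.9.2. The tier labels are illustrative assumptions; the invariant worth noting is that absolute prohibitions are refused in every mode, while restricted mode additionally blocks everything above the lowest-risk tier.

```python
SYNC_WINDOW_HOURS = 72  # synchronisation window from section 4.9.2

def operating_mode(hours_since_sync):
    """Normal inside the sync window, restricted beyond it."""
    return "normal" if hours_since_sync <= SYNC_WINDOW_HOURS else "restricted"

def may_execute(category_tier, hours_since_sync):
    """Absolute prohibitions never execute; in restricted mode only the
    lowest-risk tier may execute, pending resynchronisation."""
    if category_tier == "absolute_prohibition":
        return False
    if operating_mode(hours_since_sync) == "restricted":
        return category_tier == "lowest_risk"
    return True
```

This keeps an edge agent useful when connectivity is lost without ever widening its permitted action surface as its taxonomy ages.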
Refusal Response Templates. Maintain a library of refusal response templates, one per taxonomy category, that are crafted to be informative without disclosing implementation detail. Templates for customer-facing agents (section 4.9.4) should be written in plain language, include a reference number for the refusal event, and provide a concrete alternative action path (e.g., a phone number, a human agent escalation option, or a formal request process).
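A minimal sketch of such a template library, with wording invented for illustration. The key properties are those required by 4.3.4 and 4.9.4: a reference number, an alternative action path, and no internal taxonomy or routing detail.

```python
# One template per taxonomy category; wording here is illustrative only.
REFUSAL_TEMPLATES = {
    "absolute_prohibition": (
        "This request cannot be completed because it conflicts with a "
        "safety policy. Reference: {ref}. For assistance, contact {alt_path}."
    ),
    "escalation_required": (
        "This request has been paused for review. Reference: {ref}. "
        "You will be notified of the outcome, or you can contact {alt_path}."
    ),
}

def render_refusal(category, ref, alt_path):
    """Render a user-facing refusal without internal implementation detail."""
    return REFUSAL_TEMPLATES[category].format(ref=ref, alt_path=alt_path)
```

The reference number lets the governance reviewer correlate the user-facing message with the structured log record from section 4.6 without exposing the record itself.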
Coercion Pattern Detection. Implement a stateful coercion detector that tracks the instruction history within a session and across sessions for a given principal. Define a coercion threshold (e.g., three or more semantically equivalent refused instructions within a 30-minute window from the same source) that triggers the increased-scrutiny alert required by section 4.8.2. The detector should be calibrated per profile — customer-facing agents may see higher rates of legitimate rephrasing from frustrated users and require a higher threshold than enterprise workflow agents.
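The threshold logic above can be sketched as a sliding-window counter keyed by principal and taxonomy category. The class name and defaults are illustrative; the three-in-30-minutes example threshold from the guidance is used as the default.

```python
from collections import defaultdict, deque

class CoercionDetector:
    """Stateful detector: N or more refused, semantically equivalent
    instructions from one principal within a rolling time window
    trigger the increased-scrutiny alert of section 4.8.2."""

    def __init__(self, threshold=3, window_seconds=1800.0):
        self.threshold = threshold
        self.window = window_seconds
        self._events = defaultdict(deque)  # (principal, category) -> timestamps

    def record_refusal(self, principal, category, ts):
        """Record a refusal; return True when the alert should fire."""
        q = self._events[(principal, category)]
        q.append(ts)
        while q and ts - q[0] > self.window:
            q.popleft()  # drop events outside the rolling window
        return len(q) >= self.threshold
```

Keying on both principal and category keeps a frustrated customer-facing user rephrasing one request from being confused with a principal probing multiple taxonomy categories; per-profile calibration then reduces to constructor arguments.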
Do not use the model's own output as the refusal gate. Relying on the underlying language model to refuse a request through its own trained behaviour is insufficient for governance purposes. Model-native refusals are inconsistent, subject to jailbreak and prompt injection, and not auditable. The refusal governance layer must be an architectural control that operates independently of the model's generation behaviour.
Do not implement taxonomy whitelisting by principal identity. A common shortcut is to designate certain principals (e.g., the system administrator, the highest-tier operator) as exempt from evaluation. This creates an exploitable blind spot: if that principal's account is compromised, their instructions bypass all governance controls. Principals may have elevated authorisation scope, but no principal is exempt from evaluation.
Do not allow refusal decisions to be cached or reused across instruction instances. A refusal decision is specific to an instruction instance, including its context. An agent that caches a "safe" classification for an instruction type and skips evaluation for subsequent instances of the same type is vulnerable to context-switching attacks in which the same surface instruction carries different semantic payload depending on accumulated session context.
Do not suppress escalation notifications to avoid false positives. Organisations frequently disable escalation routing for high-volume task categories because the escalation notification volume is operationally burdensome. The correct response to high escalation volume is to refine the taxonomy classification thresholds, not to suppress notifications. Notification suppression eliminates human oversight for exactly the task categories that are generating safety signals.
Do not use the refusal layer as a content moderation system. Conflating task refusal governance with content moderation (e.g., filtering offensive language) introduces scope confusion and creates pressure to relax governance thresholds on the grounds that refusals are "over-triggering" on content issues. The unsafe work taxonomy should govern actions and their consequences, not the surface form of instructions.
Do not allow research or experimental mode to serve as a governance bypass. Research / Discovery agents and sandbox environments frequently operate under relaxed governance configurations. If those configurations include suspension of refusal evaluation, the agent becomes ungoverned precisely in the environments most likely to encounter novel and unanticipated instruction types. Research mode MUST NOT include refusal bypass as a feature.
| Maturity Level | Characteristics |
|---|---|
| Level 1 — Initial | No formal unsafe work taxonomy. Refusal behaviour is solely model-native. No structured logging of refusal events. Escalation, if it occurs, is ad hoc and unrecorded. |
| Level 2 — Developing | A documented unsafe work taxonomy exists but is not machine-readable or version-controlled. Refusal logic is partially implemented as a pre-execution check for a subset of instruction categories. Escalation routing is defined but not consistently enforced. Logging is present but not tamper-evident. |
| Level 3 — Defined | Full declarative taxonomy implemented as a versioned, signed policy document. Pre-execution evaluation pipeline covers all instruction categories. Escalation routing is automated, channel-separated, and enforced with timeout handling. Logging is dual-write and tamper-evident. Coercion detection is in place. Taxonomy review cycle is operative. |
| Level 4 — Managed | Taxonomy is dynamically risk-stratified based on real-time contextual signals (e.g., transaction velocity, jurisdictional flags, baseline deviation). Override events trigger automated post-event reviews. Refusal consistency is quantitatively measured and reported. Cross-agent taxonomy synchronisation is in place for multi-agent deployments. |
| Level 5 — Optimising | Refusal and escalation data feeds a continuous improvement loop for taxonomy calibration. Machine-assisted taxonomy gap analysis runs against refusal logs quarterly. Cross-organisation benchmarking of refusal rate distributions informs threshold tuning. Formal verification of refusal logic against safety requirements is in place for Safety-Critical and Embodied profiles. |
7.1.1 Unsafe Work Taxonomy Document. A version-controlled, machine-readable document defining all prohibited, restricted, and escalation-required instruction categories applicable to the agent's deployment context. Must include: version number, effective date, responsible governance authority signature or cryptographic sign-off, and a changelog from the preceding version. Retention: lifetime of deployment plus five years.
7.1.2 Taxonomy Review Records. Dated records of each quarterly taxonomy review, including attendees, items reviewed, decisions made, and any resulting taxonomy changes. For Safety-Critical and Embodied profiles, records must include the safety officer sign-off required by section 4.1.4. Retention: five years from review date.
7.1.3 Refusal and Escalation Log Store. The full append-only, tamper-evident log store described in section 4.6, including all structured records for every refusal and escalation event. Retention: as specified in section 4.6.3, with a minimum of five years for regulated profiles.
7.1.4 Escalation Configuration Documentation. Documentation of the escalation routing configuration, including: the categories that trigger escalation, the escalation targets and their governance roles, the communication channels used, and the maximum escalation window for each category. Must be version-controlled. Retention: lifetime of deployment plus two years.
7.1.5 Override Records. A complete record of all operator override events, including the structured log record required by section 4.7.2 and the post-event review record required by section 4.7.3. Retention: five years from override event date.
7.1.6 Coercion Detection Alerts and Dispositions. A log of all coercion pattern alerts generated under section 4.8.2, including the alert timestamp, the session and principal involved, the pattern that triggered the alert, and the governance reviewer's disposition of the alert. Retention: three years from alert date.
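One simple pattern a section 4.8.2 detector might flag is a session repeatedly re-issuing an instruction category the agent has already refused. The sketch below illustrates that single heuristic; the field names and threshold are assumptions, and a real detector would combine several patterns.

```python
from collections import Counter

def coercion_alerts(refusal_log: list[dict], threshold: int = 3) -> list[dict]:
    """Flag (session, category) pairs with repeated refused re-issuance.

    `threshold` is an illustrative calibration value.
    """
    counts = Counter((r["session_id"], r["category"]) for r in refusal_log)
    return [
        {"session_id": session, "category": category,
         "pattern": "repeated_refused_reissue", "count": n}
        for (session, category), n in counts.items()
        if n >= threshold
    ]
```

Each emitted alert would then be written to the 7.1.6 log alongside the governance reviewer's eventual disposition.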
7.1.7 Pre-Deployment Evaluation Test Results. Results of all pre-deployment test runs conducted against the test specification in Section 8, including the test environment description, test case inputs and outputs, and conformance scores. Retention: lifetime of deployment.
7.1.8 Governance Review Reports. Monthly governance review reports produced under section 4.6.4, summarising refusal and escalation activity, override trends, coercion alert rates, and any taxonomy change recommendations arising from the review. Retention: three years from report date.
Objective: Verify that the agent unconditionally refuses instructions falling within the absolute prohibition category and does not execute any component of the instruction.
Method: Present the agent with a set of 20 test instructions, 10 of which are constructed to fall within the absolute prohibition category of the deployment's taxonomy (as confirmed by the governance authority prior to testing) and 10 of which are clean control instructions. Issue all 20 instructions through the agent's primary instruction interface using a credential at the highest available operator permission tier. Record whether the agent executes, partially executes, or refuses each instruction, and whether a structured refusal log record is generated for each refused instruction.
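The method above can be sketched as a small test harness. The agent object, its `submit` method, and the outcome fields are assumptions about the test environment's interface, not names defined by this protocol.

```python
def run_prohibition_suite(agent, prohibited: list[str], controls: list[str]) -> dict:
    """Issue prohibited and clean control instructions and tally outcomes.

    Interface assumptions: agent.submit(instr) returns an object with a
    `status` in {"executed", "partial", "refused"} and, for refusals,
    a truthy `refusal_log_record` when a structured log entry was written.
    """
    results = {"refused_prohibited": 0, "partial_executions": 0,
               "executed_controls": 0, "missing_log_records": 0}
    for instr in prohibited:
        outcome = agent.submit(instr)          # expected: full refusal
        if outcome.status == "refused":
            results["refused_prohibited"] += 1
            if not outcome.refusal_log_record:
                results["missing_log_records"] += 1
        elif outcome.status == "partial":
            results["partial_executions"] += 1
    for instr in controls:
        if agent.submit(instr).status == "executed":
            results["executed_controls"] += 1
    return results
```

A conforming run over the 10 prohibited and 10 control instructions would show `refused_prohibited == 10`, `executed_controls == 10`, and zero partial executions or missing log records.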
Pass Criteria:
Conformance Score:
| Regulation | Provision | Relationship Type |
|---|---|---|
| EU AI Act | Article 9 (Risk Management System) | Direct requirement |
| EU AI Act | Article 15 (Accuracy, Robustness and Cybersecurity) | Direct requirement |
| NIST AI RMF | GOVERN 1.1, MAP 3.2, MANAGE 2.2 | Supports compliance |
| ISO 42001 | Clause 6.1 (Actions to Address Risks), Clause 8.2 (AI Risk Assessment) | Supports compliance |
Article 9 requires providers of high-risk AI systems to establish and maintain a risk management system that identifies, analyses, estimates, and evaluates risks. Agent Task Refusal and Unsafe Work Rejection Governance implements a specific risk mitigation measure within this framework. The regulation requires that risks be mitigated "as far as technically feasible" using appropriate risk management measures. For deployments classified as high-risk under Annex III, compliance with AG-728 directly addresses the Article 9 obligation by providing structural governance controls rather than relying solely on the agent's own reasoning or behavioural compliance.
Article 15 requires high-risk AI systems to achieve appropriate levels of accuracy, robustness, and cybersecurity. Agent Task Refusal and Unsafe Work Rejection Governance directly supports the robustness and cybersecurity requirements by implementing structural controls that resist adversarial manipulation and ensure system integrity under attack conditions.
GOVERN 1.1 addresses legal and regulatory requirements; MAP 3.2 addresses risk context mapping; MANAGE 2.2 addresses risk mitigation through enforceable controls. AG-728 supports compliance by establishing structural governance boundaries that implement the framework's approach to AI risk management.
Clause 6.1 requires organisations to determine actions to address risks and opportunities within the AI management system. Clause 8.2 requires AI risk assessment. Agent Task Refusal and Unsafe Work Rejection Governance implements a risk treatment control within the AI management system, supporting compliance with the requirement for structured risk mitigation.
| Field | Value |
|---|---|
| Severity Rating | Critical |
| Blast Radius | Organisation-wide — potentially cross-organisation where agents interact with external counterparties or shared infrastructure |
| Escalation Path | Immediate executive notification and regulatory disclosure assessment |
Consequence chain: Without agent task refusal and unsafe work rejection governance, the governance framework has a structural gap that can be exploited at machine speed. The failure mode is not gradual degradation — it is a binary absence of control that permits unbounded agent behaviour in the dimension this protocol governs. The immediate consequence is uncontrolled agent action within the scope of AG-728, potentially cascading to dependent dimensions and downstream systems. The operational impact includes regulatory enforcement action, material financial or operational loss, reputational damage, and potential personal liability for senior managers under applicable accountability regimes. Recovery requires both technical remediation and regulatory engagement, with timelines measured in weeks to months.