Skip to content

Governance Plane

GovernancePlane is the runtime's last line of defense: hard-stop limits enforced outside agent code so the agent cannot bypass them via recovery. When a limit configured with on_breach="terminate" is exceeded, the orchestrator raises GovernanceBreachError, which the session marks as FAILED and re-raises without going through the failure classifier or the recovery loop.

This is the technical primitive auditors expect for EU AI Act deployments — Article 14 (human oversight via stopping conditions), Article 15 (robustness / fail-safes), and Article 26 (deployer monitoring + reporting). The full article-by-article compliance mapping ships in 0.3.0 Sprint 6 (docs/compliance/).

When to use this

  • You ship to EU customers and any user-deployed system would meet the Annex III "high-risk" definition. Article 9 risk management, Article 12 record-keeping, and Article 26 deployer-side monitoring all want a runtime kill-switch.
  • Cost is dollars per token, not milliseconds per request. A budget cap that the agent could in principle catch + recover from is not a hard cap. MaxBudgetLimit(on_breach="terminate") is.
  • The agent runs unattended and must stop on its own. Production multi-step agent loops can drift; a 25-turn iteration cap + a consecutive-failure cap rules out the canonical runaway-loop failure mode.
  • You need a rollout signal before flipping a knob to hard-stop. Use on_breach="alert" first; measure breach rates from the governance.alert events; flip to "terminate" when you trust the threshold.

When NOT to use this

  • For recoverable token / cost ceilings inside one session that the agent is allowed to react to. Use UsageLimits instead — its UsageLimitExceededError is catchable and recovery flows can respond.
  • For per-tool authorization. Use PermissionEnforcer.
  • For pattern blocking on tool inputs or outputs. Use the built-in PatternGuardrail / PromptInjectionGuardrail (or a custom Guardrail) — see Permissions and API: Guardrails.

Quickstart

from techrevati.runtime import (
    AgentSession,
    GovernancePlane,
    MaxBudgetLimit,
    MaxConsecutiveFailuresLimit,
    MaxIterationsLimit,
    MaxToolCallsLimit,
)

plane = GovernancePlane(
    limits=(
        MaxIterationsLimit(value=25, on_breach="terminate"),
        MaxBudgetLimit(value=5.00, on_breach="terminate"),
        MaxConsecutiveFailuresLimit(value=3, on_breach="terminate"),
        MaxToolCallsLimit(value=100, on_breach="alert"),
    ),
)

session = AgentSession(role="writer", phase="draft", governance=plane)

The orchestrator ticks the plane's counters at three points:

  • Pre-turnrecord_turn_start() and enforce(). The iteration cap fires here.
  • Pre-toolrecord_tool_call() and enforce(). The tool-call cap fires here.
  • Post-turnrecord_success() or record_failure(), record_cost(cost_delta), and enforce(). The budget and consecutive-failures caps fire here.

The four built-in limits

MaxIterationsLimit

Caps total turns in the session. Distinct from AgentSession.max_iterations — that one raises a recoverable MaxIterationsExceededError; this one is terminal.

MaxIterationsLimit(value=25, on_breach="terminate")

MaxBudgetLimit

Caps cumulative cost in USD. Distinct from UsageLimits.cost_usd_max — that one is recoverable; this one is terminal.

MaxBudgetLimit(value=5.00, on_breach="terminate")

MaxConsecutiveFailuresLimit

Counts consecutive failures. A single successful turn resets the counter to zero. Catches "the agent retries the same broken thing forever" failure modes that per-step retry budgets alone do not.

MaxConsecutiveFailuresLimit(value=3, on_breach="terminate")

MaxToolCallsLimit

Caps total tool invocations in the session. Distinct from UsageLimits.tool_calls_max only in being terminal.

MaxToolCallsLimit(value=100, on_breach="alert")

on_breach modes

Mode Behavior
"terminate" (default) Raises GovernanceBreachError. Worker → FAILED. Recovery loop is NOT invoked.
"alert" Emits a governance.alert event on every breached evaluation. Session continues.

Rolling out a new limit safely: deploy with "alert" for 1–2 weeks, observe the governance.alert event rate, then flip to "terminate".

Event surface

Two new AgentEventName values surface in 0.3.0:

  • governance.breach — emitted before GovernanceBreachError raises so downstream sinks see the breach in the audit log even when the exception propagates past them.
  • governance.alert — emitted once per evaluation per breached alert-mode limit.

Both carry data = {limit_name, observed, ceiling, scope} for sink serialization.

Composing with UsageLimits

These two primitives are not redundant — they sit at different layers.

sess = AgentSession(
    role="writer",
    phase="draft",
    # Soft cap: agent code can catch UsageLimitExceededError and react.
    usage_limits=UsageLimits(total_tokens_max=200_000),
    # Hard cap: governance breach terminates the session regardless.
    governance=GovernancePlane(
        limits=(MaxBudgetLimit(value=10.00, on_breach="terminate"),),
    ),
)

A common pattern is: usage_limits cap at 80% of the budget, governance hard-stop at 100%. The agent gets a recoverable warning before the session dies.

Tuning the knobs

Knob Reasonable range Notes
MaxIterationsLimit.value 10–50 for production loops Same default as OpenAI Agents SDK.
MaxBudgetLimit.value per-customer / per-session limit Pair with UsageLimits.cost_usd_max at 80%.
MaxConsecutiveFailuresLimit.value 2–5 Below 2 is twitchy; above 5 hides real reliability bugs.
MaxToolCallsLimit.value 5×–10× expected Useful as an alert before flipping to terminate.
on_breach="alert" Always start here for new limits Measure first, terminate second.

Anti-patterns

  • Catching GovernanceBreachError and retrying. Don't. The point of the plane is that the agent cannot bypass it. If you want recoverable behavior, use UsageLimits instead.
  • Putting business logic in GovernanceState.record_*. The state object is a counter; do not subclass it to fire side effects on every tick. Add a custom EventSink for that.
  • One plane per turn. Construct the plane once and pass it to AgentSession; do not create a new plane on each turn — counters reset.
  • Mixing "alert" and "terminate" randomly across limits. Pick a rollout phase per limit, document it, and don't half-migrate.

Sources