Retry Policy

techrevati.runtime.retry_policy

Retry Policy — Failure classification and recipe lookup.

Maps failure scenarios to structured recovery steps with bounded attempts and an escalation policy. The caller decides whether and how to retry; this module provides the recipe + bookkeeping.

classify_exception() bridges Python exceptions to failure scenarios.

FailureScenario

Bases: str, Enum

Failure types that can be automatically recovered.

RecoveryStep

Bases: str, Enum

Actions that can be taken to recover from failure.

EscalationPolicy

Bases: str, Enum

What to do when max recovery attempts are exhausted.

RecoveryRecipe dataclass

RecoveryRecipe(scenario, steps, max_attempts, escalation_policy, step_retries=dict())

Recovery plan for a failure scenario.

step_retries is an optional per-step retry budget. When a step fails (the recovery context's _fail_at_attempt hook returns True), the executor retries that same step up to step_retries[step] times before declaring it a failure and moving on to the next step (which becomes remaining_steps in partial recovery). Missing keys default to a budget of 1 (single attempt) — preserving 0.1.0 / 0.2.0 semantics.

Example:

RecoveryRecipe(
    scenario=FailureScenario.LLM_ERROR,
    steps=(RecoveryStep.RETRY_WITH_BACKOFF, RecoveryStep.SWITCH_PROVIDER),
    max_attempts=2,
    escalation_policy=EscalationPolicy.ALERT_HUMAN,
    step_retries={RecoveryStep.RETRY_WITH_BACKOFF: 3},
)

This recipe fires the backoff step up to three times before failing over to the provider switch.
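The per-step budget can be pictured as a small loop. The following is a minimal sketch, not the library's executor: `run_step_with_budget` and the `step_fails` callback are hypothetical names (the callback stands in for the `_fail_at_attempt` hook), and steps are plain strings rather than `RecoveryStep` members.

```python
from typing import Callable, Mapping, Tuple


def run_step_with_budget(
    step: str,
    step_retries: Mapping[str, int],
    step_fails: Callable[[str, int], bool],
) -> Tuple[bool, int]:
    """Try one step until it succeeds or its budget runs out.

    Returns (succeeded, attempts_used). Hypothetical helper for
    illustration only; it is not part of the documented API.
    """
    # A missing key means a budget of 1, i.e. a single attempt.
    budget = step_retries.get(step, 1)
    for attempt in range(1, budget + 1):
        if not step_fails(step, attempt):
            return True, attempt
    # Budget exhausted: the step is declared failed.
    return False, budget
```

With `step_retries={"retry_with_backoff": 3}` and a callback that always reports failure, the backoff step is tried three times, while a step with no entry is tried once.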

RecoveryResult dataclass

RecoveryResult(outcome, steps_taken=0, recovered_steps=list(), remaining_steps=list(), reason='')

Outcome of a recovery attempt.

RecoveryEvent dataclass

RecoveryEvent(event_type, scenario, recipe_steps, result, timestamp)

Structured record of a recovery action.

RecoveryContext

RecoveryContext()

Tracks recovery attempts per scenario within a session.

recipe_for

recipe_for(scenario)

Look up the recovery recipe for a failure scenario.

attempt_recovery

attempt_recovery(scenario, ctx)

Attempt recovery for a failure scenario.

Returns a RecoveryResult whose outcome is one of recovered, partial_recovery, or escalation_required.

aattempt_recovery async

aattempt_recovery(scenario, ctx, *, sleeper=None)

Async variant of attempt_recovery.

Behavior matches the sync version step-for-step. The sleeper parameter is reserved for future steps that need to await a delay (e.g. backoff). Pass asyncio.sleep in production code; pass a no-op or a fake in tests for determinism. Today no step in RecoveryRecipe actually sleeps, so sleeper is unused in practice — but the contract is established now so 0.1.0 callers can rely on it.
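The injection pattern behind `sleeper` can be illustrated without the library. In this sketch, `RecordingSleeper` and `wait_between_tries` are hypothetical names: the first is a test double that records requested delays instead of waiting, the second stands in for any coroutine that accepts a `sleeper` keyword and falls back to `asyncio.sleep`.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

Sleeper = Callable[[float], Awaitable[None]]


class RecordingSleeper:
    """Test double: records each requested delay and returns immediately."""

    def __init__(self) -> None:
        self.delays: List[float] = []

    async def __call__(self, seconds: float) -> None:
        self.delays.append(seconds)


async def wait_between_tries(
    delays: List[float], *, sleeper: Optional[Sleeper] = None
) -> int:
    """Hypothetical stand-in for a step that awaits a delay per try."""
    sleep = asyncio.sleep if sleeper is None else sleeper
    for delay in delays:
        await sleep(delay)
    return len(delays)


if __name__ == "__main__":
    fake = RecordingSleeper()
    asyncio.run(wait_between_tries([1.0, 2.0], sleeper=fake))
    print(fake.delays)  # the delays that would have been slept
```

Passing the fake keeps tests deterministic and instant; production code would pass `asyncio.sleep` or omit the argument.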

backoff_delay

backoff_delay(attempt, base=2.0, jitter='decorrelated', cap=60.0, prev_delay=0.0)

Calculate backoff delay in seconds with selectable jitter algorithm.

Algorithms follow Marc Brooker / AWS Architecture Blog (https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/):

  • "none": pure exponential, base ** attempt (capped).
  • "full": uniform(0, cap_exp). Maximum spread, lowest contention.
  • "equal": cap_exp/2 + uniform(0, cap_exp/2). Half deterministic, half random.
  • "decorrelated" (default): uniform(base, prev_delay * 3). AWS's fastest algorithm. Callers passing 0 for prev_delay get base.

Backwards compatibility: jitter=True (bool) maps to "full" and jitter=False maps to "none". The base ** attempt + 25% noise formula from 0.0.0 is gone — use "equal" for similar behavior.
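The four modes and the bool mapping can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the library's code: `backoff_sketch` is a hypothetical name, `cap_exp` is assumed to mean `min(cap, base ** attempt)`, and the decorrelated upper bound is clamped to at least `base` so that `prev_delay=0` yields exactly `base`.

```python
import random


def backoff_sketch(attempt, base=2.0, jitter="decorrelated", cap=60.0, prev_delay=0.0):
    """Return a delay in seconds for the given attempt (illustrative only)."""
    # Legacy bool values: True maps to "full", False maps to "none".
    if isinstance(jitter, bool):
        jitter = "full" if jitter else "none"

    # Assumption: cap_exp is the exponential term limited by the cap.
    cap_exp = min(cap, base ** attempt)

    if jitter == "none":
        return cap_exp
    if jitter == "full":
        return random.uniform(0.0, cap_exp)
    if jitter == "equal":
        return cap_exp / 2 + random.uniform(0.0, cap_exp / 2)
    if jitter == "decorrelated":
        # Assumption: clamp the upper bound so a zero prev_delay gives base.
        upper = max(base, prev_delay * 3)
        return min(cap, random.uniform(base, upper))
    raise ValueError(f"unknown jitter mode: {jitter!r}")
```

For decorrelated jitter, a caller would feed each returned delay back in as `prev_delay` on the next attempt, so successive delays depend on the previous one rather than on the attempt number.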

next_provider

next_provider(available_providers, current_provider)

Select the next fallback provider, skipping the current one.
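One plausible reading of this behavior is sketched below. The selection order and the empty-result value are assumptions, since the reference text does not specify them: `pick_fallback` (a hypothetical name) returns the first listed provider that differs from the current one, or `None` when no alternative exists.

```python
from typing import Optional, Sequence


def pick_fallback(available: Sequence[str], current: str) -> Optional[str]:
    """Return the first provider that is not the current one (assumed order)."""
    for provider in available:
        if provider != current:
            return provider
    # Assumption: no alternative is signalled with None.
    return None
```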

smaller_context_budget

smaller_context_budget(current_chars, reduction=0.75)

Calculate a reduced context budget (75% of current by default).
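As a worked example of the arithmetic, the sketch below applies the default factor repeatedly. `shrink_budget` is a hypothetical stand-in, and truncating to an integer is an assumption; the reference text does not state how the result is rounded.

```python
def shrink_budget(current_chars: int, reduction: float = 0.75) -> int:
    """Scale a character budget by the reduction factor (assumed truncation)."""
    return int(current_chars * reduction)


if __name__ == "__main__":
    budget = 8000
    for _ in range(2):
        budget = shrink_budget(budget)
        print(budget)  # 6000, then 4500
```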

classify_exception

classify_exception(error)

Map a Python exception to a FailureScenario for recovery.

Two-pass dispatch:

  1. Type-based: isinstance against well-known stdlib classes (TimeoutError, ConnectionError family, JSONDecodeError). Walks the exception chain via __cause__ / __context__ so a RuntimeError wrapping a ConnectionError is still classified as PROVIDER_FAILURE.
  2. String match: provider SDKs that don't expose stdlib types fall through to substring matching on the rendered message.
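The two passes above can be sketched as follows. This is an illustration of the dispatch shape, not the library's rule table: `classify_sketch` and `iter_chain` are hypothetical names, scenarios are plain strings, and every label except `PROVIDER_FAILURE` (as well as both substring rules) is invented for the example.

```python
import json

# Pass 1 rules: (exception type, label). Only PROVIDER_FAILURE comes from
# the reference text; the other labels are hypothetical.
TYPE_RULES = (
    (TimeoutError, "TIMEOUT"),
    (ConnectionError, "PROVIDER_FAILURE"),
    (json.JSONDecodeError, "PARSE_ERROR"),
)

# Pass 2 rules: (lowercase substring, label). All hypothetical.
MESSAGE_RULES = (
    ("rate limit", "RATE_LIMITED"),
    ("timed out", "TIMEOUT"),
)


def iter_chain(error):
    """Yield error, then each wrapped exception via __cause__/__context__."""
    seen = set()
    current = error
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        yield current
        cause = current.__cause__
        current = cause if cause is not None else current.__context__


def classify_sketch(error: BaseException) -> str:
    # Pass 1: type-based match anywhere in the exception chain.
    for exc in iter_chain(error):
        for exc_type, label in TYPE_RULES:
            if isinstance(exc, exc_type):
                return label
    # Pass 2: substring match on the rendered message.
    message = str(error).lower()
    for needle, label in MESSAGE_RULES:
        if needle in message:
            return label
    return "UNKNOWN"
```

Running the type pass first means a wrapped `ConnectionError` wins over any incidental wording in the outer message, which matches the ordering the reference text describes.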