Rate limit

techrevati.runtime.rate_limit

Rate limiting — Token-bucket primitives for sync and async call paths.

Token-aware throttling suits LLM-provider rate limits because providers meter tokens as well as requests (TPM, RPM, daily caps). A single TokenBucket admits or delays spends against one resource (e.g. input tokens-per-minute); a RateLimiter composes named buckets, typically three, so the usual provider limits (input TPM, output TPM, request RPM) can be expressed as one object.
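As a rough illustration of how a per-minute provider limit could map onto the capacity and refill_per_second parameters documented below, here is a minimal sketch. The helper name per_minute_limit is hypothetical, and setting capacity to a full minute's allowance is an assumption about burst policy, not something this module prescribes.

```python
def per_minute_limit(limit_per_minute: float) -> dict:
    """Translate a per-minute provider limit into token-bucket settings."""
    return {
        # Assumption: allow a full minute's allowance to burst at once.
        "capacity": float(limit_per_minute),
        # Steady-state rate: spread the per-minute allowance over 60 seconds.
        "refill_per_second": limit_per_minute / 60.0,
    }


rpm = per_minute_limit(60)            # 60 requests per minute
input_tpm = per_minute_limit(90_000)  # 90,000 input tokens per minute
```

A smaller capacity with the same refill_per_second keeps the long-run rate but smooths out bursts.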

Both TokenBucket (sync, threading.Lock) and AsyncTokenBucket (asyncio.Lock + asyncio.sleep) implement the same conceptual algorithm; the async variant yields to the event loop while waiting for refill instead of blocking the thread.

Clock is injectable on both variants (Callable[[], float] returning monotonic seconds). Tests pass a ManualClock to make timing-dependent behavior deterministic; production code uses time.monotonic by default.
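The real ManualClock is not shown on this page, so the following is only a sketch of what such a test clock might look like. The class body and the advance method are assumptions; the only documented requirement is that the clock is a Callable[[], float] returning monotonic seconds.

```python
class ManualClock:
    """Hypothetical test clock: time moves only when advance() is called."""

    def __init__(self, start: float = 0.0) -> None:
        self._now = start

    def __call__(self) -> float:
        # Matches the Callable[[], float] shape expected for `clock`.
        return self._now

    def advance(self, seconds: float) -> None:
        # Assumed helper; a monotonic clock must never move backwards.
        if seconds < 0:
            raise ValueError("a monotonic clock cannot go backwards")
        self._now += seconds


clock = ManualClock()
start = clock()
clock.advance(2.5)
elapsed = clock() - start
```

Because the clock only moves on advance, a test can step through refill intervals without sleeping.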

Zero new runtime dependencies — stdlib only.

TokenBucket dataclass

TokenBucket(name, capacity, refill_per_second, clock=time.monotonic)

Classic token-bucket limiter — sync variant.

try_acquire is non-blocking and returns True only when the bucket has enough tokens. acquire sleeps until the bucket refills, capped by an optional wait timeout; on timeout it raises RateLimitExceededError rather than silently exceeding the bound.

Parameters

name: Human-readable identifier used in error messages.
capacity: Maximum tokens the bucket holds. Bursts of up to this many tokens can pass immediately.
refill_per_second: Steady-state admission rate, in tokens per second.
clock: Monotonic time source. Defaults to time.monotonic.
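To make the interplay of capacity, refill_per_second, and clock concrete, here is a minimal sketch of the classic token-bucket algorithm. SketchBucket is a hypothetical stand-in, not the library's TokenBucket, and starting with a full bucket is an assumption.

```python
import threading
import time


class SketchBucket:
    """Illustrative token bucket; not the library's implementation."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self._clock = clock
        self._tokens = self.capacity  # assumption: the bucket starts full
        self._last = clock()
        self._lock = threading.Lock()

    def _refill(self):
        # Credit tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        earned = (now - self._last) * self.refill_per_second
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now

    def try_acquire(self, tokens=1.0):
        # Non-blocking: spend only if the balance covers the cost.
        with self._lock:
            self._refill()
            if tokens <= self._tokens:
                self._tokens -= tokens
                return True
            return False


now = [0.0]  # fake clock state so the demo is deterministic
bucket = SketchBucket(capacity=2, refill_per_second=1.0, clock=lambda: now[0])
burst = [bucket.try_acquire() for _ in range(3)]  # third spend finds the bucket empty
now[0] += 1.0                                     # one second passes: one token refills
after_refill = bucket.try_acquire()
```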

available property

available

Current token balance (mostly useful for diagnostics and tests).

try_acquire

try_acquire(tokens=1.0)

Spend tokens if available; return whether the spend succeeded.

acquire

acquire(tokens=1.0, *, timeout=None)

Block until tokens are available, or raise on timeout.

timeout is the maximum time to spend sleeping while waiting for refill; None waits indefinitely.
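The arithmetic behind a blocking acquire can be sketched as follows. Both helper functions are hypothetical, and the real implementation may sleep in smaller increments or handle oversized requests differently; treating a request larger than capacity as never satisfiable is an assumption that follows from the balance being capped at capacity.

```python
import math


def seconds_until_available(balance, tokens, capacity, refill_per_second):
    """Estimate how long a bucket must refill before `tokens` can be spent."""
    if tokens > capacity:
        # The balance never exceeds capacity, so this request can never succeed.
        return math.inf
    deficit = tokens - balance
    if deficit <= 0:
        return 0.0  # enough tokens already; no wait needed
    return deficit / refill_per_second


def would_time_out(wait, timeout):
    """True when the required wait exceeds the caller's timeout budget."""
    return timeout is not None and wait > timeout


# A bucket holding 2 tokens, asked for 5, refilling at 1.5 tokens per second.
wait = seconds_until_available(balance=2.0, tokens=5.0, capacity=10.0, refill_per_second=1.5)
raises_with_short_timeout = would_time_out(wait, timeout=1.0)
raises_without_timeout = would_time_out(wait, timeout=None)
```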

AsyncTokenBucket dataclass

AsyncTokenBucket(name, capacity, refill_per_second, clock=time.monotonic)

Async sibling of TokenBucket.

Uses asyncio.Lock so refill bookkeeping is coroutine-safe, and asyncio.sleep so waiting yields control to the event loop instead of pinning the thread. State is not shared with the sync variant, so pick one variant per downstream service.
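The following sketch shows why sleeping with asyncio.sleep matters: while one coroutine waits for refill, other tasks keep running. SketchAsyncBucket is a hypothetical re-implementation for illustration only, and its retry loop is an assumption about how waiting could be structured.

```python
import asyncio
import time


class SketchAsyncBucket:
    """Illustrative async token bucket; not the library's AsyncTokenBucket."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self._clock = clock
        self._tokens = self.capacity  # assumption: the bucket starts full
        self._last = clock()
        self._lock = asyncio.Lock()

    async def acquire(self, tokens=1.0):
        while True:
            async with self._lock:
                # Refill for elapsed time, then spend if the balance allows.
                now = self._clock()
                earned = (now - self._last) * self.refill_per_second
                self._tokens = min(self.capacity, self._tokens + earned)
                self._last = now
                if tokens <= self._tokens:
                    self._tokens -= tokens
                    return
                wait = (tokens - self._tokens) / self.refill_per_second
            # Sleep outside the lock so other coroutines can run meanwhile.
            await asyncio.sleep(wait)


async def main():
    bucket = SketchAsyncBucket(capacity=1, refill_per_second=20.0)
    events = []

    async def other_work():
        events.append("other work ran")

    await bucket.acquire()                    # drains the single token
    task = asyncio.create_task(other_work())  # scheduled, not yet run
    await bucket.acquire()                    # waits roughly 0.05 s for refill
    events.append("second acquire done")
    await task
    return events


events = asyncio.run(main())
```

The other task completes during the wait, which would not happen if the wait used a blocking time.sleep.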

RateLimiter dataclass

RateLimiter(buckets=dict())

Composite of named token buckets, one per dimension.

Typical LLM-provider shape: rpm for requests-per-minute, input_tpm for input tokens-per-minute, output_tpm for output tokens-per-minute. Each bucket is independent; an empty buckets mapping is a valid no-op limiter.

acquire_pre_call spends RPM up front. After the call returns and UsageSnapshot is known, acquire_usage spends the input + output token budgets. This split mirrors how providers actually enforce limits.

acquire_pre_call

acquire_pre_call(*, request_cost=1.0, timeout=None)

Block on the RPM bucket (buckets["rpm"]), if one is configured.

acquire_usage

acquire_usage(*, input_tokens, output_tokens, timeout=None)

Block on input/output TPM buckets after a turn completes.
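The two-phase flow described above can be sketched as below. RecordingBucket and SketchLimiter are hypothetical stand-ins written for this example; the bucket keys rpm, input_tpm, and output_tpm come from the description above, while skipping unconfigured buckets silently is an assumption consistent with the "empty mapping is a no-op" note.

```python
class RecordingBucket:
    """Hypothetical stand-in for a token bucket; it only tracks a balance."""

    def __init__(self, capacity):
        self.available = float(capacity)

    def acquire(self, tokens=1.0, *, timeout=None):
        if tokens > self.available:
            raise RuntimeError("sketch bucket is empty; a real bucket would wait")
        self.available -= tokens


class SketchLimiter:
    """Illustrative composite of named buckets, one per dimension."""

    def __init__(self, buckets=None):
        self.buckets = dict(buckets or {})

    def acquire_pre_call(self, *, request_cost=1.0, timeout=None):
        # Phase 1: charge the request itself before calling the provider.
        bucket = self.buckets.get("rpm")
        if bucket is not None:
            bucket.acquire(request_cost, timeout=timeout)

    def acquire_usage(self, *, input_tokens, output_tokens, timeout=None):
        # Phase 2: charge token usage once the provider has reported it.
        for key, cost in (("input_tpm", input_tokens), ("output_tpm", output_tokens)):
            bucket = self.buckets.get(key)
            if bucket is not None:
                bucket.acquire(cost, timeout=timeout)


limiter = SketchLimiter({
    "rpm": RecordingBucket(60),
    "input_tpm": RecordingBucket(10_000),
    # No "output_tpm" bucket configured: that dimension is simply not limited.
})
limiter.acquire_pre_call()
# ... the provider call would happen here and report its usage ...
limiter.acquire_usage(input_tokens=1200, output_tokens=300)

rpm_left = limiter.buckets["rpm"].available
input_left = limiter.buckets["input_tpm"].available

# An empty limiter accepts the same calls and does nothing.
SketchLimiter().acquire_pre_call()
SketchLimiter().acquire_usage(input_tokens=1, output_tokens=1)
```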

AsyncRateLimiter dataclass

AsyncRateLimiter(buckets=dict())

Async sibling of RateLimiter — same semantics, async buckets.

RateLimitExceededError

RateLimitExceededError(bucket_name, tokens)

Bases: Exception

Raised when an acquire call exceeds the bucket's wait budget.

Carries the bucket name and the cost the caller tried to spend, so the error message tells the caller which dimension blocked (input TPM vs RPM) and how big the request was. classify_exception maps this onto FailureScenario.LLM_ERROR (the scenario that covers rate-limit failures), so existing recovery recipes pick it up unchanged.
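A caller might use the carried fields as in this sketch. SketchRateLimitExceeded is a local, hypothetical class that mirrors the documented constructor arguments; the attribute names bucket_name and tokens and the message format are assumptions, since the real class body is not shown here.

```python
class SketchRateLimitExceeded(Exception):
    """Hypothetical mirror of RateLimitExceededError(bucket_name, tokens)."""

    def __init__(self, bucket_name, tokens):
        super().__init__(
            f"rate limit exceeded on bucket {bucket_name!r} (requested {tokens} tokens)"
        )
        # Assumed attribute names, taken from the constructor parameters.
        self.bucket_name = bucket_name
        self.tokens = tokens


def acquire_or_raise(bucket_name, available, tokens):
    """Toy acquire: raise immediately instead of waiting for refill."""
    if tokens > available:
        raise SketchRateLimitExceeded(bucket_name, tokens)
    return available - tokens


try:
    acquire_or_raise("input_tpm", available=500.0, tokens=1200.0)
except SketchRateLimitExceeded as exc:
    # The handler can tell which dimension blocked and how large the request was.
    blocked = (exc.bucket_name, exc.tokens)
    message = str(exc)
```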