Article · monitoring

Classifying LLM Failures Into 6 Reasons: Actionable Alerts, Observable Quotas, Trusted Zero-Quota Checks

A single errors counter tells you "something's wrong" but never whether your key broke, the provider is down, or it's just rate limiting. Deeppin labels LLM failures with one of six reasons, pairs them with a zero-quota key/catalog endpoint and timezone-aligned RPD resets, and turns "is this 429 noise or signal?" into a well-defined question.

2026-04-18 · 20 min read · monitoring · prometheus · smartrouter · observability

SmartRouter runs across 15 model slots spanning 5 free-tier providers, handling thousands of LLM calls a day. Some fraction always fails. The question is — when the failure rate climbs from 0.5% to 3%, is it worth waking up for? It depends on what's failing. Deeppin tags each LLM failure with a reason label in Prometheus, making "is this 429 noise or signal?" a query you can actually answer.

§ 1. What a bare counter can't tell you

Early on, LLM_FAILURES had three labels: provider, model, key_prefix. Each failure incremented the counter. The Grafana panel showed a single time series — "total failures in the last hour." The problem:

A rising number doesn't tell you whether to care — someone might have just saturated one key's RPD, and that's fine
Or maybe a key got revoked and every request is returning 401 — urgent, needs a new key
Or the provider is having a 5xx outage, in which case retrying is pointless
Or it's a transient network blip and retries work fine

Four scenarios, four completely different responses, all mashed into one counter. With a reason label, a "by reason" stacked bar chart immediately shows what kind of failure today's failures actually are.

§ 2. The 6 reasons

classify_llm_failure is a short function that maps any exception to one of six low-cardinality labels:

import asyncio


def classify_llm_failure(exc: BaseException) -> str:
    # 1. HTTP status code, when the exception carries one (most trustworthy).
    status = getattr(exc, "status_code", None)
    if status == 429:
        return "rate_limit"
    if status in (401, 403):
        return "auth"
    if isinstance(status, int) and 500 <= status < 600:
        return "server_error"

    # 2. Built-in exception types.
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "network"

    # 3. Fallback: substring match on the lowercased type name.
    type_name = type(exc).__name__.lower()
    if "timeout" in type_name:
        return "timeout"
    if "connect" in type_name:
        return "network"

    return "other"

rate_limit (429) — quota exhausted, expected, recovers automatically after seconds_until_recovery
auth (401/403) — key rejected, won't self-heal, needs human intervention
server_error (5xx) — provider-side outage, just wait
timeout — network-layer failure or model genuinely slow, try the next slot
network — DNS/TCP errors, local network issue
other — catch-all for anything unmatched

Classification priority: HTTP status code first (most trustworthy), then built-in exception types (TimeoutError, ConnectionError), then substring matching on the lowercased type name. The last layer catches things like httpx.ConnectError and httpx.ReadTimeout without needing to import every library's exception hierarchy. Anything that matches no layer, asyncio.CancelledError for example, falls through to other.
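To see the priority order in action, here is a minimal sketch that runs the classifier against stand-in exception classes. FakeAPIError, ConnectError, ReadTimeout, and TimeoutWith503 are hypothetical classes defined only for this example (they mimic the shape of SDK and httpx exceptions without importing either). The classifier body is repeated from above so the block runs on its own.

```python
import asyncio


def classify_llm_failure(exc: BaseException) -> str:
    # Same logic as the function above, repeated so this block is self-contained.
    status = getattr(exc, "status_code", None)
    if status == 429:
        return "rate_limit"
    if status in (401, 403):
        return "auth"
    if isinstance(status, int) and 500 <= status < 600:
        return "server_error"
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "network"
    type_name = type(exc).__name__.lower()
    if "timeout" in type_name:
        return "timeout"
    if "connect" in type_name:
        return "network"
    return "other"


class FakeAPIError(Exception):
    """Hypothetical SDK-style error that carries an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


class ConnectError(Exception):
    """Hypothetical stand-in named like httpx.ConnectError, no builtin base."""


class ReadTimeout(Exception):
    """Hypothetical stand-in named like httpx.ReadTimeout, no builtin base."""


class TimeoutWith503(TimeoutError):
    """Hypothetical: a timeout type that also carries a status code."""

    status_code = 503


samples = [
    FakeAPIError(429), FakeAPIError(403), FakeAPIError(502),  # layer 1: status
    TimeoutError(), ConnectionResetError(),                   # layer 2: builtins
    ConnectError(), ReadTimeout(),                            # layer 3: type name
    TimeoutWith503(),                                         # status beats type
    ValueError("bad json"),                                   # nothing matches
]
labels = [classify_llm_failure(e) for e in samples]
for exc, label in zip(samples, labels):
    print(f"{type(exc).__name__:22} -> {label}")
```

The TimeoutWith503 case is the one worth noticing: it is a TimeoutError by type, but the status code is checked first, so it lands in server_error.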

§ 3. Bounded cardinality

Prometheus cardinality explosions are a classic foot-gun: if the reason field could be any string (say, the exception message), each unique error spawns a new time series and memory grows linearly with variety.

_VALID_REASONS = frozenset({
    "rate_limit", "server_error", "auth",
    "timeout", "network", "other",
})

def record_llm_failure(*, provider, model, key_prefix, reason) -> None:
    if reason not in _VALID_REASONS:
        reason = "other"
    LLM_FAILURES.labels(provider, model, key_prefix, reason).inc()

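record_llm_failure enforces the allowlist at the entry point: anything outside the fixed set collapses to other. Total cardinality is at most providers × models × keys × 6, and in practice only the (provider, model, key) combinations that actually exist produce series, which keeps it in the low hundreds. Bounded.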
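A small sketch of why the allowlist bounds the series count. It swaps the real Prometheus counter for a collections.Counter keyed by label tuple (an assumption made only so the example runs without a metrics library), and the label values "example-model" and "gsk_ab" are made up for illustration.

```python
from collections import Counter

VALID_REASONS = frozenset({
    "rate_limit", "server_error", "auth",
    "timeout", "network", "other",
})

# Stand-in for the Prometheus counter: one dict key per label combination,
# mirroring how each distinct label combination becomes its own time series.
series = Counter()


def record_llm_failure(*, provider: str, model: str, key_prefix: str, reason: str) -> None:
    if reason not in VALID_REASONS:
        reason = "other"  # unknown strings collapse instead of creating a series
    series[(provider, model, key_prefix, reason)] += 1


# 1,000 distinct free-form "reasons" for a single (provider, model, key) slot,
# the kind of input you would get by passing raw exception messages.
for i in range(1000):
    record_llm_failure(provider="groq", model="example-model",
                       key_prefix="gsk_ab", reason=f"unexpected error #{i}")
record_llm_failure(provider="groq", model="example-model",
                   key_prefix="gsk_ab", reason="rate_limit")

reasons_seen = sorted({labels[3] for labels in series})
print(len(series), reasons_seen)
```

Without the allowlist the same loop would have produced 1,001 series for one slot. With it, a slot can never contribute more than six.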

§ 4. Companion: zero-quota key validation

reason=auth means a key is bad — but waiting for production traffic to hit a 401 is too late. CI wants to catch revoked keys before deploy. The intuitive approach is to send each key a minimal LLM request for liveness; problem is, that eats quota. CI running hourly chews through hundreds of RPD per day.

GET /health/providers/keys is the zero-quota alternative. Every provider exposes a GET /v1/models endpoint (OpenAI-compatible) or equivalent (Gemini's ?key= query). It returns the available model catalog, consumes no LLM quota, and validates the key. As a bonus, we diff the configured ALL_MODELS against the returned catalog.

_MODELS_ENDPOINTS = {
    "groq":       ("https://api.groq.com/openai/v1/models",           "bearer"),
    "cerebras":   ("https://api.cerebras.ai/v1/models",               "bearer"),
    "sambanova":  ("https://api.sambanova.ai/v1/models",              "bearer"),
    "openrouter": ("https://openrouter.ai/api/v1/models",             "bearer"),
    "gemini":     ("https://generativelanguage.googleapis.com/v1beta/models", "query"),
}

# For each (provider, key) pair, fetch /models once:
#   401/403 → key_valid=False
#   200 → compute configured_model_ids - available as missing_models
# Returns { total, ok, failed, results: [...] }

Gemini isn't OpenAI-compatible: auth goes in ?key= and the schema is {"models": [{"name": "models/..."}]}. _extract_model_ids branches per provider. The missing_models field catches silent upstream catalog drift — over the past few months we've seen SambaNova bulk-retire models and Cerebras yank gpt-oss from the free tier, and this endpoint caught each drift in the daily CI the morning after.
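The per-key check described above can be sketched as follows. This is not the actual endpoint code: build_request, extract_model_ids, check_key, and the injected fetch callable are hypothetical names, the handling of non-200/401/403 statuses is an assumption, and so is stripping the "models/" prefix from Gemini names before diffing. The HTTP call is injected so the logic can be exercised without touching the network.

```python
from typing import Callable
from urllib.parse import urlencode

# fetch(url, headers) -> (status_code, parsed_json_body); supplied by the caller.
Fetch = Callable[[str, dict[str, str]], tuple[int, dict]]


def build_request(url: str, auth_style: str, key: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for a catalog request in the given auth style."""
    if auth_style == "bearer":
        return url, {"Authorization": f"Bearer {key}"}
    if auth_style == "query":
        return f"{url}?{urlencode({'key': key})}", {}
    raise ValueError(f"unknown auth style: {auth_style}")


def extract_model_ids(provider: str, payload: dict) -> set[str]:
    """Pull model IDs out of a catalog response, branching per provider."""
    if provider == "gemini":
        # {"models": [{"name": "models/<id>"}]}; prefix stripped (assumption).
        return {m["name"].removeprefix("models/") for m in payload.get("models", [])}
    # OpenAI-compatible list response: {"data": [{"id": "<id>"}]}
    return {m["id"] for m in payload.get("data", [])}


def check_key(provider: str, url: str, auth_style: str, key: str,
              configured: set[str], fetch: Fetch) -> dict:
    """Validate one (provider, key) pair without spending LLM quota."""
    req_url, headers = build_request(url, auth_style, key)
    status, payload = fetch(req_url, headers)
    if status in (401, 403):
        return {"provider": provider, "key_valid": False, "missing_models": []}
    if status != 200:
        # Assumption: other statuses are reported as unknown, not as a bad key.
        return {"provider": provider, "key_valid": None, "missing_models": []}
    available = extract_model_ids(provider, payload)
    return {"provider": provider, "key_valid": True,
            "missing_models": sorted(configured - available)}


def fake_fetch(url: str, headers: dict[str, str]) -> tuple[int, dict]:
    """Canned responses standing in for the real providers."""
    if "generativelanguage" in url:
        return 200, {"models": [{"name": "models/model-a"}]}
    if headers.get("Authorization") == "Bearer revoked":
        return 401, {}
    return 200, {"data": [{"id": "model-a"}, {"id": "model-b"}]}


GROQ = "https://api.groq.com/openai/v1/models"
GEMINI = "https://generativelanguage.googleapis.com/v1beta/models"

ok = check_key("groq", GROQ, "bearer", "good", {"model-a", "model-b"}, fake_fetch)
revoked = check_key("groq", GROQ, "bearer", "revoked", {"model-a"}, fake_fetch)
drifted = check_key("gemini", GEMINI, "query", "k1", {"model-a", "model-c"}, fake_fetch)
print(ok, revoked, drifted, sep="\n")
```

Injecting fetch keeps the classification and diff logic testable in CI without real keys; a production version would pass a thin wrapper around whatever HTTP client the service already uses.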

§ 5. Companion: timezone-aligned RPD reset

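Part of the noise in reason=rate_limit comes from getting "when does the quota reset" wrong. Gemini resets RPD at 00:00 Pacific Time, not UTC. If the local counter rolls over at UTC midnight, it zeroes at 5pm PT (4pm in winter) and reports "quota healthy", while the provider's real reset is still hours away: PT midnight is 07:00 UTC during daylight time and 08:00 UTC during standard time. For those 7 to 8 hours a slot can look fresh while the provider keeps returning 429. For the remaining 16 to 17 hours of the UTC day the error runs the other way: the local counter still carries requests the provider has already wiped. Either way, slot state mismatches reality.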

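from dataclasses import dataclass
from datetime import datetime
from zoneinfo import ZoneInfo


@dataclass
class ModelSpec:
    ...
    reset_tz: str = "UTC"     # Groq/Cerebras/SambaNova/OpenRouter
    # Gemini: reset_tz="America/Los_Angeles"


class UsageBucket:
    def __init__(self, spec: ModelSpec):
        self._spec = spec  # kept so _maybe_reset can read reset_tz
        self.rpd_used = 0
        self.tpd_used = 0
        self._day_date = datetime.now(ZoneInfo(spec.reset_tz)).date()

    def _maybe_reset(self):
        # Compare calendar dates in the provider's reset timezone.
        today = datetime.now(ZoneInfo(self._spec.reset_tz)).date()
        if today != self._day_date:
            self.rpd_used = 0
            self.tpd_used = 0
            self._day_date = today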

Key detail: store a date object, not a timestamp. Date comparisons are immune to DST transitions and UTC-offset weirdness — today is today, yesterday is yesterday, regardless of clock jumps. Timestamp comparisons on DST days can be off by an hour, which is how "reset didn't happen when it should have" bugs sneak in.
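The date comparison can be checked against fixed instants. The sketch below uses a hypothetical quota_day helper and a simplified Bucket class that takes the current time as an argument instead of calling datetime.now(), an assumption made so the behavior is reproducible. The instants fall in April 2026, when Pacific Time is UTC-7.

```python
from datetime import date, datetime, timezone
from zoneinfo import ZoneInfo


def quota_day(now_utc: datetime, reset_tz: str) -> date:
    """Calendar date of `now_utc` as seen in the provider's reset timezone."""
    return now_utc.astimezone(ZoneInfo(reset_tz)).date()


class Bucket:
    """Simplified usage bucket with an injectable clock (hypothetical)."""

    def __init__(self, reset_tz: str, now_utc: datetime):
        self.reset_tz = reset_tz
        self.rpd_used = 0
        self._day_date = quota_day(now_utc, reset_tz)

    def maybe_reset(self, now_utc: datetime) -> bool:
        """Zero the counter if the date rolled over in reset_tz; report whether it did."""
        today = quota_day(now_utc, self.reset_tz)
        if today == self._day_date:
            return False
        self.rpd_used = 0
        self._day_date = today
        return True


def utc(day: int, hour: int, minute: int = 0) -> datetime:
    return datetime(2026, 4, day, hour, minute, tzinfo=timezone.utc)


start = utc(17, 20)  # 13:00 in Los Angeles on April 17
utc_bucket = Bucket("UTC", start)
la_bucket = Bucket("America/Los_Angeles", start)
utc_bucket.rpd_used = la_bucket.rpd_used = 50

# 03:00 UTC on April 18 is still 20:00 on April 17 in Los Angeles.
utc_reset_early = utc_bucket.maybe_reset(utc(18, 3))
la_reset_early = la_bucket.maybe_reset(utc(18, 3))

# One minute before and exactly at Pacific midnight (07:00 UTC in April).
la_reset_0659 = la_bucket.maybe_reset(utc(18, 6, 59))
la_reset_0700 = la_bucket.maybe_reset(utc(18, 7))

print(utc_reset_early, la_reset_early, la_reset_0659, la_reset_0700)
```

The UTC-aligned bucket clears itself four hours before the provider does, which is exactly the window where a slot looks healthy and still returns 429.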

§ 6. Translating to Grafana

A few PromQL queries for by-reason panels:

# Failure rate per reason, last 5 minutes
sum by (reason) (rate(deeppin_llm_failures_total[5m]))

# Auth failures per key over the last 24 hours (catches revoked keys)
sum by (key_prefix) (
  increase(deeppin_llm_failures_total{reason="auth"}[24h])
)

# 429 ratio: rate_limit as fraction of all failures
sum(rate(deeppin_llm_failures_total{reason="rate_limit"}[15m]))
  / sum(rate(deeppin_llm_failures_total[15m]))

Alertmanager isn't wired yet, but the planned rules are: auth failures > 3 in 5 minutes → page (the system can't fix this on its own; it needs a human); rate_limit → never page (expected); server_error sustained > 10% for 15 minutes → page (provider is down).
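Until the rules live in Alertmanager, the thresholds can be pinned down as a pure function and unit-tested. This is only a sketch: should_page and its parameters are hypothetical, "sustained" is interpreted as every ratio sample in the 15-minute window exceeding 10%, and the denominator of that ratio is left to the caller because the rules above don't specify it.

```python
def should_page(reason: str, *, count_5m: int = 0,
                ratio_samples_15m: tuple[float, ...] = ()) -> bool:
    """Planned paging policy as a pure function (thresholds from the rules above)."""
    if reason == "auth":
        # More than 3 auth failures in 5 minutes: a key is likely revoked.
        return count_5m > 3
    if reason == "server_error":
        # "Sustained" read as: every sample in the window is above 10%.
        return bool(ratio_samples_15m) and all(r > 0.10 for r in ratio_samples_15m)
    # rate_limit never pages; the other reasons have no planned rule yet.
    return False


decisions = {
    "auth_4": should_page("auth", count_5m=4),
    "auth_3": should_page("auth", count_5m=3),
    "rate_limit_flood": should_page("rate_limit", count_5m=10_000),
    "5xx_sustained": should_page("server_error", ratio_samples_15m=(0.20, 0.15, 0.12)),
    "5xx_dip": should_page("server_error", ratio_samples_15m=(0.20, 0.05, 0.20)),
}
print(decisions)
```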

§ 7. Why this abstraction is worth its weight

SmartRouter is designed so that no single slot failure affects availability: when one slot is exhausted it picks another, and when one provider is down it falls back to the next. The operational side effect is that failures get absorbed invisibly inside the router. The reason label pulls those hidden failures back into view. Each one consumes retry budget, and enough of them compound into higher latency, saturated RPM, or, in the worst case, every slot exhausted at once (where even the fallback chain can't save you).

The short version: self-healing systems need more monitoring, not less. Otherwise the day they stop self-healing, you have no idea why.