Article · monitoring

Monitoring Observability SOP: Which Panel to Open First

Building the dashboards was the easy part. Knowing when to look and what to look at is harder. This post lays out a 5-layer metric hierarchy and 4 incident scenarios, each with the exact PromQL and panel to open first.

2026-04-17 · 18 min read · monitoring · sop · operations

The previous post covered how Prometheus + Grafana got wired into Deeppin. This one is about a more practical problem: when things actually go wrong, don't stare at the dashboard scanning panel-by-panel. Have an order, a hierarchy, an SOP. Otherwise you burn attention and still miss the root cause.

§ 01. Metric hierarchy: order by what matters when things break

Metrics should be organized by "what do I care about first during an incident", not by module. Five layers:

L0  Availability     Is the backend alive?        → up / 5xx rate
L1  User experience  How long are users waiting?  → p95 / p99 latency
L2  Dependency       LLM / DB / search OK?        → component error rate + latency
L3  Capacity         How much runway is left?     → LLM slot usage / limit
L4  Cost trend       Is token burn reasonable?    → tokens_total long-term

Principle: if L0 is down, don't bother with L1-L4. Only go deeper once the layer above is green. L0 through L3 are watched in real time; L4 is reviewed weekly or monthly.
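The "only go deeper once the layer above is green" rule can be sketched as a small function. This is an illustration of the triage order, not code from Deeppin; the `green` mapping is a hypothetical input you would fill in by hand or from your own checks.

```python
from typing import Dict, Optional

# Layers in triage order, most urgent first.
LAYERS = ["L0", "L1", "L2", "L3", "L4"]


def first_failing_layer(green: Dict[str, bool]) -> Optional[str]:
    """Return the first layer that is not green, or None if all are green.

    A layer missing from `green` counts as not green: unknown means
    "investigate here before going deeper".
    """
    for layer in LAYERS:
        if not green.get(layer, False):
            return layer
    return None


# L0 is fine, L1 and L2 are both red: start at L1, ignore L2 for now.
print(first_failing_layer(
    {"L0": True, "L1": False, "L2": False, "L3": True, "L4": True}
))
```

The point of the design is that lower layers are never consulted while a higher one is red, which is exactly what keeps attention from being spread across every panel at once.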

§ 02. Key metrics at each layer

L0 — Availability

up{job="deeppin-backend"} — 0 means the process is dead. Alert: == 0 for 2m
5xx ratio: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])); sustained > 1% is a warning
4xx surge: usually a client/frontend bug, not a system issue — but a sudden spike deserves a look

Dashboard location: Overview row → Error Rate (5xx) panel.
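If you want the same 5xx check outside Grafana (from a script or a cron job), a minimal sketch against the Prometheus HTTP API could look like this. The query string is the one above; the base URL you pass in is an assumption about your own setup, and the 1% threshold is the warning level stated above.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

FIVE_XX_RATIO = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)


def parse_single_value(payload: dict) -> Optional[float]:
    """Extract the value from an instant-query response, if there is one."""
    if payload.get("status") != "success":
        return None
    result = payload.get("data", {}).get("result", [])
    if not result:
        return None  # empty vector, e.g. no 5xx series exist yet
    # Each sample looks like {"metric": {...}, "value": [timestamp, "0.012"]}
    return float(result[0]["value"][1])


def fetch_5xx_ratio(base_url: str) -> Optional[float]:
    """Run the instant query against a Prometheus server at base_url."""
    query = urllib.parse.urlencode({"query": FIVE_XX_RATIO})
    url = base_url.rstrip("/") + "/api/v1/query?" + query
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_single_value(json.load(resp))


def is_warning(ratio: Optional[float], threshold: float = 0.01) -> bool:
    """True when the ratio is known and above the warning threshold."""
    return ratio is not None and ratio > threshold
```

Usage would be something like `is_warning(fetch_5xx_ratio("http://localhost:9090"))`, where the address is only a placeholder. Note that a single instant query cannot tell you whether the ratio is "sustained"; that still needs either the graph or an alert rule with a hold time.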

L1 — User experience

Global p95: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
Per-handler p95: add the handler label to locate the specific endpoint
In-flight requests: http_requests_in_progress — persistent climb means we can't drain as fast as we ingest
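It helps to remember that the p95 above is an estimate: histogram_quantile finds the bucket containing the target rank and interpolates linearly inside it. The sketch below mirrors that idea in plain Python under simplifying assumptions (it only accepts 0 < q <= 1 and treats the lowest bucket as starting at 0); the bucket boundaries and counts in the example are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls past the last finite bucket: the best we can
                # report is the highest finite upper bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# 100 requests: 50 under 0.1s, 90 under 0.5s, 98 under 1s (invented numbers).
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))  # lands between 0.5 and 1.0
print(histogram_quantile(0.99, example))  # capped at the last finite bound
```

The practical consequence: the reported p95 can never be more precise than the bucket layout, and anything slower than the largest finite bucket is reported as that bucket's bound.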

One trap: SSE endpoints (/api/threads/:id/chat, /api/search) are long-lived streams. Their p95 is naturally high — that's the wall-clock time for the whole stream, not "how long before the user saw the first token". For SSE, time-to-first-byte (TTFB) is what you actually want. We haven't instrumented it yet — it's on the backlog.
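Since TTFB is not instrumented yet, here is one possible shape for it, as a sketch rather than a description of Deeppin's code: wrap the stream and report the delay until the first chunk. The `record` callback is a stand-in; in a real setup it could be a histogram's observe method. The injectable `clock` exists only to make the example deterministic.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def observe_ttfb(
    stream: Iterable[T],
    record: Callable[[float], None],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[T]:
    """Pass chunks through unchanged; report seconds until the first one.

    The timer starts when this wrapper is created (roughly when the request
    is handled), not when the consumer starts iterating.
    """
    start = clock()

    def _passthrough() -> Iterator[T]:
        first = True
        for chunk in stream:
            if first:
                record(clock() - start)
                first = False
            yield chunk

    return _passthrough()


# Deterministic demo with a fake clock: wrapper created at t=10.0,
# first chunk produced at t=12.5.
ticks = iter([10.0, 12.5])
observed = []
chunks = list(observe_ttfb(iter(["data: a", "data: b"]), observed.append, lambda: next(ticks)))
print(chunks, observed)
```

If the stream never yields anything, `record` is never called, so an empty stream shows up as a missing observation rather than a fake zero.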

L2 — Dependency health

Three dependencies, some external and some local. Check both error rate and p95 latency for each:

Supabase:  rate(deeppin_supabase_calls_total{result="error"}[5m])
           < 0.01/s normal, > 0.05/s Supabase side likely in trouble
SearXNG:   rate(deeppin_searxng_calls_total{result=~"timeout|error"}[5m])
           occasional timeouts are daily life for meta-search; persistent high = check container logs
Embedding: rate(deeppin_embedding_calls_total{result="error"}[5m])
           bge-m3 is local inference — if it errors, usually container OOM
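Per-second rates are hard to picture, so it can help to convert the Supabase thresholds into error counts over the 5-minute window. A small sketch; the "watch" label for the band between the two thresholds is our own assumption, since only the two ends are defined above.

```python
WINDOW_SECONDS = 5 * 60  # matches the [5m] range used in the queries above


def errors_in_window(rate_per_second: float, window: int = WINDOW_SECONDS) -> float:
    """Convert a per-second rate into an approximate count over the window."""
    return rate_per_second * window


def classify_supabase(rate_per_second: float) -> str:
    """Apply the two Supabase thresholds from the text."""
    if rate_per_second < 0.01:
        return "normal"
    if rate_per_second > 0.05:
        return "Supabase side likely in trouble"
    return "watch"  # assumption: the middle band is not named in the text


# 0.01/s is about 3 errors per 5 minutes; 0.05/s is about 15.
print(errors_in_window(0.01), errors_in_window(0.05))
```

Seen this way, the "normal" band means fewer than roughly three failed calls in any five-minute window.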

L3 — LLM capacity

This is Deeppin's unusual bit: we stack free tiers across multiple providers, so slot-level watermarks actually matter.

deeppin_llm_slot_score: 0 = exhausted (rate limited), 1 = fresh. Any slot stuck at 0 is worth a look
rpd_used / rpd_limit: today's request usage ratio — watch when > 80%
tpd_used / tpd_limit: today's token usage ratio
rate(deeppin_llm_failures_total[5m]): group by provider — a wide spike means upstream is having a bad day

Common PromQL:

# Remaining capacity per group (> 0.3 = still has room)
avg by (group) (deeppin_llm_slot_score)

# Groups where every slot is exhausted (urgent): even the best slot scores 0
max by (group) (deeppin_llm_slot_score) == 0

# Top 10 slots by daily RPD usage
topk(10, deeppin_llm_rpd_used / deeppin_llm_rpd_limit)
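The same capacity checks, written out in Python to make the logic explicit. This is a sketch over hand-made data, not Deeppin's router code: the `Slot` fields and the slot names are hypothetical, and the 80% threshold is the "watch" level from the list above. A group counts as fully exhausted only when every slot in it scores 0, which is the same as saying its best score is 0.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Slot:
    name: str        # hypothetical slot identifier
    group: str
    score: float     # 0 = exhausted (rate limited), 1 = fresh
    rpd_used: int
    rpd_limit: int


def hot_slots(slots: List[Slot], threshold: float = 0.8) -> List[str]:
    """Slots whose daily request usage ratio is above the threshold."""
    return [
        s.name for s in slots
        if s.rpd_limit > 0 and s.rpd_used / s.rpd_limit > threshold
    ]


def exhausted_groups(slots: List[Slot]) -> List[str]:
    """Groups in which every slot scores 0 (best score in the group is 0)."""
    best: Dict[str, float] = {}
    for s in slots:
        best[s.group] = max(best.get(s.group, 0.0), s.score)
    return sorted(group for group, top in best.items() if top == 0)


slots = [
    Slot("chat-a", "chat", 0.0, 950, 1000),
    Slot("chat-b", "chat", 0.0, 1000, 1000),
    Slot("sum-a", "summarizer", 0.0, 400, 1000),
    Slot("sum-b", "summarizer", 0.7, 100, 1000),
]
print(hot_slots(slots))         # the two chat slots
print(exhausted_groups(slots))  # only "chat": summarizer still has sum-b
```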

L4 — Cost trend

No realtime alerts; check weekly or monthly:

# Tokens burned per provider per day
sum by (provider) (increase(deeppin_llm_tokens_total[24h]))

# Calls per provider per day
sum by (provider) (increase(deeppin_llm_calls_total[24h]))
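One reason increase() is the right tool for daily totals: counters reset to zero whenever the backend container is recreated, and increase() accounts for that. The sketch below shows the reset handling only; it ignores the extrapolation Prometheus applies at the window edges, and the sample values are invented.

```python
from typing import List


def counter_increase(samples: List[float]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter restarted from zero: everything in `cur` is new.
            total += cur
    return total


# Counter climbs 100 -> 150, the process restarts, then climbs 20 -> 60.
print(counter_increase([100, 150, 20, 60]))  # 50 + 20 + 40
```

A naive "last minus first" would report a negative number for the same day.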

§ 03. SOPs for four real scenarios

Scenario A — User reports "the site is down"

1. Hit the /health endpoint, check the ok state
   → 502 / timeout = frontend-to-backend link is dead
   → 200 but components show red = backend alive, a dependency is down

2. SSH into the production host, inspect container state (docker ps)
3. Tail the backend container logs (docker logs --tail 200)
4. If the process died: docker compose up -d backend
5. Go back to Grafana Overview, confirm Error Rate drops

Budget: recover or escalate within 5 minutes.
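Step 1 of this scenario is a small decision table, sketched below. The response shape is an assumption made for illustration (a `components` mapping of name to boolean); the actual /health payload may differ, so adapt the field names before relying on it.

```python
from typing import Mapping, Optional


def triage_health(status_code: Optional[int], body: Optional[Mapping] = None) -> str:
    """Interpret a /health probe.

    status_code is None when the request timed out or could not connect.
    `body` is the parsed JSON response; its shape here is hypothetical.
    """
    if status_code is None or status_code == 502:
        return "frontend-to-backend link is dead"
    if status_code == 200:
        components = (body or {}).get("components", {})
        red = sorted(name for name, ok in components.items() if not ok)
        if red:
            return "backend alive, dependency down: " + ", ".join(red)
        return "healthy"
    return f"unexpected status {status_code}, check backend logs"


print(triage_health(502))
print(triage_health(200, {"ok": False, "components": {"supabase": False, "searxng": True}}))
```

The first branch sends you to steps 2 to 4 (host and container state); the second sends you to the dependency panels instead.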

Scenario B — User reports "chat is hanging forever"

1. Grafana → HTTP row → P95 Latency by Handler
   - /api/threads/*/chat slow: not surprising (SSE long-lived); look at LLM
   - Non-SSE endpoint slow: check Supabase p95

2. LLM Slots row → slot_score heatmap
   - Wide zeros? The group is exhausted, waiting on backoff
   - Individual zeros? SmartRouter should auto-shift slots; users shouldn't notice

3. Check LLM Failures/s by Reason — the spiking provider is the culprit

4. Still lost → hit /health/providers/keys for a zero-quota validation

Scenario C — Daily walkthrough (5 minutes in the morning)

Yesterday's daily provider-check workflow result (GitHub Actions)
Grafana Overview: any error spikes in the last 24h?
LLM Slots: rpd_used/rpd_limit heatmap — any slot maxed out several days running? Add a key or rebalance groups
Supabase Calls/s by Table: any table suddenly seeing 10x the usual traffic?

Scenario D — Quota emergency (every chat slot exhausted)

1. Check deeppin_llm_slot_recovery_seconds → when's the earliest slot coming back
2. Look at the fallback chain: chat falls back to summarizer; check its remaining capacity
3. Short-term fix: add a key for the affected provider
   - Edit xxx_API_KEYS in compose.env
   - docker compose up -d --force-recreate backend
     (restart does NOT reload env — you MUST --force-recreate)
4. Long-term fix: add a new provider to ModelSpec

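One gotcha: a container's environment is fixed when the container is created, and Python reads it into memory once at process start and never re-reads it. A restart reuses the existing container with its old environment, so adding a key requires recreating the container, not just restarting it. This is documented in CLAUDE.md specifically because we burned ourselves on it.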
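The "read once at process start" half of this is easy to reproduce. A minimal sketch: the variable name EXAMPLE_API_KEYS and the comma-separated format are assumptions for illustration, not the real Deeppin configuration.

```python
import os
from typing import List


def load_keys(var_name: str) -> List[str]:
    """Parse a comma-separated key list from the environment."""
    raw = os.environ.get(var_name, "")
    return [key.strip() for key in raw.split(",") if key.strip()]


# Simulate process start: the environment is read once into a module constant.
os.environ["EXAMPLE_API_KEYS"] = "key-a,key-b"
KEYS = load_keys("EXAMPLE_API_KEYS")

# Someone adds a key later. The snapshot taken at startup does not change.
os.environ["EXAMPLE_API_KEYS"] = "key-a,key-b,key-c"
print(KEYS)                           # still the two original keys
print(load_keys("EXAMPLE_API_KEYS"))  # a fresh read sees all three
```

In production the "fresh read" only happens in a new process that was started with the new environment, which is what recreating the container gives you.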

§ 04. Dashboard panel cheatsheet

Incident type            → which row to open first
────────────────────────────────────────────────
Site-wide 5xx            → Overview
Specific endpoint slow   → HTTP → P95 by Handler
AI replies wrong/failing → LLM Slots → Failures/s + slot_score
Search broken            → Components → SearXNG Calls/s by Status
Login / history failing  → Components → Supabase
Attachment upload broken → Components → Embedding

§ 05. Current blind spots

Re-ranked by operational priority (the previous post covered them; this is the priority order):

🔴 No Alertmanager: incidents are discovered either by users or by manual polling. The bare minimum — up==0 for 2m → Telegram bot — should land first
🟡 SSE TTFB not instrumented: the metric users feel most ("AI is slow") is the one we can't see
🟡 No tracing: for a slow request, we can't tell whether context build, LLM call, or Supabase write is to blame
🟢 Logs not in Loki: tail -f app.log is fine at current volume; revisit when traffic grows
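For the first item, the important part of "up==0 for 2m" is the hold time: a single failed scrape should not page anyone. The sketch below models that behaviour in plain Python so the semantics are clear before any alerting stack is wired in; it is not an Alertmanager configuration, and the 30-second sample spacing is only an example.

```python
from typing import Iterable, List, Optional, Tuple


def firing_times(
    samples: Iterable[Tuple[float, float]],
    hold_seconds: float = 120.0,
) -> List[float]:
    """Timestamps at which `up == 0` has held continuously for hold_seconds.

    `samples` are (timestamp_seconds, up_value) pairs in time order.
    Each outage fires at most once; recovery resets the timer.
    """
    down_since: Optional[float] = None
    fired = False
    fire_at: List[float] = []
    for ts, up in samples:
        if up == 0:
            if down_since is None:
                down_since = ts  # condition just became true: pending
            if not fired and ts - down_since >= hold_seconds:
                fire_at.append(ts)
                fired = True
        else:
            down_since, fired = None, False
    return fire_at


# A 30-second blip (t=30..60) is ignored; the outage starting at t=120
# fires once it has lasted two minutes, at t=240.
samples = [(0, 1), (30, 0), (60, 0), (90, 1), (120, 0),
           (150, 0), (180, 0), (210, 0), (240, 0)]
print(firing_times(samples))
```

Whatever delivers the notification (the Telegram bot mentioned above, for instance) would hang off the moment a timestamp is appended.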

§ 06. Closing

Monitoring has a characteristic ROI curve: all investment upfront (instrumentation, stack setup, dashboard tuning) with no direct value for a long time. Then one day there's an incident and it saves you. After that, every subsequent incident compounds the value.

That's what SOPs are for: they turn "monitoring can save you" from "if you happen to remember where to look that day" into "follow the procedure, locate the issue in three minutes". Dashboards are the tool; the SOP is the muscle memory.