Monitoring Observability SOP: Which Panel to Open First
Building the dashboards was the easy part. Knowing when to look and what to look at is harder. This post lays out a 5-layer metric hierarchy and 4 incident scenarios, each with the exact PromQL and panel to open first.
The previous post covered how Prometheus + Grafana got wired into Deeppin. This one is about a more practical problem: when things actually go wrong, don't stare at the dashboard scanning panel-by-panel. Have an order, a hierarchy, an SOP. Otherwise you burn attention and still miss the root cause.
§ 01. Metric hierarchy: order by what matters when things break
Metrics should be organized by "what do I care about first during an incident", not by module. Five layers:
L0  Availability     Is backend alive?            → up / 5xx rate
L1  User experience  How long are users waiting?  → p95 / p99 latency
L2  Dependency       LLM / DB / search OK?        → component error rate + latency
L3  Capacity         How much runway is left?     → LLM slot usage / limit
L4  Cost trend       Is token burn reasonable?    → tokens_total long-term
Principle: if L0 is down, don't bother with L1-L4. Only go deeper once the layer above is green. The top three are real-time; L4 is weekly/monthly.
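The top-down rule can be sketched as a small triage helper. This is only an illustration, assuming each layer has already been reduced to a green / not-green boolean; the layer names come from the hierarchy above, while `first_layer_to_inspect` and the shape of its input are hypothetical.

```python
from typing import Optional

# Layer order from the hierarchy: most urgent first.
LAYERS = [
    "L0 Availability",
    "L1 User experience",
    "L2 Dependency",
    "L3 Capacity",
    "L4 Cost trend",
]


def first_layer_to_inspect(green: dict[str, bool]) -> Optional[str]:
    """Return the first layer that is not green, walking top-down.

    Layers missing from `green` count as not green, so an unknown
    state gets investigated rather than skipped.
    """
    for layer in LAYERS:
        if not green.get(layer, False):
            return layer
    return None  # every layer is green


# Hypothetical state: L2 and L3 are both red, so L2 is inspected first.
status = {layer: True for layer in LAYERS}
status["L2 Dependency"] = False
status["L3 Capacity"] = False
print(first_layer_to_inspect(status))  # L2 Dependency
```

Treating a missing layer as red is a deliberate choice in this sketch: during an incident, "no data" is usually a reason to look, not a reason to move on.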
§ 02. Key metrics at each layer
L0 — Availability
- up{job="deeppin-backend"} — 0 means the process is dead. Alert: == 0 for 2m
- 5xx ratio: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])); sustained > 1% is a warning
- 4xx surge: usually a client/frontend bug, not a system issue — but a sudden spike deserves a look
Dashboard location: Overview row → Error Rate (5xx) panel.
L1 — User experience
- Global p95: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
- Per-handler p95: add the handler label to locate the specific endpoint
- In-flight requests: http_requests_in_progress — persistent climb means we can't drain as fast as we ingest
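When reading the p95 panels it helps to remember that `histogram_quantile` estimates the quantile by linear interpolation inside a bucket; it never sees individual request durations. The sketch below is a simplified Python re-implementation of that idea, not Prometheus's own code, and the bucket values in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs, like the
    `le` label and value of `_bucket` series. It must contain the +Inf
    bucket and at least one finite bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev_count = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation inside the matching bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical bucket data: 100 requests, p95 lands in the 0.5s to 1.0s bucket.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

Two practical consequences: the result is only as precise as the bucket boundaries, and once the quantile falls into the +Inf bucket the panel flattens out at the highest finite boundary no matter how slow requests really are.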
One trap: SSE endpoints (/api/threads/:id/chat, /api/search) are long-lived streams. Their p95 is naturally high — that's the wall-clock time for the whole stream, not "how long before the user saw the first token". For SSE, time-to-first-byte (TTFB) is what you actually want. We haven't instrumented it yet — it's on the backlog.
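One way TTFB could eventually be captured is to wrap the stream and time only the first chunk. The following is a minimal sketch for a synchronous iterator, not Deeppin's implementation; `observe_ttfb` and its `record` callback are hypothetical names, and a real async SSE handler would need the equivalent async generator.

```python
import time
from typing import Callable, Iterable, Iterator


def observe_ttfb(
    chunks: Iterable[bytes],
    record: Callable[[float], None],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[bytes]:
    """Yield chunks unchanged and report seconds until the first one.

    The timer starts when the consumer begins iterating, because a
    generator body does not run until the first `next()` call.
    """
    start = clock()
    first = True
    for chunk in chunks:
        if first:
            record(clock() - start)
            first = False
        yield chunk


# Usage with a fake clock so the measured value is deterministic.
seen: list[float] = []
ticks = iter([10.0, 10.25])
streamed = list(observe_ttfb([b"hello", b"world"], seen.append, clock=lambda: next(ticks)))
```

In a real service `record` would most likely be the `observe` method of a histogram metric, so that TTFB gets its own p95 panel separate from whole-stream duration.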
L2 — Dependency health
Three dependencies, some external and some local. For each one, check both error rate and p95 latency:
Supabase: rate(deeppin_supabase_calls_total{result="error"}[5m])
< 0.01/s normal, > 0.05/s Supabase side likely in trouble
SearXNG: rate(deeppin_searxng_calls_total{result=~"timeout|error"}[5m])
occasional timeouts are daily life for meta-search; persistent high = check container logs
Embedding: rate(deeppin_embedding_calls_total{result="error"}[5m])
bge-m3 is local inference; if it errors, the usual cause is a container OOM
L3 — LLM capacity
This is Deeppin's unusual bit: we stack free tiers across multiple providers, so slot-level watermarks actually matter.
- deeppin_llm_slot_score: 0 = exhausted (rate limited), 1 = fresh. Any slot stuck at 0 is worth a look
- rpd_used / rpd_limit: today's request usage ratio — watch when > 80%
- tpd_used / tpd_limit: today's token usage ratio
- rate(deeppin_llm_failures_total[5m]): group by provider — a wide spike means upstream is having a bad day
Common PromQL:
# Remaining capacity per group (> 0.3 = still has room)
avg by (group) (deeppin_llm_slot_score)

# Groups where every slot is exhausted (urgent)
# max == 0 means even the best slot in the group is at 0;
# min == 0 would fire as soon as any single slot is exhausted
max by (group) (deeppin_llm_slot_score) == 0

# Top 10 slots by daily RPD usage
topk(10, deeppin_llm_rpd_used / deeppin_llm_rpd_limit)
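The "watch when > 80%" rule is the same arithmetic as the usage-ratio query, applied per slot. A minimal Python sketch of that check follows, assuming the used and limit values have already been fetched into plain dicts keyed by slot; `slots_over_watermark` and the slot names are hypothetical.

```python
def slots_over_watermark(
    used: dict[str, float],
    limit: dict[str, float],
    watermark: float = 0.8,
) -> list[tuple[str, float]]:
    """Return (slot, used/limit) pairs above the watermark, highest first."""
    flagged = []
    for slot, cap in limit.items():
        if cap <= 0:
            continue  # skip slots without a usable limit
        ratio = used.get(slot, 0.0) / cap
        if ratio > watermark:
            flagged.append((slot, ratio))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical daily request counts for three slots.
rpd_used = {"slot-a": 900, "slot-b": 100, "slot-c": 850}
rpd_limit = {"slot-a": 1000, "slot-b": 1000, "slot-c": 1000}
hot = slots_over_watermark(rpd_used, rpd_limit)
```

Here `hot` lists slot-a and slot-c in that order, which is the same ranking the top-k usage query would surface on the dashboard.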
L4 — Cost trend
No realtime alerts; check weekly or monthly:
# Tokens burned per provider per day
sum by (provider) (increase(deeppin_llm_tokens_total[24h]))

# Calls per provider per day
sum by (provider) (increase(deeppin_llm_calls_total[24h]))
§ 03. SOPs for four real scenarios
Scenario A — User reports "the site is down"
1. Hit the /health endpoint, check the ok state
   → 502 / timeout = frontend-to-backend link is dead
   → 200 but components show red = backend alive, a dependency is down
2. SSH into the production host, inspect container state (docker ps)
3. Tail the backend container logs (docker logs --tail 200)
4. If the process died: docker compose up -d backend
5. Go back to Grafana Overview, confirm Error Rate drops
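The branching in step 1 can be written down as a small classifier. This is a sketch under assumptions: the post does not show the /health response format, so the `{"components": {name: "ok" | ...}}` body shape and the `classify_health` helper are hypothetical.

```python
from typing import Optional


def classify_health(status: Optional[int], body: Optional[dict]) -> str:
    """Map a /health probe result to the first-step diagnosis.

    `status` is the HTTP status code, or None if the request timed out.
    `body` is the parsed JSON response; its shape is assumed here.
    """
    if status is None or status == 502:
        return "frontend-to-backend link is dead"
    if status == 200:
        components = (body or {}).get("components", {})
        red = sorted(name for name, state in components.items() if state != "ok")
        if red:
            return "backend alive, dependency down: " + ", ".join(red)
        return "healthy"
    return f"unexpected status {status}: check backend logs"


# Hypothetical probe result: backend answers, one dependency is red.
verdict = classify_health(200, {"components": {"supabase": "ok", "searxng": "error"}})
print(verdict)  # backend alive, dependency down: searxng
```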
Budget: recover or escalate within 5 minutes.
Scenario B — User reports "chat is hanging forever"
1. Grafana → HTTP row → P95 Latency by Handler
   - /api/threads/*/chat slow: not surprising (SSE long-lived); look at LLM
   - Non-SSE endpoint slow: check Supabase p95
2. LLM Slots row → slot_score heatmap
   - Wide zeros? The group is exhausted, waiting on backoff
   - Individual zeros? SmartRouter should auto-shift slots; users shouldn't notice
3. Check LLM Failures/s by Reason: the spiking provider is the culprit
4. Still lost → hit /health/providers/keys for a zero-quota validation
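The "wide zeros vs individual zeros" reading of the heatmap in step 2 can be sketched in Python. This illustration treats "wide" as "every slot in the group is at 0", which is an assumption; `diagnose_slots` and the sample scores are hypothetical.

```python
def diagnose_slots(scores_by_group: dict[str, list[float]]) -> dict[str, str]:
    """Label each group from its slot scores (0 = exhausted, 1 = fresh)."""
    verdicts = {}
    for group, scores in scores_by_group.items():
        zeros = sum(1 for score in scores if score == 0)
        if scores and zeros == len(scores):
            verdicts[group] = "group exhausted: waiting on backoff"
        elif zeros:
            verdicts[group] = "some slots exhausted: router should shift"
        else:
            verdicts[group] = "ok"
    return verdicts


# Hypothetical snapshot of two groups.
snapshot = {"chat": [0, 0, 0], "summarizer": [0, 0.6, 1.0]}
verdicts = diagnose_slots(snapshot)
```

With this snapshot the chat group is reported as exhausted while summarizer only has individual zeros, which is the case users should not notice.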
Scenario C — Daily walkthrough (5 minutes in the morning)
- Yesterday's daily provider-check workflow result (GitHub Actions)
- Grafana Overview: any error spikes in the last 24h?
- LLM Slots: rpd_used/rpd_limit heatmap — any slot maxed out several days running? Add a key or rebalance groups
- Supabase Calls/s by Table: any table suddenly seeing 10x the usual traffic?
Scenario D — Quota emergency (every chat slot exhausted)
1. Check deeppin_llm_slot_recovery_seconds → when's the earliest slot coming back
2. Look at the fallback chain: chat falls back to summarizer; check its remaining capacity
3. Short-term fix: add a key for the affected provider
- Edit xxx_API_KEYS in compose.env
- docker compose up -d --force-recreate backend
(restart does NOT reload env — you MUST --force-recreate)
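4. Long-term fix: add a new provider to ModelSpec
One gotcha: Python reads env into memory once at process start and never re-reads, and a container's environment is fixed when the container is created. A plain restart brings the process back up inside the same container with the old values, so adding a key requires container recreation, not just restart. This is documented in CLAUDE.md specifically because we burned ourselves on it.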
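The "read once at start" half of this gotcha is easy to reproduce in isolation. The snippet below is a generic illustration of the pattern, not Deeppin's config code; `DEMO_API_KEYS` is a made-up variable name.

```python
import os

os.environ["DEMO_API_KEYS"] = "key-1"

# Typical startup pattern: parse the variable once and keep the result.
API_KEYS = os.environ["DEMO_API_KEYS"].split(",")

# A later change to the environment does not touch the parsed value.
os.environ["DEMO_API_KEYS"] = "key-1,key-2"
print(API_KEYS)  # ['key-1']
```

The value parsed at startup stays stale until a new process starts with the updated environment, which in the container setup means a recreated container.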
§ 04. Dashboard panel cheatsheet
Incident type              → which row to open first
────────────────────────────────────────────────
Site-wide 5xx              → Overview
Specific endpoint slow     → HTTP → P95 by Handler
AI replies wrong/failing   → LLM Slots → Failures/s + slot_score
Search broken              → Components → SearXNG Calls/s by Status
Login / history failing    → Components → Supabase
Attachment upload broken   → Components → Embedding
§ 05. Current blind spots
Re-ranked by operational priority (the previous post covered them; this is the priority order):
- 🔴 No Alertmanager: incidents are discovered either by users or by manual polling. The bare minimum — up==0 for 2m → Telegram bot — should land first
- 🟡 SSE TTFB not instrumented: the metric users feel most ("AI is slow") is the one we can't see
- 🟡 No tracing: for a slow request, we can't tell whether context build, LLM call, or Supabase write is to blame
- 🟢 Logs not in Loki: tail -f app.log is fine at current volume; revisit when traffic grows
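For the missing-alerting item at the top of this list, the "up==0 for 2m" condition boils down to "the latest sample is 0 and has been 0 continuously for at least 120 seconds". The sketch below shows that hold-down logic in plain Python. It approximates the semantics of an alert rule's `for` clause rather than reproducing them exactly, and `is_firing` plus the sample data are hypothetical.

```python
def is_firing(samples: list[tuple[float, float]], hold_seconds: float = 120.0) -> bool:
    """True if the value has been 0 for at least `hold_seconds`.

    `samples` holds (unix_timestamp, value) pairs in ascending time order.
    """
    if not samples or samples[-1][1] != 0:
        return False
    latest_ts = samples[-1][0]
    down_since = latest_ts
    # Walk backwards to find where the current run of zeros began.
    for ts, value in reversed(samples):
        if value != 0:
            break
        down_since = ts
    return latest_ts - down_since >= hold_seconds


# Hypothetical scrape history at 30s intervals: down from t=30 onwards.
history = [(0, 1), (30, 0), (60, 0), (90, 0), (120, 0), (150, 0)]
print(is_firing(history))  # True
```

The hold-down is what keeps a single failed scrape from paging anyone: a brief blip resets the run of zeros and the alert never fires.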
§ 06. Closing
Monitoring has a characteristic ROI curve: all investment upfront (instrumentation, stack setup, dashboard tuning) with no direct value for a long time. Then one day there's an incident and it saves you. After that, every subsequent incident compounds the value.
That's what SOPs are for: they turn "monitoring can save you" from "if you happen to remember where to look that day" into "follow the procedure, locate the issue in three minutes". Dashboards are the tool; the SOP is the muscle memory.