Deeppin's Monitoring Stack: How Prometheus + Grafana Got Wired Up

From hand-rolled /debug endpoints to a real Prometheus + Grafana stack — what metrics are exposed, how to access them, and the gotchas we tripped on.

2026-04-17 · 15 min read · monitoring · prometheus · grafana · ops

Early in Deeppin's life, system visibility came from /health plus a hand-rolled /debug endpoint — uptime, per-slot LLM quota, last few failures. Enough to tell if something was broken right now, but hopeless for questions like "did the 429 rate creep up over the past hour?" On 2026-04-17 we added Prometheus + Grafana to docker-compose. This post walks through how it got wired up.

§ 01 · 1. Overall data flow

┌──────────────┐    /metrics      ┌──────────────┐     datasource      ┌──────────┐
│ backend:8000 │ ───────────────▶ │ prometheus   │ ──────────────────▶ │ grafana  │
│ (FastAPI)    │   scrape 15s     │ :9090 loop   │   PromQL queries    │ :3000    │
└──────────────┘                  └──────────────┘                     └──────────┘
                                        │ 90d / 10GB retention              │
                                        ▼                                   ▼
                                  prometheus_data vol              /grafana/ via nginx

Three deliberate choices: Prometheus binds to 127.0.0.1 only (SSH tunnel to reach it), Grafana sits behind nginx at /grafana/, and backend /metrics is blocked at nginx so only the compose-internal prometheus container can scrape it. Rationale below.

§ 02 · 2. What's instrumented

Three layers, each with a different strategy.

2.1 HTTP layer: automatic

prometheus-fastapi-instrumentator injects middleware at FastAPI startup that records every request's duration in Histograms labelled by handler, method, and status. No hand-written code; every /api/** route is covered.

2.2 Component layer: manual Counter + Histogram

embedding / searxng / supabase calls are instrumented by hand. Example for embedding:

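from prometheus_client import Counter, Histogram

# One counter per outcome; the label value is set at the call site.
EMBEDDING_CALLS = Counter(
    "deeppin_embedding_calls_total",
    "Number of embedding batch calls",
    ["result"],  # ok / error / timeout
)
# Latency buckets in seconds, from 10 ms up to 10 s.
EMBEDDING_DURATION = Histogram(
    "deeppin_embedding_duration_seconds",
    "Embedding batch duration",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
)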

Why not auto-instrument: these are function-level calls, not HTTP events. And we want a result label (ok / error / timeout) that automatic tooling can't infer.
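
A minimal sketch of how such a Counter + Histogram pair might be used at the call site. The `track_embedding` helper and the `embed_batch` stand-in are hypothetical names, not Deeppin's code, and the metrics are registered on a private `CollectorRegistry` only so the snippet runs in isolation.

```python
import time
from contextlib import contextmanager

from prometheus_client import CollectorRegistry, Counter, Histogram

# Private registry so this sketch cannot collide with an app's real metrics.
registry = CollectorRegistry()

EMBEDDING_CALLS = Counter(
    "deeppin_embedding_calls_total",
    "Number of embedding batch calls",
    ["result"],
    registry=registry,
)
EMBEDDING_DURATION = Histogram(
    "deeppin_embedding_duration_seconds",
    "Embedding batch duration",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
    registry=registry,
)


@contextmanager
def track_embedding():
    """Hypothetical helper: time the wrapped block and count its outcome."""
    start = time.perf_counter()
    result = "ok"
    try:
        yield
    except TimeoutError:
        result = "timeout"
        raise
    except Exception:
        result = "error"
        raise
    finally:
        # Runs on success and failure alike, so every call is measured once.
        EMBEDDING_DURATION.observe(time.perf_counter() - start)
        EMBEDDING_CALLS.labels(result=result).inc()


def embed_batch(texts):
    """Stand-in for the real embedding call (an assumption for illustration)."""
    with track_embedding():
        return [[float(len(t))] for t in texts]


vectors = embed_batch(["hello", "hi"])
```

The result label is decided in one place, which is the part automatic HTTP middleware has no way to infer for a plain function call.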

2.3 LLM slot state: custom Collector

This layer is unusual. Each slot in SmartRouter has rpm_used / tpm_used / rpd_used / tpd_used counters that move on every request. Updating Gauges with .set() inside record_request would add write contention and lock overhead for data we don't need at that resolution.

Instead, we use a custom Collector: collect() runs on each Prometheus scrape (every 15s), walks the slots, reads their current state, and yields GaugeMetricFamily values. Each scrape reads one snapshot, and the request path never touches a metric object.

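from prometheus_client.core import GaugeMetricFamily
from prometheus_client.registry import Collector

class _LLMSlotCollector(Collector):
    def collect(self):
        # Runs once per scrape; each field below becomes one GaugeMetricFamily.
        for slot in llm_router.slots:
            key_prefix = slot.api_key[:8]
            labels = [slot.spec.provider, slot.spec.model_id, key_prefix]
            # yield rpm_used / tpm_used / rpd_used / tpd_used
            # yield rpm_limit / ... / tpd_limit
            # yield slot.score()
            # yield slot.seconds_until_recovery()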

key_prefix is the first 8 chars of the API key — lets us distinguish accounts when a provider has multiple keys configured.

§ 03 · 3. Access

  • Grafana: https://deeppin.duckdns.org/grafana/, user admin, bootstrap password in compose.env's GRAFANA_ADMIN_PASSWORD. Changing it in the UI persists to the grafana_data volume.
  • Prometheus UI: 127.0.0.1:9090 (loopback only). Tunnel from your laptop: ssh -L 9090:127.0.0.1:9090 oracle, then open http://localhost:9090
  • Backend /metrics: returns 404 publicly (nginx: location = /metrics { return 404; }). Only the compose-internal prometheus container can scrape backend:8000/metrics.

Why Prometheus isn't public: it ships with no authentication enabled by default, so exposing it as-is lets anyone pull historical metrics and run arbitrary PromQL. Not worth the risk. Grafana has accounts, so it can be public.

§ 04 · 4. The Grafana sub-path trap

Grafana runs with GF_SERVER_SERVE_FROM_SUB_PATH=true + GF_SERVER_ROOT_URL=https://deeppin.duckdns.org/grafana/, and nginx proxies it at location /grafana/. Looks straightforward. First deploy: instant login redirect loop.

The trap is the trailing slash on proxy_pass:

# Wrong: nginx strips /grafana/ prefix, proxy hits http://grafana:3000/
location /grafana/ {
    proxy_pass http://grafana:3000/;
}

# Right: preserve /grafana/ prefix
location /grafana/ {
    proxy_pass http://grafana:3000;
}

SERVE_FROM_SUB_PATH=true means Grafana expects to receive the full /grafana/ path. Strip the prefix at nginx and Grafana sees /, decides the user is unauthenticated, and 302-redirects to /grafana/login (since it thinks its root is /grafana/). The browser follows the redirect, nginx strips the prefix again, and Grafana is once more handed a path without /grafana/: an infinite loop.
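
The trailing-slash behaviour can be modelled in a few lines. This is a simplified sketch of nginx's rule for prefix locations only (it ignores regex locations, variables, and URI normalization); `upstream_path` is a hypothetical helper written for this post, not part of any nginx tooling.

```python
from urllib.parse import urlsplit


def upstream_path(location: str, proxy_pass: str, request_path: str) -> str:
    """Return the path the upstream would receive for a prefix location."""
    if not request_path.startswith(location):
        raise ValueError("request does not match this location")
    # Everything after host:port counts as the URI part, even a lone "/".
    uri_part = urlsplit(proxy_pass).path
    if uri_part == "":
        # No URI part: the request path is forwarded unchanged.
        return request_path
    # URI part present: the matched location prefix is replaced by it.
    return uri_part + request_path[len(location):]


stripped = upstream_path("/grafana/", "http://grafana:3000/", "/grafana/login")
preserved = upstream_path("/grafana/", "http://grafana:3000", "/grafana/login")
print(stripped)   # /login
print(preserved)  # /grafana/login
```

With the trailing slash Grafana is handed /login, which sits outside the sub-path it was configured to serve; without it, the path arrives intact.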

§ 05 · 5. The No-Data dashboard trap

Second trap: the dashboard JSON hardcoded datasource uid: "prometheus", but on first boot Grafana auto-generated a hash UID for the provisioned datasource. Every panel's query referenced a datasource UID that didn't exist → No data everywhere.

Fix is to pin the UID in datasources/prometheus.yml and add a deleteDatasources block to force-rebuild stale ones:

apiVersion: 1
deleteDatasources:
  - name: Prometheus
    orgId: 1
datasources:
  - name: Prometheus
    uid: prometheus   # fixed UID so panel JSON can reference it
    type: prometheus
    url: http://prometheus:9090
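
A mismatch like this can be caught before deploy with a small check. The sketch below assumes dashboards reference datasources as objects with a `uid` field; the sample dashboard, its panel titles, and the hash-style UID are made up for illustration, and `unknown_uids` is a hypothetical helper.

```python
import json


def datasource_uids(node):
    """Yield every datasource UID referenced anywhere in a dashboard JSON tree."""
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and "uid" in ds:
            yield ds["uid"]
        for value in node.values():
            yield from datasource_uids(value)
    elif isinstance(node, list):
        for item in node:
            yield from datasource_uids(item)


def unknown_uids(dashboard, provisioned):
    """UIDs the dashboard references that no provisioned datasource defines."""
    return set(datasource_uids(dashboard)) - set(provisioned)


# Made-up dashboard: one panel uses the pinned UID, one a stale hash UID.
dashboard = json.loads("""
{
  "panels": [
    {"title": "429 rate",
     "datasource": {"type": "prometheus", "uid": "prometheus"},
     "targets": [{"datasource": {"type": "prometheus", "uid": "prometheus"},
                  "expr": "up"}]},
    {"title": "Stale panel",
     "datasource": {"type": "prometheus", "uid": "P1809F7CD0C75ACF3"}}
  ]
}
""")

missing = unknown_uids(dashboard, {"prometheus"})
print(missing)  # {'P1809F7CD0C75ACF3'}
```

Any UID left in the result would render as No data, so a non-empty set is a reasonable condition to fail a deploy check on.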

§ 06 · 6. What's not done yet

  • Alertmanager: metrics are collected, but nothing alerts on them. The first rule we want is up{job="deeppin-backend"} == 0 for 2m, firing to the Telegram bot; it is not wired yet.
  • Log aggregation: no Loki / ELK. Still tail -f /app/logs/app.log.
  • Tracing: no OpenTelemetry / Jaeger. Request traces are invisible.
  • Automatic backup: prometheus_data / grafana_data are not backed up. Historical metrics are already capped at 90d; dashboards come back via provisioning.

The rule is simple: what you can't see, you can't improve. But adding complexity too early is its own failure mode. Get the dashboards running, feed them a few days of real traffic, see which panels are noise and which are missing — then decide whether alerting and tracing are worth their weight.