
Stress-Testing SmartRouter: 2.59M Theoretical TPM — How Much Do We Actually Get?

Drive all four SmartRouter groups (chat / merge / summarizer / vision) in parallel and the gap between theoretical and realized capacity becomes very visible. At ~600 tok/req: 275K TPM (11% of theoretical, RPM-bound). At ~5500 tok/req: 822K TPM (32%, TPM-bound). Three structural bottlenecks surfaced: vision has no fallback chain, SambaNova latency explodes to 100s under load, and Gemini is severely under-weighted in score routing.

2026-05-01 · 31 min read · SmartRouter · stress-test · benchmarks · capacity

An earlier post (free-tier-capacity) summed ModelSpec.tpm × num_keys across all slots and got ~2.59M TPM. It is a satisfying number, but it assumes every slot is maxed out simultaneously. In reality RPM saturates first, groups compete for the same shared slots, providers degrade under load, and score routing has its own preferences. Time to actually run a load test and see what SmartRouter delivers.

1. How the Theoretical Ceiling Is Computed

Sum over unique slots — a (provider, model) that lives in multiple groups (e.g., gemini-2.5-flash sits in chat / merge / vision) only counts once because the same Slot instance is referenced from every _slots_by_group[group] list, with one shared UsageBucket.

system ceiling = Σ over unique (provider, model) :  spec.tpm × num_keys[provider]

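Production keys: groq=3, cerebras=3, sambanova=3, gemini=2, openrouter=3. Resulting ceiling (per-key TPM is the sum over that provider's models, which is why Gemini shows 500K per key against 250K per slot):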

provider     per-key TPM   × keys    contribution
groq         70K           × 3       210K
cerebras     120K          × 3       360K
sambanova    300K          × 3       900K
gemini       500K          × 2       1,000K
openrouter   40K           × 3       120K
────────────────────────────────────────────
total                                 2,590K TPM

corresponding RPM ceiling: 1,280

Key observation: to saturate TPM and RPM simultaneously, the average request needs 2,023 tokens (2.59M / 1,280). Gemini's per-slot ratio is the most extreme — 250K TPM at 10 RPM means each request needs 25K tokens to push both in sync. Conversation-sized requests are a few hundred tokens, so Gemini will always hit RPM first.

Note (Gemini same-project caveat): if both GEMINI_API_KEYS belong to the same GCP project, the quota is project-scoped and does not multiply. The theoretical ceiling drops to ~2.09M in that case. Which case applies depends on how the keys were provisioned.
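The ceiling arithmetic is easy to reproduce. A minimal sketch, assuming the per-provider figures from the table above; the dictionary and function names are illustrative and not taken from the SmartRouter codebase, and the RPM ceiling is copied from the text, not derived:

```python
# Per-key TPM and key counts, copied from the table above.
PER_KEY_TPM = {"groq": 70_000, "cerebras": 120_000, "sambanova": 300_000,
               "gemini": 500_000, "openrouter": 40_000}
NUM_KEYS = {"groq": 3, "cerebras": 3, "sambanova": 3, "gemini": 2, "openrouter": 3}
RPM_CEILING = 1_280  # stated in the article, not derived here


def tpm_ceiling(per_key_tpm: dict[str, int], num_keys: dict[str, int]) -> int:
    """Sum per-key TPM times key count over all providers."""
    return sum(tpm * num_keys[provider] for provider, tpm in per_key_tpm.items())


ceiling = tpm_ceiling(PER_KEY_TPM, NUM_KEYS)
# Same-project caveat: two Gemini keys in one GCP project behave like one key.
same_project = tpm_ceiling(PER_KEY_TPM, {**NUM_KEYS, "gemini": 1})
# Average request size at which TPM and RPM would saturate together.
break_even_tokens = ceiling / RPM_CEILING

print(ceiling, same_project, round(break_even_tokens))
```

This prints 2590000, 2090000 and 2023, matching the three figures quoted in this section.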

2. Methodology

I wrote stress_tpm.py (~260 lines) with three modes: theoretical (consumes no quota, just prints the config), single (hits each slot directly through litellm, bypassing SmartRouter), and aggregate / total (goes through router.completion). This run uses --mode total: all four groups in parallel, 30 in-flight per group, 90 seconds.
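The script itself is not reproduced here. As a rough sketch of the concurrency shape of the total mode (a fixed number of workers per group, each keeping one request in flight until a deadline), something like the following would do. Note that fake_completion is a hypothetical stand-in for the real router call, and all names are illustrative:

```python
import asyncio
import time
from collections import Counter

GROUPS = ("chat", "merge", "summarizer", "vision")


async def fake_completion(group: str) -> None:
    """Hypothetical stand-in for the real router call; swap in the actual client."""
    await asyncio.sleep(0.01)


async def worker(group: str, deadline: float, done: Counter) -> None:
    # Each worker keeps exactly one request in flight until the deadline passes.
    while time.monotonic() < deadline:
        await fake_completion(group)
        done[group] += 1


async def run_total(in_flight: int = 30, duration_s: float = 90.0) -> Counter:
    """Drive every group concurrently with `in_flight` workers per group."""
    done: Counter = Counter()
    deadline = time.monotonic() + duration_s
    tasks = [worker(g, deadline, done) for g in GROUPS for _ in range(in_flight)]
    await asyncio.gather(*tasks)
    return done
```

Calling asyncio.run(run_total()) with the defaults mirrors the 30 in-flight, 90 second setup. Requests already in flight at the deadline still complete, which is one reason completions can land in later minute buckets.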

Key design decisions:

  • Reuses the prod SmartRouter singleton: imports services.llm_client.router. Slot picking, score, 429 retry, fallback chain: all prod logic. The script runs via docker exec in a fresh process, so its UsageBucket state is independent of the live server's, but provider-side quota is shared
  • Real token counts: read response.usage.prompt_tokens + completion_tokens (the same numbers the provider uses for quota accounting). No estimation
  • Request size control is estimated: pre-flight ~4 chars/token padding, actual prompt_tokens varies ±20% — but this only affects 'how much we thought we sent,' not 'how much actually went into the TPM bucket'
  • TPM bucketed by completion time: m = int(t // 60). Each minute bucket sums prompt_tokens + completion_tokens, successful requests only. 429s and errors do not contribute tokens
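The bucketing rule in the last bullet can be sketched as follows. This is an illustration under assumptions: the Result record and its field names are hypothetical, and the rpm column is assumed to count successful requests only, as the tables below suggest:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    t: float                    # seconds since test start, at completion time
    status: int                 # 200 on success, 429 when rate-limited
    prompt_tokens: int = 0
    completion_tokens: int = 0


def bucket_by_minute(results: list[Result]) -> dict[int, dict[str, int]]:
    buckets: dict[int, dict[str, int]] = defaultdict(
        lambda: {"rpm": 0, "in_tpm": 0, "out_tpm": 0, "429": 0})
    for r in results:
        m = int(r.t // 60)  # minute index of the completion timestamp
        if r.status == 200:
            buckets[m]["rpm"] += 1
            buckets[m]["in_tpm"] += r.prompt_tokens
            buckets[m]["out_tpm"] += r.completion_tokens
        elif r.status == 429:
            buckets[m]["429"] += 1  # counted, but contributes no tokens
    return dict(buckets)
```

Other error statuses are ignored entirely in this sketch, so they affect neither the request count nor the token totals.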

To probe both RPM-bound and TPM-bound boundaries, two configs:

  • Config A: input 400 / output 200, ~600 tok/req — typical conversation
  • Config B: input 4000 / output 1500, ~5500 tok/req — long context (attachments, merge, etc.)
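For the request-size control, a sketch of the ~4 chars/token padding heuristic might look like this. The filler text, the CONFIGS layout, and the function name are all assumptions made for illustration; only the token targets come from the two configs above:

```python
CHARS_PER_TOKEN = 4  # rough pre-flight heuristic; real tokenizers vary


def pad_prompt(target_tokens: int, filler: str = "lorem ipsum ") -> str:
    """Build a prompt of roughly `target_tokens` tokens by repeating filler text."""
    target_chars = target_tokens * CHARS_PER_TOKEN
    repeats = target_chars // len(filler) + 1
    return (filler * repeats)[:target_chars]


CONFIGS = {
    "A": {"input_tokens": 400, "max_output_tokens": 200},
    "B": {"input_tokens": 4000, "max_output_tokens": 1500},
}
prompt_a = pad_prompt(CONFIGS["A"]["input_tokens"])
prompt_b = pad_prompt(CONFIGS["B"]["input_tokens"])
```

The character count is exact, but the resulting token count is not, which is why the measured numbers are read from response.usage and not from this estimate.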

3. Config A: Conversation-Sized Requests

min  rpm   in_tpm    out_tpm   total_tpm   429
  0  691   249,751    25,606    275,357   473
  1  248    87,524     6,217     93,741   264
  2   95    35,807     4,287     40,094     0
  3   10     3,583       217      3,800     0
  4   14     5,022       491      5,513     0
  6   30    10,880       983     11,863     0

  peak TPM observed: 275,357   (11% of theoretical 2,590,000)
  peak RPM observed:     691   (54% of theoretical 1,280)

Peak TPM 275K (11% of theoretical), peak RPM 691 (54%). **RPM is the real bottleneck**; TPM utilization is only 11%. Matches intuition: 600 tok/req × 1,280 RPM = 768K TPM physical ceiling, far below the 2.59M theoretical — small requests always saturate RPM first.
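The RPM-bound versus TPM-bound distinction reduces to taking the smaller of two products. A small sketch, using the two ceilings quoted in this article (the function name is illustrative):

```python
TPM_CEILING = 2_590_000
RPM_CEILING = 1_280


def realizable_tpm(tokens_per_request: int) -> tuple[int, str]:
    """Return the best-case TPM for a given request size and which limit binds."""
    rpm_bound = tokens_per_request * RPM_CEILING
    if rpm_bound < TPM_CEILING:
        return rpm_bound, "RPM-bound"
    return TPM_CEILING, "TPM-bound"


print(realizable_tpm(600))    # (768000, 'RPM-bound')
print(realizable_tpm(5500))   # (2590000, 'TPM-bound')
```

The crossover sits at roughly 2,023 tokens per request, so Config A at ~600 tokens falls on the RPM side and a ~5,500-token request falls on the TPM side.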

Less obvious findings:

  • m=0 → m=2 sharp decay: 691 → 248 → 95 RPM. The first 60s burnt through many slots' rolling 60s windows; recovery in minutes 2–3 was incomplete
  • Vision wiped out: 737 × 429, 0 success. chat / merge fight over Gemini-flash + Groq-scout, and vision has no fallback chain (FALLBACK_CHAIN["vision"] = []), so it gets nothing
  • Gemini severely under-utilized: 50 reqs / 21K tokens across the two Gemini slots (25 reqs each), below the 25 RPM × 1.5min ≈ 37-req physical max. SmartRouter does pick Gemini, but occasional 503s + high latency drag down its score
  • SambaNova's published RPD is 20, yet one slot (sambanova/Meta-Llama-3.3 in the table below) completed 84 requests in this run. Either the docs are stale, or SN counts at org level rather than per key
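The decay in the first bullet follows from how a rolling 60-second window behaves after a burst. A minimal sketch, assuming sliding-window accounting (the class below is illustrative and is not the actual UsageBucket):

```python
from collections import deque


class RollingWindow:
    """Counts events in the trailing `window_s` seconds."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit
        self.window_s = window_s
        self.events: deque[float] = deque()

    def used(self, now: float) -> int:
        while self.events and now - self.events[0] >= self.window_s:
            self.events.popleft()  # expire events older than the window
        return len(self.events)

    def try_acquire(self, now: float) -> bool:
        if self.used(now) >= self.limit:
            return False  # the caller would see this as a 429 or a zero score
        self.events.append(now)
        return True
```

With a limit of 10, ten requests in the first ten seconds lock the slot until the oldest one ages out a full minute later, so a burst at the start of the test leaves little capacity for the following minute.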

Per-slot contribution (top 8, plus three notable slots)

slot                                  reqs   tokens   p50_ms
cerebras/llama3.1-8b                   226   85,626      351
groq/llama-4-scout                     234   83,285      506
groq/llama-3.1-8b-instant              109   41,247      395
groq/llama-3.3-70b                      85   32,202      306
sambanova/Meta-Llama-3.3                84   31,920      709
sambanova/DeepSeek-V3.2                 55   21,120      863
sambanova/Llama-4-Maverick              55   19,195      776
groq/openai/gpt-oss-20b                 38   21,778      442
gemini/gemini-2.5-flash                 25   13,099    2,250  ← under-used
gemini/gemini-2.5-flash-lite            25    8,481      607  ← under-used
cerebras/qwen-3-235b                     8    2,820  226,931  ← latency blowup

4. Config B: Long-Context Requests

min  rpm   in_tpm     out_tpm   total_tpm   429
  0  258   812,546     9,978    822,524   407
  1   63   191,476     1,232    192,708   269
  2  112   346,489     3,893    350,382     0
  3   18    54,752       289     55,041     0
  4   12    36,501       646     37,147     0
  5    2     6,066        26      6,092     0
  6   36   109,657       398    110,055     0

  peak TPM observed: 822,524   (32% of theoretical 2,590,000)
  peak RPM observed:     258   (20% of theoretical 1,280)

Peak TPM 822K (32% of theoretical, 3× over A); peak RPM 258 (lower because each request takes longer). Now **TPM is finally the bottleneck**, but still only 1/3 of theoretical.

Surprises:

  • m=2 rebound to 350K TPM: m=0 burnt the fast slots' rolling windows, by m=2 their windows fully reset and the long tail of large requests started landing. This is closer to the true sustained TPM
  • SambaNova p50 = 104s: content goes through, but each request takes 100+ seconds. Real users would have timed out long before. SN's quota at high-concurrency large-request load is effectively 'unusable capacity'
  • Cerebras qwen-3-235b p50 = 107s: same story as SN, looks like internal queueing
  • Vision wiped out again: 676 × 429. Missing fallback shows up consistently in both A and B
  • Gemini still under-used: 20 reqs / 64K tokens, while 250K TPM × 90s × 2 keys / 60 = 750K tokens physical max — 8.5% utilization
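The Gemini utilization figure in the last bullet can be checked in a few lines. The helper name is illustrative; the inputs are the numbers quoted above:

```python
def window_capacity_tokens(tpm_per_key: int, keys: int, window_s: float) -> float:
    """Tokens a slot could absorb in `window_s` seconds at its full TPM quota."""
    return tpm_per_key * keys * window_s / 60


capacity = window_capacity_tokens(250_000, keys=2, window_s=90)
utilization = 64_000 / capacity
print(f"{capacity:,.0f} tokens possible, {utilization:.1%} used")
```

This prints 750,000 tokens possible and 8.5% used, in line with the bullet.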

5. Three Structural Bottlenecks

5.1 Vision has no fallback chain

# backend/services/llm_client.py
FALLBACK_CHAIN: dict[str, list[str]] = {
    "chat": ["summarizer"],
    "merge": ["chat", "summarizer"],
    "summarizer": ["chat"],
    "vision": [],     # ← here
}

Vision has only 2 slots: gemini-2.5-flash and groq/llama-4-scout. Both also serve chat and merge. Once chat / merge drain those slots, vision has nowhere to go. A single line, FALLBACK_CHAIN["vision"] = ["chat"], would open up the chat group's other vision-capable slots as backup capacity (bounded by whatever share of the 2.59M system ceiling those slots hold).

Note: this needs verification first. Which chat-group slots actually accept image input? If a slot we add as vision fallback doesn't support images, calls will 400 instead of working. The fix must be paired with a careful audit of ModelSpec.groups vision tagging.
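One way to pair the fallback change with that audit is to filter fallback candidates by an explicit capability flag. The sketch below is hypothetical: SlotInfo and accepts_images are invented for illustration and do not exist in the codebase as far as this article shows, and the slot list in the usage note is sample data, not the audit result:

```python
from dataclasses import dataclass

FALLBACK_CHAIN: dict[str, list[str]] = {
    "chat": ["summarizer"],
    "merge": ["chat", "summarizer"],
    "summarizer": ["chat"],
    "vision": ["chat"],  # the proposed change
}


@dataclass(frozen=True)
class SlotInfo:
    name: str
    groups: frozenset[str]
    accepts_images: bool  # hypothetical flag, to be filled in by the audit


def candidate_slots(group: str, slots: list[SlotInfo]) -> list[SlotInfo]:
    """Slots for `group`, then its fallback groups; vision skips image-incapable slots."""
    ordered: list[SlotInfo] = []
    for g in [group, *FALLBACK_CHAIN.get(group, [])]:
        for s in slots:
            if g not in s.groups or s in ordered:
                continue
            if group == "vision" and not s.accepts_images:
                continue  # would otherwise fail with a 400 on image input
            ordered.append(s)
    return ordered
```

Given one slot tagged for vision, one text-only chat slot, and one image-capable chat slot, candidate_slots("vision", ...) returns the first and third in that order, so a vision request can spill into chat without ever reaching a text-only model.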

5.2 SambaNova latency explodes under load

In Config B, sambanova/Meta-Llama-3.3 had p50 = 104 seconds. That is p50, not p99: half the requests took 100+ seconds. SambaNova appears to queue internally under high concurrency (presumably too few inference nodes), and instead of returning 429 it silently makes the request wait.

The current score function looks at quota dimensions only (rpm / tpm / rpd), not latency:

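def score(self, spec):
    # Fraction of each quota dimension still unused, floored at 0.
    rpm_r = max(0, spec.rpm - self.rpm_used) / spec.rpm
    tpm_r = max(0, spec.tpm - self.tpm_used) / spec.tpm
    rpd_r = max(0, spec.rpd - self.rpd_used) / spec.rpd
    # The scarcest dimension decides the score; latency is not an input.
    return min(rpm_r, tpm_r, rpd_r)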

Result: SN looks like it has ample quota (high RPD, high TPM), so SmartRouter keeps dispatching to it — and every request takes 100s. Fix direction: track p95 latency in UsageBucket, multiply score by a penalty when it crosses a threshold (e.g., 10s). The threshold needs care, though: cold-start models can take 5–10s on the first request and you don't want to permanently exile them.
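A sketch of what such a latency dimension could look like, under stated assumptions: the window size, the minimum sample count, and the threshold / p95 penalty shape are all illustrative choices, and LatencyTracker is not an existing class in the codebase:

```python
from collections import deque
from typing import Optional


class LatencyTracker:
    """Keeps the most recent request latencies for one slot."""

    def __init__(self, window: int = 50, min_samples: int = 5) -> None:
        self.samples: deque = deque(maxlen=window)
        self.min_samples = min_samples

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def p95(self) -> Optional[float]:
        if len(self.samples) < self.min_samples:
            return None  # too little data: do not punish a cold start
        ordered = sorted(self.samples)
        return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]


def latency_penalty(p95_s: Optional[float], threshold_s: float = 10.0) -> float:
    """Multiplier in (0, 1]: 1.0 up to the threshold, shrinking as p95 grows past it."""
    if p95_s is None or p95_s <= threshold_s:
        return 1.0
    return threshold_s / p95_s


def adjusted_score(quota_score: float, tracker: LatencyTracker) -> float:
    return quota_score * latency_penalty(tracker.p95())
```

With a p95 of 104 s and a 10 s threshold, the multiplier is about 0.1, so a slot with ample quota still ranks below a fast slot that is half depleted. The bounded window lets a slot recover once its recent requests speed up again, and the minimum sample count keeps a single slow cold-start request from being penalized.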

5.3 Gemini is under-weighted in score routing

Gemini has the largest per-slot capacity (250K TPM × 2 keys = 500K, 19% of system total), but in both runs it contributed only 2–3% of the total tokens. Quota was not the issue — Gemini's 250–1000 RPD is nowhere near depleted in 90s. SmartRouter's score simply ranked it low.

Three compounding factors:

  • Gemini's occasional 503 high-demand: record_failure sets _last_fail_ts to now, and the score gets multiplied by a penalty over a 30-second half-life
  • Gemini latency is structurally higher than Cerebras / Groq (2–3s vs 300–700ms)
  • Hard 10 RPM cap: 10 requests in a minute and score = 0; the remaining 50s is wasted

The compounding effect: Gemini gets picked occasionally, runs slowly, scores poorly, and combined with the very low RPM cap, SmartRouter's 'pick the best' machinery routes most traffic to Cerebras / Groq — neither of which has Gemini's headroom. Improvement direction: faster decay on fail penalty for high-TPM slots (10s half-life instead of 30s), or capacity-tiered grace periods.
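To see what shortening the half-life would buy, here is a sketch of one plausible penalty shape. The exact formula in record_failure is not shown in this article, so the form below (remaining penalty halves every half-life) is an assumption:

```python
def fail_penalty(seconds_since_failure: float, half_life_s: float) -> float:
    """Score multiplier in [0, 1): 0 right after a failure, recovering toward 1.

    Assumed form: the remaining penalty halves every `half_life_s` seconds.
    """
    return 1.0 - 0.5 ** (seconds_since_failure / half_life_s)


for elapsed in (5, 10, 30, 60):
    slow = fail_penalty(elapsed, half_life_s=30.0)
    fast = fail_penalty(elapsed, half_life_s=10.0)
    print(f"{elapsed:>3}s after failure: 30s half-life x{slow:.2f}, 10s half-life x{fast:.2f}")
```

Under this assumed form, ten seconds after a 503 the slot is back to half its score with a 10 s half-life, but only about a fifth with a 30 s half-life. For a slot capped at 10 RPM, that difference covers a large part of each minute.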

6. How to Read These Numbers

'800K TPM peak' is not 'we can sustain 800K TPM in production.' Three translations are needed:

  • Burst vs sustained: m=0 is the 60-second burst peak; m=2 onward is closer to true sustained. Config B's m=2 of 350K TPM is the 'sustainable for >1 minute' ceiling
  • Test scenario vs real traffic: the test drives all four groups simultaneously, but real traffic is 90%+ chat. Merge / summarizer / vision rarely run hot at once
  • Usable vs nominal capacity: SambaNova nominally contributed 318K tokens under load, but p50 = 104s — to real users, that's effectively unusable capacity

Real usable estimate:

actual sustainable TPM ≈
  system steady-state 350K (Config B m=2)
  × 0.7 (deduct SambaNova 'unusable' share)
  × 0.9 (after fixing vision fallback)
  ≈ 220K TPM sustained

at ~2.5K tok/msg: ~88 msg/min, i.e. ~5,300 msg/hour sustained
at 6 turn/user/day, peak hour = 20% of daily: ~440 peak-hour DAU
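The throughput part of this estimate, worked through step by step. The discount factors are the article's own judgment calls, not measured values, and the variable names are illustrative:

```python
steady_state_tpm = 350_000      # Config B, minute 2
sambanova_discount = 0.7        # drop the share served at 100 s+ latency
vision_factor = 0.9             # the factor applied above for the vision fallback fix
tokens_per_message = 2_500

sustained_tpm = steady_state_tpm * sambanova_discount * vision_factor
messages_per_minute = sustained_tpm / tokens_per_message
messages_per_hour = messages_per_minute * 60

print(round(sustained_tpm), round(messages_per_minute), round(messages_per_hour))
```

At 2.5K tokens per message, roughly 220K TPM works out to about 88 messages per minute, or about 5,300 per hour.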

This looks much smaller than the 9,000–15,000 DAU figure in free-tier-capacity — but that one assumed users spread evenly across 24 hours and counted total daily quota. This number is peak-hour concurrent users the system can handle without degradation. Both are correct; they answer different questions.

7. Next Steps

  • Now (one-line change): FALLBACK_CHAIN["vision"] = ["chat"], paired with validation of which chat-group slots are actually vision-capable
  • Short-term: add latency tracking to UsageBucket + a latency dimension in score, fixing the SambaNova queueing problem
  • Short-term: shorten fail-penalty decay for high-TPM slots like Gemini from 30s to 10s — should noticeably increase its actual utilization
  • Mid-term: run a daily check to remeasure SambaNova's real RPD, and update ModelSpec from published 20 to the measured value (if docs really are stale)
  • Mid-term: group-typed score weighting — latency-sensitive groups (chat) prefer fast slots; batch-style groups (merge) prefer high-TPM slots
  • Long-term: local quantized inference (llama.cpp) as ultimate fallback; a paid Gemini key in a separate GCP project to break past the 250K shared-project ceiling

A few observations deserve their own write-ups: 'why SmartRouter knows a slot is slow but can't down-rank it' (latency-blindness as a structural blind spot in the score), 'how big the gap is between provider documentation and actual enforced RPD' (compliance-as-published vs compliance-as-enforced), and the most counter-intuitive one — 'the bigger a slot's theoretical capacity, the lower its actual utilization tends to be' (the Gemini paradox). Next time.