Stress-Testing SmartRouter: 2.59M Theoretical TPM — How Much Do We Actually Get?
Drive all four SmartRouter groups (chat / merge / summarizer / vision) in parallel and the gap between theoretical and realized capacity becomes very visible. At ~600 tok/req: 275K TPM (11% of theoretical, RPM-bound). At ~5500 tok/req: 822K TPM (32%, TPM-bound). Three structural bottlenecks surfaced: vision has no fallback chain, SambaNova latency explodes to 100s under load, and Gemini is severely under-weighted in score routing.
An earlier post (free-tier-capacity) summed ModelSpec.tpm × num_keys across all slots and got ~2.59M TPM. That is a satisfying number, but it assumes every slot is maxed out simultaneously. In reality RPM saturates first, groups compete for the same shared slots, providers degrade under load, and score routing has its own preferences. Time to actually run a load test and see what SmartRouter delivers.
1. How the Theoretical Ceiling Is Computed
Sum over unique slots — a (provider, model) that lives in multiple groups (e.g., gemini-2.5-flash sits in chat / merge / vision) only counts once because the same Slot instance is referenced from every _slots_by_group[group] list, with one shared UsageBucket.
system ceiling = Σ over unique (provider, model) : spec.tpm × num_keys[provider]
Production keys: groq=3, cerebras=3, sambanova=3, gemini=2, openrouter=3. Resulting ceiling:
| provider | per-key TPM | keys | contribution |
|---|---|---|---|
| groq | 70K | 3 | 210K |
| cerebras | 120K | 3 | 360K |
| sambanova | 300K | 3 | 900K |
| gemini | 500K | 2 | 1,000K |
| openrouter | 40K | 3 | 120K |
| total | | | 2,590K TPM |

Corresponding RPM ceiling: 1,280
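The arithmetic behind the ceiling can be reproduced in a few lines. This is a minimal sketch, not the real ModelSpec code: `PER_KEY_TPM` and `NUM_KEYS` simply restate the figures above, and `unique_slots` with its `(provider, model, group)` tuples is a hypothetical helper that illustrates the "count each shared slot once" rule.

```python
# Per-key TPM summed over each provider's unique models (figures from the post).
PER_KEY_TPM = {"groq": 70_000, "cerebras": 120_000, "sambanova": 300_000,
               "gemini": 500_000, "openrouter": 40_000}
NUM_KEYS = {"groq": 3, "cerebras": 3, "sambanova": 3, "gemini": 2, "openrouter": 3}


def system_ceiling(per_key_tpm, num_keys):
    """Sum of per-key TPM x number of keys over all providers."""
    return sum(per_key_tpm[p] * num_keys[p] for p in per_key_tpm)


def unique_slots(memberships):
    """Collapse (provider, model, group) rows to unique (provider, model) pairs."""
    return {(provider, model) for provider, model, _group in memberships}


# Hypothetical membership rows: one model listed in three groups is one slot.
memberships = [
    ("gemini", "gemini-2.5-flash", "chat"),
    ("gemini", "gemini-2.5-flash", "merge"),
    ("gemini", "gemini-2.5-flash", "vision"),
]

ceiling = system_ceiling(PER_KEY_TPM, NUM_KEYS)
print(ceiling)                         # 2590000
print(len(unique_slots(memberships)))  # 1
```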
Key observation: to saturate TPM and RPM simultaneously, the average request needs 2,023 tokens (2.59M / 1,280). Gemini's per-slot ratio is the most extreme — 250K TPM at 10 RPM means each request needs 25K tokens to push both in sync. Conversation-sized requests are a few hundred tokens, so Gemini will always hit RPM first.
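The break-even point and the resulting effective ceiling can be worked out directly. The sketch below assumes the simplest possible model, in which throughput is capped by whichever of the TPM limit or RPM × tokens-per-request is smaller; the function names are illustrative and not part of SmartRouter.

```python
def breakeven_tokens(tpm: int, rpm: int) -> float:
    """Tokens per request at which TPM and RPM saturate together."""
    return tpm / rpm


def effective_tpm(tpm: int, rpm: int, tokens_per_request: int) -> int:
    """Upper bound on tokens/minute for a given request size."""
    return min(tpm, rpm * tokens_per_request)


SYSTEM_TPM, SYSTEM_RPM = 2_590_000, 1_280

print(round(breakeven_tokens(SYSTEM_TPM, SYSTEM_RPM)))  # 2023
print(breakeven_tokens(250_000, 10))                    # 25000.0 (Gemini slot)
print(effective_tpm(SYSTEM_TPM, SYSTEM_RPM, 600))       # 768000  (RPM-bound)
print(effective_tpm(SYSTEM_TPM, SYSTEM_RPM, 5_500))     # 2590000 (TPM-bound)
```

Under this model, ~600-token requests can never exceed 768K TPM system-wide, while ~5,500-token requests run into the TPM limit before the RPM limit.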
2. Methodology
Wrote stress_tpm.py (~260 lines). Three modes: theoretical (zero quota, prints config), single (each slot bypassing SmartRouter via direct litellm), aggregate / total (through router.completion). This run uses --mode total: all four groups in parallel, 30 in-flight per group, 90 seconds.
Key design decisions:
- Reuses the prod SmartRouter singleton: imports services.llm_client.router. Slot picking, score, 429 retry, fallback chain — all prod logic. docker exec is a fresh process so UsageBucket state is independent, but provider-side quota is shared
- Real token counts: read response.usage.prompt_tokens + completion_tokens (the same numbers the provider uses for quota accounting). No estimation
- Request size control is estimated: pre-flight ~4 chars/token padding, actual prompt_tokens varies ±20% — but this only affects 'how much we thought we sent,' not 'how much actually went into the TPM bucket'
- TPM bucketed by completion time: m = int(t // 60). Each minute bucket sums prompt_tokens + completion_tokens, successful requests only. 429s and errors do not contribute tokens
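The minute-bucketing rule in the last bullet can be sketched as follows. This is an illustration rather than the actual stress_tpm.py code: the record tuple layout, the `status` strings, and the assumption that the rpm column counts successful requests only are all mine.

```python
from collections import defaultdict


def bucket_by_minute(records, start_ts=0.0):
    """Aggregate (completion_ts, prompt_tokens, completion_tokens, status) rows per minute."""
    buckets = defaultdict(
        lambda: {"rpm": 0, "in_tpm": 0, "out_tpm": 0, "total_tpm": 0, "n_429": 0}
    )
    for ts, prompt_tokens, completion_tokens, status in records:
        m = int((ts - start_ts) // 60)  # minute index, keyed by completion time
        row = buckets[m]
        if status == "ok":
            row["rpm"] += 1
            row["in_tpm"] += prompt_tokens
            row["out_tpm"] += completion_tokens
            row["total_tpm"] += prompt_tokens + completion_tokens
        elif status == "429":
            row["n_429"] += 1
        # Other errors contribute neither requests nor tokens.
    return dict(buckets)


sample = [
    (5.0, 400, 200, "ok"),
    (59.9, 380, 210, "ok"),
    (61.0, 0, 0, "429"),
    (130.0, 410, 190, "ok"),
]
stats = bucket_by_minute(sample)
print(stats[0]["total_tpm"], stats[1]["n_429"], stats[2]["total_tpm"])  # 1190 1 600
```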
To probe both RPM-bound and TPM-bound boundaries, two configs:
- Config A: input 400 / output 200, ~600 tok/req — typical conversation
- Config B: input 4000 / output 1500, ~5500 tok/req — long context (attachments, merge, etc.)
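For readers who want to reproduce the shape of the test, here is a minimal closed-loop load generator. It is a sketch under assumptions: "30 in-flight per group" is read as a fixed pool of workers per group that each send the next request as soon as the previous one returns, and `fake_completion` is a hypothetical stand-in for the real router call so the example runs without any provider keys.

```python
import asyncio
import time


async def fake_completion(group: str) -> dict:
    """Hypothetical stand-in for the router call; returns a usage-like dict."""
    await asyncio.sleep(0.01)
    return {"group": group, "prompt_tokens": 400, "completion_tokens": 200}


async def worker(group, deadline, call, results):
    # Closed loop: issue the next request as soon as the previous one finishes.
    while time.monotonic() < deadline:
        results.append(await call(group))


async def run_load(groups, in_flight, duration_s, call=fake_completion):
    """Run `in_flight` workers per group until `duration_s` has elapsed."""
    results = []
    deadline = time.monotonic() + duration_s
    tasks = [
        asyncio.create_task(worker(g, deadline, call, results))
        for g in groups
        for _ in range(in_flight)
    ]
    await asyncio.gather(*tasks)
    return results


GROUPS = ["chat", "merge", "summarizer", "vision"]
# Scaled down for a quick local run; the post's run used 30 in-flight for 90 s.
results = asyncio.run(run_load(GROUPS, in_flight=3, duration_s=0.1))
print(len(results), sorted({r["group"] for r in results}))
```

A closed loop keeps concurrency constant, so slow providers automatically receive fewer requests per minute; an open-loop (fixed arrival rate) generator would stress queueing behavior differently.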
3. Config A: Conversation-Sized Requests

| min | rpm | in_tpm | out_tpm | total_tpm | 429 |
|---|---|---|---|---|---|
| 0 | 691 | 249,751 | 25,606 | 275,357 | 473 |
| 1 | 248 | 87,524 | 6,217 | 93,741 | 264 |
| 2 | 95 | 35,807 | 4,287 | 40,094 | 0 |
| 3 | 10 | 3,583 | 217 | 3,800 | 0 |
| 4 | 14 | 5,022 | 491 | 5,513 | 0 |
| 6 | 30 | 10,880 | 983 | 11,863 | 0 |

peak TPM observed: 275,357 (11% of theoretical 2,590,000)
peak RPM observed: 691 (54% of theoretical 1,280)
Peak TPM 275K (11% of theoretical), peak RPM 691 (54%). **RPM is the real bottleneck**; TPM utilization is only 11%. Matches intuition: 600 tok/req × 1,280 RPM = 768K TPM physical ceiling, far below the 2.59M theoretical — small requests always saturate RPM first.
Less obvious findings:
- m=0 → m=2 sharp decay: 691 → 248 → 95 RPM. The first 60s burnt through many slots' rolling 60s windows; recovery in minutes 2–3 was incomplete
- Vision wiped out: 737 × 429, 0 success. chat / merge fight over Gemini-flash + Groq-scout, and vision has no fallback chain (FALLBACK_CHAIN["vision"] = []), so it gets nothing
- Gemini severely under-utilized: 50 reqs / 21K tokens, well below the 25 RPM × 1.5min ≈ 37-req physical max. SmartRouter does pick Gemini, but occasional 503s + high latency drag down its score
- SambaNova's published RPD is 20, but the run measured 84 requests. Either the docs are stale, or SN counts at org level rather than per key

Per-slot contribution (top 8, plus three flagged slots)

| slot | reqs | tokens | p50_ms | note |
|---|---|---|---|---|
| cerebras/llama3.1-8b | 226 | 85,626 | 351 | |
| groq/llama-4-scout | 234 | 83,285 | 506 | |
| groq/llama-3.1-8b-instant | 109 | 41,247 | 395 | |
| groq/llama-3.3-70b | 85 | 32,202 | 306 | |
| sambanova/Meta-Llama-3.3 | 84 | 31,920 | 709 | |
| sambanova/DeepSeek-V3.2 | 55 | 21,120 | 863 | |
| sambanova/Llama-4-Maverick | 55 | 19,195 | 776 | |
| groq/openai/gpt-oss-20b | 38 | 21,778 | 442 | |
| gemini/gemini-2.5-flash | 25 | 13,099 | 2,250 | ← under-used |
| gemini/gemini-2.5-flash-lite | 25 | 8,481 | 607 | ← under-used |
| cerebras/qwen-3-235b | 8 | 2,820 | 226,931 | ← latency blowup |
4. Config B: Long-Context Requests

| min | rpm | in_tpm | out_tpm | total_tpm | 429 |
|---|---|---|---|---|---|
| 0 | 258 | 812,546 | 9,978 | 822,524 | 407 |
| 1 | 63 | 191,476 | 1,232 | 192,708 | 269 |
| 2 | 112 | 346,489 | 3,893 | 350,382 | 0 |
| 3 | 18 | 54,752 | 289 | 55,041 | 0 |
| 4 | 12 | 36,501 | 646 | 37,147 | 0 |
| 5 | 2 | 6,066 | 26 | 6,092 | 0 |
| 6 | 36 | 109,657 | 398 | 110,055 | 0 |

peak TPM observed: 822,524 (32% of theoretical 2,590,000)
peak RPM observed: 258 (20% of theoretical 1,280)
Peak TPM 822K (32% of theoretical, 3× over A); peak RPM 258 (lower because each request takes longer). Now **TPM is finally the bottleneck**, but still only 1/3 of theoretical.
Surprises:
- m=2 rebound to 350K TPM: m=0 burnt the fast slots' rolling windows; by m=2 those windows had fully reset and the long tail of large requests started landing. This is closer to the true sustained TPM
- SambaNova p50 = 104s: content goes through, but each request takes 100+ seconds. Real users would have timed out long before. SN's quota at high-concurrency large-request load is effectively 'unusable capacity'
- Cerebras qwen-3-235b p50 = 107s: same story as SN, looks like internal queueing
- Vision wiped out again: 676 × 429. Missing fallback shows up consistently in both A and B
- Gemini still under-used: 20 reqs / 64K tokens, while 250K TPM × 90s × 2 keys / 60 = 750K tokens physical max — 8.5% utilization
5. Three Structural Bottlenecks
5.1 Vision has no fallback chain
    # backend/services/llm_client.py
    FALLBACK_CHAIN: dict[str, list[str]] = {
        "chat": ["summarizer"],
        "merge": ["chat", "summarizer"],
        "summarizer": ["chat"],
        "vision": [],  # ← here
    }

Vision has only 2 slots: gemini-2.5-flash and groq/llama-4-scout. Both also serve chat and merge. Once chat / merge drain those slots, vision has nowhere to go. A single line, FALLBACK_CHAIN["vision"] = ["chat"], unlocks ~3M TPM of backup capacity (other vision-capable slots in the chat group).
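To make the effect of the empty chain concrete, here is a minimal sketch of fallback resolution. It is not SmartRouter's actual slot picker: `pick_slot`, the availability and capability predicates, and the `example/vision-model` slot are hypothetical, and the chat-group membership shown is assumed purely for illustration.

```python
def candidate_groups(group, chain):
    """Primary group first, then its fallbacks in order."""
    return [group, *chain.get(group, [])]


def pick_slot(group, chain, slots_by_group, is_available, is_capable=lambda slot: True):
    """Return the first available, capable slot along the fallback chain, else None."""
    for g in candidate_groups(group, chain):
        for slot in slots_by_group.get(g, []):
            if is_available(slot) and is_capable(slot):
                return slot
    return None


# Hypothetical group membership; "example/vision-model" is a placeholder slot.
slots_by_group = {
    "vision": ["gemini/gemini-2.5-flash", "groq/llama-4-scout"],
    "chat": ["cerebras/llama3.1-8b", "gemini/gemini-2.5-flash",
             "groq/llama-4-scout", "example/vision-model"],
}
drained = {"gemini/gemini-2.5-flash", "groq/llama-4-scout"}
vision_capable = {"gemini/gemini-2.5-flash", "groq/llama-4-scout", "example/vision-model"}


def is_available(slot):
    return slot not in drained


def is_capable(slot):
    return slot in vision_capable


current = {"chat": ["summarizer"], "merge": ["chat", "summarizer"],
           "summarizer": ["chat"], "vision": []}
patched = {**current, "vision": ["chat"]}

before = pick_slot("vision", current, slots_by_group, is_available, is_capable)
after = pick_slot("vision", patched, slots_by_group, is_available, is_capable)
print(before, after)  # None example/vision-model
```

The capability filter matters: without it, the patched chain would hand image requests to text-only chat slots.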
5.2 SambaNova latency explodes under load
In Config B, sambanova/Meta-Llama-3.3 had p50 = 104 seconds. That is p50, not p99: half the requests took 100+ seconds. SambaNova appears to queue requests internally under high concurrency (possibly insufficient inference nodes), but it doesn't return 429; it silently waits.
The current score function looks at quota dimensions only (rpm / tpm / rpd), not latency:
    def score(self, spec):
        rpm_r = max(0, spec.rpm - self.rpm_used) / spec.rpm
        tpm_r = max(0, spec.tpm - self.tpm_used) / spec.tpm
        rpd_r = max(0, spec.rpd - self.rpd_used) / spec.rpd
        return min(rpm_r, tpm_r, rpd_r)

Result: SN looks like it has ample quota (high RPD, high TPM), so SmartRouter keeps dispatching to it, and every request takes 100s. Fix direction: track p95 latency in UsageBucket, multiply score by a penalty when it crosses a threshold (e.g., 10s). The threshold needs care, though: cold-start models can take 5–10s on the first request and you don't want to permanently exile them.
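One way the fix direction could look is sketched below. Everything here is an assumption rather than existing SmartRouter code: the `LatencyTracker` class, the rolling-window size, the `min_samples` guard that protects cold-start slots, and the `threshold / p95` penalty shape are illustrative choices.

```python
import math
from collections import deque


class LatencyTracker:
    """Rolling window of recent request latencies, in seconds."""

    def __init__(self, window=50, min_samples=5):
        self.samples = deque(maxlen=window)
        self.min_samples = min_samples

    def record(self, latency_s):
        self.samples.append(latency_s)

    def p95(self):
        # Too few samples (e.g. one slow cold start): report nothing, apply no penalty.
        if len(self.samples) < self.min_samples:
            return None
        ordered = sorted(self.samples)
        idx = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank percentile
        return ordered[idx]


def latency_penalty(p95_s, threshold_s=10.0):
    """1.0 at or below the threshold, shrinking proportionally above it."""
    if p95_s is None or p95_s <= threshold_s:
        return 1.0
    return threshold_s / p95_s


def adjusted_score(quota_score, tracker, threshold_s=10.0):
    return quota_score * latency_penalty(tracker.p95(), threshold_s)


cold = LatencyTracker()
cold.record(8.0)  # a single slow first request
slow = LatencyTracker()
for _ in range(10):
    slow.record(104.0)  # queueing at roughly the p50 observed in Config B

cold_score = adjusted_score(0.9, cold)
slow_score = adjusted_score(0.9, slow)
print(cold_score, round(slow_score, 3))  # 0.9 0.087
```

Because the window is bounded, a slot that recovers sheds its penalty once the slow samples age out, so the down-ranking is not permanent.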
5.3 Gemini is under-weighted in score routing
Gemini has the largest per-slot capacity (250K TPM × 2 keys = 500K, 19% of system total), but in both runs it contributed only 2–3% of the total tokens. Quota was not the issue — Gemini's 250–1000 RPD is nowhere near depleted in 90s. SmartRouter's score simply ranked it low.
Three compounding factors:
- Occasional 503 (high demand) responses from Gemini: record_failure sets _last_fail_ts to now, and the score is multiplied by a penalty that decays with a 30-second half-life
- Gemini latency is structurally higher than Cerebras / Groq (2–3s vs 300–700ms)
- Hard 10 RPM cap: 10 requests in a minute and score = 0; the remaining 50s is wasted
The compounding effect: Gemini gets picked occasionally, runs slowly, scores poorly, and combined with the very low RPM cap, SmartRouter's 'pick the best' machinery routes most traffic to Cerebras / Groq — neither of which has Gemini's headroom. Improvement direction: faster decay on fail penalty for high-TPM slots (10s half-life instead of 30s), or capacity-tiered grace periods.
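The difference between a 30-second and a 10-second half-life is easy to quantify. The exact penalty formula in SmartRouter is not shown in this post, so the sketch below assumes one plausible form: a multiplier that is 0 at the moment of failure and recovers toward 1, closing half of the remaining gap every half-life.

```python
def fail_multiplier(elapsed_s: float, half_life_s: float) -> float:
    """Assumed form: score multiplier that recovers from 0 toward 1 after a failure."""
    return 1.0 - 0.5 ** (elapsed_s / half_life_s)


for elapsed in (0, 10, 30, 60):
    slow = fail_multiplier(elapsed, 30.0)
    fast = fail_multiplier(elapsed, 10.0)
    print(f"{elapsed:>2}s after 503: 30s half-life={slow:.3f}  10s half-life={fast:.3f}")
```

Under this assumed form, 30 seconds after a 503 the slot is back to 50% of its score with the 30-second half-life but 87.5% with the 10-second one, which is the kind of gap that would let a high-TPM slot rejoin the rotation within the same minute.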
6. How to Read These Numbers
'800K TPM peak' is not 'we can sustain 800K TPM in production.' Three translations are needed:
- Burst vs sustained: m=0 is the 60-second burst peak; m=2 onward is closer to true sustained. Config B's m=2 of 350K TPM is the 'sustainable for >1 minute' ceiling
- Test scenario vs real traffic: the test drives all four groups simultaneously, but real traffic is 90%+ chat. Merge / summarizer / vision rarely run hot at once
- Usable vs nominal capacity: SambaNova nominally contributed 318K tokens under load, but p50 = 104s — to real users, that's effectively unusable capacity
Real usable estimate:
actual sustainable TPM
≈ 350K system steady-state (Config B, m=2)
× 0.7 (deduct SambaNova 'unusable' share)
× 0.9 (after fixing vision fallback)
≈ 220K TPM sustained

at ~2.5K tok/msg: ~88 msg/min, i.e. ~5,300 msg/hour sustained
at 6 turn/user/day, peak hour = 20% of daily: ~440 peak-hour DAU
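The first part of that estimate, from steady-state TPM to message throughput, can be checked with a few lines of arithmetic. The inputs are the figures quoted above; the variable names are illustrative, and the discount factors are the post's own rough estimates rather than measured values.

```python
steady_state_tpm = 350_000   # Config B, minute 2
sambanova_discount = 0.7     # remove the share served at unusable latency
vision_factor = 0.9          # after fixing the vision fallback
tokens_per_msg = 2_500

sustained_tpm = steady_state_tpm * sambanova_discount * vision_factor
msgs_per_min = sustained_tpm / tokens_per_msg
msgs_per_hour = msgs_per_min * 60

print(round(sustained_tpm))   # 220500
print(round(msgs_per_min))    # 88
print(round(msgs_per_hour))   # 5292
```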
This looks much smaller than the 9,000–15,000 DAU figure in free-tier-capacity — but that one assumed users spread evenly across 24 hours and counted total daily quota. This number is peak-hour concurrent users the system can handle without degradation. Both are correct; they answer different questions.
7. Next Steps
- Now (one-line change): FALLBACK_CHAIN["vision"] = ["chat"], paired with validation of which chat-group slots are actually vision-capable
- Short-term: add latency tracking to UsageBucket + a latency dimension in score, fixing the SambaNova queueing problem
- Short-term: shorten fail-penalty decay for high-TPM slots like Gemini from 30s to 10s — should noticeably increase its actual utilization
- Mid-term: run a daily check to remeasure SambaNova's real RPD, and update ModelSpec from published 20 to the measured value (if docs really are stale)
- Mid-term: group-typed score weighting — latency-sensitive groups (chat) prefer fast slots; batch-style groups (merge) prefer high-TPM slots
- Long-term: local quantized inference (llama.cpp) as ultimate fallback; a paid Gemini key in a separate GCP project to break past the 250K shared-project ceiling
A few observations deserve their own write-ups: 'why SmartRouter knows a slot is slow but can't down-rank it' (latency-blindness as a structural blind spot in the score), 'how big the gap is between provider documentation and actual enforced RPD' (compliance-as-published vs compliance-as-enforced), and the most counter-intuitive one — 'the bigger a slot's theoretical capacity, the lower its actual utilization tends to be' (the Gemini paradox). Next time.