SmartRouter: Multi-Provider Free Tier Stacking with Intelligent Routing
Deeppin's SmartRouter stacks free-tier quotas from 5 providers (Groq, Cerebras, SambaNova, Gemini, OpenRouter) across 17 models. Real-time usage tracking and proactive scoring select the best (provider, model, key) slot before a 429 error occurs.
A single LLM provider's free tier is limited — Groq allows 30 requests per minute, Gemini 10. But quotas across different providers are completely independent. SmartRouter's core idea: pool all providers' quotas into a unified resource pool, using real-time usage tracking to pick the optimal path.
§ 01Part 1 — Aggregate capacity: how many GPUs would this take to self-host
The numbers first. At the current production setup (2 Groq×6 + 3 Cerebras×2 + 3 SambaNova×3 + 2 Gemini×2 + 3 OpenRouter×4 = 13 keys across 43 slots), running full tilt:
- Peak throughput ~2.52M TPM ≈ 42,000 tokens/sec sustained
- Daily budget ~236M tokens/day — roughly 94K messages at 2.5K tokens each
- Largest callable models: 671B MoE / 37B active (SambaNova's DeepSeek-V3.2), 400B MoE / 17B active (SambaNova's Llama-4-Maverick)
Hardware equivalent: 42K tok/s sustained is roughly 1.5× what a fully loaded 8×H100 DGX delivers, i.e. ~12 saturated H100s. That is ~$600K of hardware plus ~$15K/month in colo/power/cooling/ops, or ~$30K/month all-in with the hardware amortized over 3 years. Deeppin's actual monthly bill: $0. Five stacked free tiers happen to cover exactly this throughput band.
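The arithmetic behind these figures can be checked in a few lines. This is only a sanity check on the numbers quoted above; the variable names are illustrative, and it assumes the ~$30K/month figure combines amortized hardware with the monthly operating cost.

```python
# Figures quoted in the text above; names are illustrative only.
PEAK_TPM = 2_520_000          # aggregate tokens per minute across all slots
HARDWARE_USD = 600_000        # ~12 H100s
OPEX_USD_PER_MONTH = 15_000   # colo, power, cooling, ops
AMORTIZATION_MONTHS = 36      # 3 years

tokens_per_second = PEAK_TPM / 60
# Assumption: "all-in" means amortized hardware plus monthly opex.
monthly_cost = HARDWARE_USD / AMORTIZATION_MONTHS + OPEX_USD_PER_MONTH

print(f"{tokens_per_second:,.0f} tok/s")
print(f"${monthly_cost:,.0f}/month")
```

The second figure comes out slightly above $30K, which is consistent with the rounded number in the text.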
Model diversity is a separate story. The largest models in the pool are 671B MoE (DeepSeek-V3.2) and 400B MoE (Llama-4-Maverick). Self-hosting a model at that scale needs 16+ H100s just to load weights at FP8 ($400K+ of hardware for that single model), or rents at $50-100/hour. Here it's free and on-demand.
§ 02Part 2 — How to scale (and why you probably won't need to)
If this pool ever actually saturates, the scale-out path from cheap to expensive:
2.1 Horizontal scale: add keys
Each provider's API_KEYS env var in backend/.env is a JSON array, and `llm_client.py`'s `_load_keys` parses it; each key expands into its own set of slots. Adding one Groq key gives you 6 more slots (one per model), so going from 2 keys to 3 raises Groq's contribution to chat / merge / summarizer throughput by 50%.
# Example: Groq from 2 keys to 3 keys
# backend/.env
GROQ_API_KEYS=["gsk_key1","gsk_key2","gsk_key3"]
# At startup SmartRouter sees 3 keys × 6 models = 18 slots
# RPM/TPM/RPD/TPD scale linearly by 50% (was 2 keys → 12 slots)
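As a rough illustration of the key-to-slot expansion, here is a minimal sketch. It is not the actual `_load_keys` from `llm_client.py`; the helper names `load_keys` and `expand_slots` are hypothetical, and the model list is simply the six Groq models that appear in the tables later in this post.

```python
import json
import os


def load_keys(env_var: str) -> list[str]:
    """Parse a JSON-array env var; an unset or empty variable yields no keys."""
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        return []
    keys = json.loads(raw)
    if not isinstance(keys, list) or not all(isinstance(k, str) for k in keys):
        raise ValueError(f"{env_var} must be a JSON array of strings")
    return keys


def expand_slots(provider: str, models: list[str], keys: list[str]) -> list[tuple[str, str, str]]:
    """One slot per (model, key) pair, so slot count = keys × models."""
    return [(provider, model, key) for key in keys for model in models]


# Hypothetical demo values mirroring the .env example above.
os.environ["GROQ_API_KEYS"] = '["gsk_key1","gsk_key2","gsk_key3"]'
GROQ_MODELS = [
    "openai/gpt-oss-120b",
    "meta-llama/llama-4-scout-17b-16e-instruct",
    "llama-3.3-70b-versatile",
    "qwen/qwen3-32b",
    "openai/gpt-oss-20b",
    "llama-3.1-8b-instant",
]

slots = expand_slots("groq", GROQ_MODELS, load_keys("GROQ_API_KEYS"))
print(len(slots))
```

An unconfigured provider returns an empty key list and therefore contributes zero slots, which matches the behavior described for missing env vars.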
Constraint: each key requires a new account (Groq/SambaNova want email + phone, Gemini wants a new Google account, OpenRouter a new signup). OpenRouter's :free tier has upstream global throttling, so additional keys there see diminishing returns.
2.2 Vertical scale: pay to upgrade
- OpenRouter: add $10 credit → RPD per free model jumps from 50 to 1000 (20×), one-time payment, permanent
- Groq paid tier → RPM/TPM go up by orders of magnitude, billed per token
- Gemini paid tier → RPD ceiling from 1000 to 10000+, billed per token
- SmartRouter doesn't distinguish free from paid slots — bump the ModelSpec's rpm/tpm/rpd and routing logic stays identical
2.3 Add a new provider
LiteLLM is compatible with 100+ providers. Add a new row to `llm_client.py`'s ModelSpec list and a matching env read in `_load_keys`. DeepInfra (cheap 70B inference), Together (400B+ models), Fireworks, Anthropic, OpenAI itself — all plug in without further changes.
But this is a solo project with single-digit DAU, and the current architecture's theoretical capacity is roughly 1000× actual use. The scale-out path is documented here because knowing where the ceiling is, and that breaking through it takes 5 minutes, matters more than actually scaling. The rest of this post explains how the current structure supports this order of magnitude in the first place.
§ 03Part 3 — Why not LiteLLM Router
The previous approach used LiteLLM's built-in Router, which is reactive — send request → get 429 → retry next. Each 429 wastes a round-trip (200-500ms), and users notice the stutter. SmartRouter switches to proactive selection: score all slots before sending, drastically reducing 429 occurrences.
§ 04Part 4 — Data structures
SmartRouter has three layers of data structures:
ModelSpec: model specification (static config)
  provider: "groq" | "cerebras" | "sambanova" | "gemini" | "openrouter"
  model_id: "llama-3.3-70b-versatile"
  rpm / tpm / rpd / tpd: rate limits
  groups: ["chat", "merge"]  ← which groups it belongs to

Slot = ModelSpec + API Key + UsageBucket
  A slot is the smallest routing unit
  e.g.: (groq/llama-3.3-70b, gsk_key1, usage_bucket)
  e.g.: (groq/llama-3.3-70b, gsk_key2, usage_bucket)  ← same model, different key = different slot

UsageBucket: usage tracking (independent per slot)
  rpm_used / tpm_used → auto-reset every 60 seconds
  rpd_used / tpd_used → reset on natural-day rollover in spec.reset_tz (Gemini=PT, others=UTC)
  fail_count / last_fail_ts → failure penalty
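One way to picture the three layers is as plain dataclasses. This is a sketch under assumptions, not the real definitions in `llm_client.py`: the field names follow the outline above, while the defaults and the `frozen=True` choice are illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelSpec:
    """Static config: shared by every slot that serves this model."""
    provider: str
    model_id: str
    rpm: int
    tpm: int
    rpd: int
    tpd: int
    groups: tuple[str, ...]
    reset_tz: str = "UTC"  # assumed default; the text says Gemini uses Pacific time


@dataclass
class UsageBucket:
    """Mutable counters: one independent instance per slot."""
    rpm_used: int = 0
    tpm_used: int = 0
    rpd_used: int = 0
    tpd_used: int = 0
    fail_count: int = 0
    last_fail_ts: float = 0.0


@dataclass
class Slot:
    """Smallest routing unit: spec + key + its own usage bucket."""
    spec: ModelSpec
    api_key: str
    usage: UsageBucket = field(default_factory=UsageBucket)


spec = ModelSpec("groq", "llama-3.3-70b-versatile",
                 rpm=30, tpm=12_000, rpd=1_000, tpd=100_000,
                 groups=("chat", "merge"))
slot_a = Slot(spec, "gsk_key1")
slot_b = Slot(spec, "gsk_key2")  # same model, different key = different slot

slot_a.usage.rpm_used += 1       # only slot_a's bucket changes
```

Using `default_factory` gives each slot a fresh bucket, so two keys on the same model never share counters.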
§ 05Part 5 — Five providers × 17 models — categorized and ranked
17 models split into 4 use-case groups (chat / merge / summarizer / vision); the chat group alone covers 15 of them. Within each group, models are ranked by capability for that task: chat by model size, merge by the maximum input it can fit (TPM-dominated), summarizer by output speed, vision by multimodal quality. Rate limits below are per slot; multiply by key count for total contribution.
chat (main conversation) — sorted by model size, descending
#   Provider/Model                                  Size                RPM  TPM   RPD    TPD
──────────────────────────────────────────────────────────────────────────────────────────────
1   sambanova/DeepSeek-V3.2                         671B MoE / 37B act  20   100K  20     200K
2   sambanova/Llama-4-Maverick-17B-128E-Instruct    400B MoE / 17B act  20   100K  20     200K
3   cerebras/qwen-3-235b-a22b-instruct-2507         235B MoE / 22B act  30   60K   14.4K  1M
4   openrouter/nvidia/nemotron-3-super-120b:free    120B MoE            20   10K   50     2M
5   openrouter/openai/gpt-oss-120b:free             120B                20   10K   50     2M
6   groq/openai/gpt-oss-120b                        120B                30   8K    1K     200K
7   groq/meta-llama/llama-4-scout-17b-16e-instruct  109B MoE / 17B act  30   30K   1K     500K
8   openrouter/qwen/qwen3-next-80b-a3b:free         80B MoE / 3B act    20   10K   50     2M
9   sambanova/Meta-Llama-3.3-70B-Instruct           70B                 20   100K  20     200K
10  groq/llama-3.3-70b-versatile                    70B                 30   12K   1K     100K
11  openrouter/meta-llama/llama-3.3-70b:free        70B                 20   10K   50     2M
12  groq/qwen/qwen3-32b                             32B                 60   6K    1K     500K
13  groq/openai/gpt-oss-20b                         20B                 30   8K    1K     200K
14  gemini/gemini-2.5-flash                         Small but tuned     10   250K  250    50M
15  gemini/gemini-2.5-flash-lite                    Smallest            15   250K  1K     50M
merge (combined output) — sorted by max single-call merge size
Merge calls often pack 5-15K tokens at once (multiple sub-threads combined). TPM dominates — models with low TPM saturate the per-minute quota in a single merge.
#   Provider/Model                            Max single merge    RPM  TPM   RPD    TPD
────────────────────────────────────────────────────────────────────────────────────────
1   gemini/gemini-2.5-flash                   ~250K tokens        10   250K  250    50M
2   sambanova/Meta-Llama-3.3-70B-Instruct     ~100K tokens        20   100K  20     200K
3   sambanova/Llama-4-Maverick-17B-128E       ~100K tokens        20   100K  20     200K
4   sambanova/DeepSeek-V3.2                   ~100K tokens        20   100K  20     200K
5   cerebras/qwen-3-235b-a22b-instruct-2507   ~60K tokens         30   60K   14.4K  1M
6   groq/meta-llama/llama-4-scout-17b-16e     ~30K tokens         30   30K   1K     500K
7   groq/llama-3.3-70b-versatile              ~12K tokens (full)  30   12K   1K     100K
summarizer (summaries / classification / formatting) — sorted by speed
Summarizer is for lightweight internal tasks (compact summaries, intent classification, JSON formatting). Inputs and outputs are short but calls are frequent — speed matters more than size.
#   Provider/Model                  Speed (tok/s)         RPM  TPM   RPD    TPD
────────────────────────────────────────────────────────────────────────────────
1   cerebras/llama3.1-8b            ~3,000 (fastest)      30   60K   14.4K  1M
2   gemini/gemini-2.5-flash-lite    ~500, very high TPM   15   250K  1K     50M
3   groq/llama-3.1-8b-instant       ~750                  30   6K    14.4K  500K
vision (image understanding) — sorted by multimodal quality
#   Provider/Model                          Multimodal note           RPM  TPM   RPD    TPD
────────────────────────────────────────────────────────────────────────────────────────────
1   gemini/gemini-2.5-flash                 Native Google, best       10   250K  250    50M
2   groq/meta-llama/llama-4-scout-17b-16e   Llama 4 multimodal, fast  30   30K   1K     500K
§ 06Part 6 — Scoring mechanism
Each slot's availability score is computed in real-time by its UsageBucket:
import time

def score(self, spec: ModelSpec) -> float:
    # Remaining ratio per dimension
    rpm_r = (spec.rpm - self.rpm_used) / spec.rpm
    tpm_r = (spec.tpm - self.tpm_used) / spec.tpm
    rpd_r = (spec.rpd - self.rpd_used) / spec.rpd
    # Minimum (bucket effect): the tightest dimension determines availability
    s = min(rpm_r, tpm_r, rpd_r)
    # Penalty for recent failures (30-second half-life)
    if self._fail_count > 0:
        # Assumption: failure timestamps use the same monotonic clock as the minute window
        now = time.monotonic()
        elapsed = now - self._last_fail_ts
        penalty = 0.5 ** (elapsed / 30)
        s *= (1 - penalty)
    return s  # 0 = exhausted, 1.0 = full capacity

Scoring granularity is provider + model + key. The same Groq key's llama-70b and qwen3-32b have independent usage buckets, because Groq's rate limits are per-model per-key.
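A worked example may make the scoring formula more concrete. The function below is a standalone restatement of the same idea with made-up usage numbers; the name `availability_score`, the `seconds_since_fail` parameter, and the clamp at zero are assumptions added for illustration and are not part of the original method.

```python
def availability_score(rpm, rpm_used, tpm, tpm_used, rpd, rpd_used,
                       seconds_since_fail=None):
    """Tightest remaining ratio, scaled down after a recent failure."""
    s = min((rpm - rpm_used) / rpm,
            (tpm - tpm_used) / tpm,
            (rpd - rpd_used) / rpd)
    s = max(s, 0.0)  # assumption: treat over-quota as fully exhausted
    if seconds_since_fail is not None:
        # 30-second half-life: the penalty halves every 30 seconds
        s *= 1 - 0.5 ** (seconds_since_fail / 30)
    return s


# Hypothetical slot: 15/30 RPM used, 9K/12K TPM used, 100/1000 RPD used.
healthy = availability_score(30, 15, 12_000, 9_000, 1_000, 100)
just_failed = availability_score(30, 15, 12_000, 9_000, 1_000, 100,
                                 seconds_since_fail=0)
half_recovered = availability_score(30, 15, 12_000, 9_000, 1_000, 100,
                                    seconds_since_fail=30)

print(healthy, just_failed, half_recovered)
```

Here TPM is the tightest dimension (25% remaining), so it sets the score even though RPM and RPD have more headroom. A failure zeroes the score at first, and the slot recovers half of its score after 30 seconds.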
§ 07Part 7 — Selection and fallback flow
The complete request routing flow:
router.completion(group="chat", messages=...)
│
├── Step 1: Get all slots from the chat group
│ Sort by score descending, pick highest for the request
│ (add small random jitter to avoid thundering herd)
│
├── Success → return result
│
├── Failure (429/5xx) → mark failure, try next slot
│ ... all chat slots failed ...
│
├── Enter fallback chain
│ chat → summarizer
│ merge → chat → summarizer
│ summarizer → chat
│
└── All exhausted → pick soonest-recovery slot
slot A: rpm full, 45 seconds until reset
slot B: rpd full, 8 hours until reset
→ pick slot A
§ 08Part 8 — Time-window auto-reset
UsageBucket counters: minute window is a rolling 60s based on time.monotonic(); day window is aligned to the natural date boundary in spec.reset_tz (Gemini=America/Los_Angeles, others=UTC), zeroing on rollover so it matches the provider's actual 00:00 reset rather than drifting from process start.
- rpm_used / tpm_used: auto-reset when ≥60 seconds since last reset
- rpd_used / tpd_used: zero out when the provider's timezone rolls to a new date
- Checked automatically before every record_request() and score() call
- No timers or background threads needed — lazy reset, zero overhead
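The lazy-reset behavior described in the list above can be sketched as follows. This is an assumption-laden illustration rather than the real UsageBucket: the class name `UsageWindows`, the injectable `clock` and `today` parameters (added so the demo can simulate time), and the `_maybe_reset` helper are all hypothetical.

```python
import time
from datetime import date, datetime
from zoneinfo import ZoneInfo


class UsageWindows:
    """Minute and day counters that reset lazily, only when touched."""

    def __init__(self, reset_tz="UTC", clock=time.monotonic, today=None):
        self._clock = clock
        if today is None:
            tz = ZoneInfo(reset_tz)  # e.g. "America/Los_Angeles" for Gemini
            today = lambda: datetime.now(tz).date()
        self._today = today
        self._minute_start = self._clock()
        self._day = self._today()
        self.rpm_used = self.tpm_used = 0
        self.rpd_used = self.tpd_used = 0

    def _maybe_reset(self):
        now = self._clock()
        if now - self._minute_start >= 60:       # minute window elapsed
            self.rpm_used = self.tpm_used = 0
            self._minute_start = now
        current_day = self._today()
        if current_day != self._day:             # date rolled over in reset_tz
            self.rpd_used = self.tpd_used = 0
            self._day = current_day

    def record_request(self, tokens):
        self._maybe_reset()                      # no timers, no threads
        self.rpm_used += 1
        self.tpm_used += tokens
        self.rpd_used += 1
        self.tpd_used += tokens


# Demo with a fake clock and a fake calendar.
fake_now, fake_day = [0.0], [date(2025, 1, 1)]
w = UsageWindows(clock=lambda: fake_now[0], today=lambda: fake_day[0])
w.record_request(tokens=500)
fake_now[0] = 61.0                               # more than a minute later
w.record_request(tokens=200)
after_minute = (w.rpm_used, w.tpm_used, w.rpd_used, w.tpd_used)
fake_day[0] = date(2025, 1, 2)                   # provider's date rolls over
w.record_request(tokens=100)
after_day = (w.rpm_used, w.tpm_used, w.rpd_used, w.tpd_used)
```

Because the check runs inside the call that needs fresh counters, an idle slot costs nothing between requests.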
§ 09Part 9 — Proactive vs reactive
This is the core difference between SmartRouter and traditional routing:
Old approach (LiteLLM Router, reactive):
  Send request → get 429 → retry next → 429 again → retry again
  Each 429 wastes 200-500ms round-trip
  User experience: occasional noticeable stutter

New approach (SmartRouter, proactive):
  Score all slots → pick best → send request
  Slots with score=0 are never selected, 429 probability drops sharply
  User experience: nearly invisible routing switches
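The proactive path, combined with the fallback chain from Part 7, could be sketched roughly as below. This is not SmartRouter's actual code: `Candidate`, `ranked_slots`, the jitter width of 0.01, and the simplification that each slot belongs to exactly one group are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

# Fallback chain as listed in Part 7.
FALLBACK = {
    "chat": ["summarizer"],
    "merge": ["chat", "summarizer"],
    "summarizer": ["chat"],
}


@dataclass
class Candidate:
    name: str
    group: str                  # simplification: one group per slot
    score: float                # 0 = exhausted, 1.0 = full capacity
    seconds_until_reset: float


def ranked_slots(candidates, group, rng=random):
    """Order in which a caller would try slots, best first."""
    chain = [group, *FALLBACK.get(group, [])]
    order = []
    for g in chain:
        live = [c for c in candidates if c.group == g and c.score > 0]
        # Small jitter so concurrent requests do not all pick the same slot.
        live.sort(key=lambda c: c.score + rng.uniform(0, 0.01), reverse=True)
        order.extend(live)
    if not order:
        # Everything is exhausted: fall back to the soonest-recovering slot.
        pool = [c for c in candidates if c.group in chain]
        if pool:
            order.append(min(pool, key=lambda c: c.seconds_until_reset))
    return order


pool = [
    Candidate("A", "chat", 0.8, 10),
    Candidate("B", "chat", 0.3, 20),
    Candidate("C", "chat", 0.0, 45),        # exhausted, never tried
    Candidate("D", "summarizer", 0.5, 5),   # reached only via fallback
]
normal_order = [c.name for c in ranked_slots(pool, "chat")]

exhausted = [
    Candidate("A", "chat", 0.0, 45),        # rpm full, 45 seconds until reset
    Candidate("B", "chat", 0.0, 8 * 3600),  # rpd full, 8 hours until reset
]
last_resort = [c.name for c in ranked_slots(exhausted, "chat")]
```

A caller would walk this list, sending the request to each slot in turn and stopping at the first success, so a 429 or 5xx only costs one step down the ranking.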
§ 10Part 10 — Deployment configuration
Enabling any of the supported providers requires only its environment variable; SmartRouter picks up the keys at startup:
# backend/.env: each value is a JSON array, supports multi-key stacking
GROQ_API_KEYS=["gsk_key1", "gsk_key2"]
CEREBRAS_API_KEYS=["csk_key1", "csk_key2"]
SAMBANOVA_API_KEYS=["sk_key1", "sk_key2"]
GEMINI_API_KEYS=["AIza_key1"]
OPENROUTER_API_KEYS=["sk-or-v1-key1", "sk-or-v1-key2"]
# Unconfigured providers produce zero slots, no impact on operation
# Restart after adding keys to take effect
GitHub Actions deployment automatically syncs keys to the server via Secrets — no manual SSH needed.
§ 11Part 11 — Health check
GET /health/providers/keys validates each (provider, key) against /v1/models and detects catalog drift without burning any LLM quota; the daily CI runs this endpoint. GET /health/providers/full fires a real completion against every slot, which consumes quota, so it is reserved for manual deep checks. A sample response from the keys endpoint:
GET /health/providers/keys
{
  "total": 13,
  "ok": 12,
  "failed": 1,
  "results": [
    {"provider": "groq", "key": "gsk_abc1...", "ok": true, "models_seen": 6},
    {"provider": "cerebras", "key": "csk_xyz...", "ok": true, "models_seen": 2},
    {"provider": "sambanova", "key": "sk_...", "ok": false, "error": "401 Unauthorized"},
    ...
  ]
}
§ 12Part 12 — Actual capacity estimate
Current production config: 2 Groq + 3 Cerebras + 3 SambaNova + 2 Gemini + 3 OpenRouter = 13 keys, 43 slots total (weighted by per-provider model count: 2×6 + 3×2 + 3×3 + 2×2 + 3×4).
- Chat group: 38 slots, aggregate ~920 RPM, ~2.33M TPM, ~56.5K RPD, ~232M TPD
- Merge group: 18 slots, aggregate ~410 RPM, ~1.66M TPM, ~47.9K RPD, ~106M TPD
- Summarizer group: 7 slots, aggregate ~180 RPM, ~692K TPM, ~74K RPD, ~104M TPD
- Vision group: 4 slots, aggregate ~80 RPM, ~560K TPM, ~2.5K RPD, ~101M TPD
- Global peak TPM: ~2.52M (≈ 42,000 tok/s sustained, roughly 1.5× a fully loaded 8×H100 DGX / ~12 saturated H100s)
- Global daily quota: ~236M tokens/day; at 2.5K tokens/message, supports ~94K messages/day
- Equivalent DAU: ~9,000-15,000 active users (6 conversation turns per user per day)
Compared to the original Groq-only setup (300-600 DAU), five-provider stacking lifts capacity by ~25×. The bottleneck is chat-group RPD (56.5K/day, of which Cerebras's three slots contribute 77%): OpenRouter's free tier is only 50 RPD per model per key (4 models × 50 = 200 RPD/key), SambaNova is 20 RPD per model per key, and Gemini 2.5 Flash is 250 RPD per key. Highest-ROI next step: add $10 credit on OpenRouter (the 12 slots jump from 50 to 1000 RPD each, a one-time cost for a permanent 20× boost).
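The global figures above can be reproduced from the per-slot limits in the Part 5 tables. The snippet below is a back-of-the-envelope check, not part of SmartRouter; the `POOL` structure and variable names are made up for the calculation.

```python
# provider: (key_count, [(tpm, tpd) for each model]) taken from the Part 5 tables
POOL = {
    "groq": (2, [(8_000, 200_000), (30_000, 500_000), (12_000, 100_000),
                 (6_000, 500_000), (8_000, 200_000), (6_000, 500_000)]),
    "cerebras": (3, [(60_000, 1_000_000)] * 2),
    "sambanova": (3, [(100_000, 200_000)] * 3),
    "gemini": (2, [(250_000, 50_000_000)] * 2),
    "openrouter": (3, [(10_000, 2_000_000)] * 4),
}

total_keys = sum(keys for keys, _ in POOL.values())
total_slots = sum(keys * len(models) for keys, models in POOL.values())
peak_tpm = sum(keys * sum(tpm for tpm, _ in models) for keys, models in POOL.values())
daily_tokens = sum(keys * sum(tpd for _, tpd in models) for keys, models in POOL.values())
messages_per_day = daily_tokens // 2_500   # 2.5K tokens per message

print(total_keys, total_slots, peak_tpm, daily_tokens, messages_per_day)
```

Changing a key count or a limit in `POOL` shows immediately how much an extra key or a paid tier would move the totals.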