SmartRouter: Multi-Provider Free Tier Stacking with Intelligent Routing

Deeppin's SmartRouter stacks free-tier quotas from 5 providers (Groq, Cerebras, SambaNova, Gemini, OpenRouter) across 15 models. Real-time usage tracking and proactive scoring select the best (provider, model, key) slot before 429 errors occur.

2026-04-16 · 38 min read · SmartRouter · multi-provider · cost-optimization

A single LLM provider's free tier is limited — Groq allows 30 requests per minute, Gemini 10. But quotas across different providers are completely independent. SmartRouter's core idea: pool all providers' quotas into a unified resource pool, using real-time usage tracking to pick the optimal path.

§ 01 Part 1: Aggregate capacity: how many GPUs would this take to self-host

The numbers first. At the current production setup (2 Groq×6 + 3 Cerebras×2 + 3 SambaNova×3 + 2 Gemini×2 + 3 OpenRouter×4 = 13 keys across 43 slots), running full tilt:

  • Peak throughput ~2.52M TPM ≈ 42,000 tokens/sec sustained
  • Daily budget ~236M tokens/day — roughly 94K messages at 2.5K tokens each
  • Largest callable models: 671B MoE / 37B active (SambaNova's DeepSeek-V3.2), 400B MoE / 17B active (SambaNova's Llama-4-Maverick)

Hardware equivalent: 42K tok/s sustained is roughly 1.5× what a fully loaded 8×H100 DGX delivers, i.e. about 12 saturated H100s. That is ~$600K of hardware plus ~$15K/month in colo/power/cooling/ops, or ~$30K/month all-in with the hardware amortized over 3 years. Deeppin's actual monthly bill: $0. Five stacked free tiers happen to cover this throughput band.

Model diversity is a separate story. The largest models in the pool are 671B MoE (DeepSeek-V3.2) and 400B MoE (Llama-4-Maverick). Self-hosting a model at that scale needs 16+ H100s just to load weights at FP8 ($400K+ of hardware for that single model), or rents at $50-100/hour. Here it's free and on-demand.

Note: 42K tok/s is the theoretical peak with every slot maxed simultaneously. Real-world load is uneven: a single concurrent burst hits 5-10K tok/s, still far beyond what a solo dev or small team needs. Deeppin's current DAU is single-digit, so headroom utilization is under 1%.

§ 02 Part 2: How to scale (and why you probably won't need to)

If this pool ever actually saturates, the scale-out path from cheap to expensive:

2.1 Horizontal scale: add keys

Each provider's API_KEYS env var in backend/.env is a JSON array, and `llm_client.py`'s `_load_keys` parses it; each key expands into its own set of slots. Adding one Groq key gives you 6 more slots (one per model), which raises Groq's share of chat / merge / summarizer throughput by 50% when going from 2 keys to 3.

# Example: Groq from 2 keys to 3 keys
# backend/.env
GROQ_API_KEYS=["gsk_key1","gsk_key2","gsk_key3"]

# At startup SmartRouter sees 3 keys × 6 models = 18 slots
# RPM/TPM/RPD/TPD scale linearly by 50% (was 2 keys → 12 slots)
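
The key-to-slot expansion described above can be sketched as follows. This is a minimal illustration, not the actual `_load_keys` in `llm_client.py`: the function names `load_keys` and `expand_slots` are hypothetical, and the model list is simply the six Groq models named in the tables of Part 5.

```python
import json
import os

# The six Groq models that appear in the Part 5 tables.
GROQ_MODELS = [
    "openai/gpt-oss-120b",
    "meta-llama/llama-4-scout-17b-16e-instruct",
    "llama-3.3-70b-versatile",
    "qwen/qwen3-32b",
    "openai/gpt-oss-20b",
    "llama-3.1-8b-instant",
]


def load_keys(env_var: str) -> list[str]:
    """Parse a JSON-array env var; a missing or empty variable yields no keys."""
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        return []
    keys = json.loads(raw)
    if not isinstance(keys, list):
        raise ValueError(f"{env_var} must be a JSON array of strings")
    return [k for k in keys if isinstance(k, str) and k]


def expand_slots(provider: str, models: list[str], keys: list[str]) -> list[tuple[str, str, str]]:
    """Every (model, key) pair becomes its own (provider, model, key) slot."""
    return [(provider, model, key) for key in keys for model in models]


os.environ["GROQ_API_KEYS"] = '["gsk_key1","gsk_key2","gsk_key3"]'
slots = expand_slots("groq", GROQ_MODELS, load_keys("GROQ_API_KEYS"))
print(len(slots))  # 3 keys × 6 models = 18
```

An unset variable returns an empty list, so a provider without keys contributes no slots.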

Constraint: each key requires a new account (Groq/SambaNova want email + phone, Gemini wants a new Google account, OpenRouter a new signup). OpenRouter's :free tier has upstream global throttling, so additional keys there see diminishing returns.

2.2 Vertical scale: pay to upgrade

  • OpenRouter: add $10 credit → per-slot RPD jumps from 50 to 1000 (20×), one-time payment, permanent
  • Groq paid tier → RPM/TPM go up by orders of magnitude, billed per token
  • Gemini paid tier → RPD ceiling from 1000 to 10000+, billed per token
  • SmartRouter doesn't distinguish free from paid slots — bump the ModelSpec's rpm/tpm/rpd and routing logic stays identical

2.3 Add a new provider

LiteLLM is compatible with 100+ providers. Add a new row to `llm_client.py`'s ModelSpec list and a matching env read in `_load_keys`. DeepInfra (cheap 70B inference), Together (400B+ models), Fireworks, Anthropic, OpenAI itself — all plug in without further changes.

But this is a solo project with single-digit DAU, and the current architecture's theoretical capacity is about 1000× actual use. The scale-out path is documented here because knowing where the ceiling is, and that breaking through it takes 5 minutes, matters more than actually scaling. The rest of this post explains how the current structure supports this order of magnitude in the first place.

§ 03 Part 3: Why not LiteLLM Router

The previous approach used LiteLLM's built-in Router, which is reactive — send request → get 429 → retry next. Each 429 wastes a round-trip (200-500ms), and users notice the stutter. SmartRouter switches to proactive selection: score all slots before sending, drastically reducing 429 occurrences.

§ 04 Part 4: Data structures

SmartRouter has three layers of data structures:

ModelSpec — Model specification (static config)
  provider: "groq" | "cerebras" | "sambanova" | "gemini" | "openrouter"
  model_id: "llama-3.3-70b-versatile"
  rpm / tpm / rpd / tpd: rate limits
  groups: ["chat", "merge"]  ← which groups it belongs to

Slot = ModelSpec + API Key + UsageBucket
  A slot is the smallest routing unit
  e.g.: (groq/llama-3.3-70b, gsk_key1, usage_bucket)
  e.g.: (groq/llama-3.3-70b, gsk_key2, usage_bucket)  ← same model, different key = different slot

UsageBucket — Usage tracking (independent per slot)
  rpm_used / tpm_used → auto-reset every 60 seconds
  rpd_used / tpd_used → reset on natural-day rollover in spec.reset_tz (Gemini=PT, others=UTC)
  fail_count / last_fail_ts → failure penalty
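
The three layers above map naturally onto dataclasses. The sketch below is an assumption about shape only: the field names come from the outline, but the class layout, the `record_request(tokens)` signature, and the defaults are illustrative rather than the project's actual code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelSpec:
    """Static configuration: one row per (provider, model)."""
    provider: str
    model_id: str
    rpm: int
    tpm: int
    rpd: int
    tpd: int
    groups: tuple[str, ...]
    reset_tz: str = "UTC"  # "America/Los_Angeles" for Gemini


@dataclass
class UsageBucket:
    """Mutable counters, one instance per slot."""
    rpm_used: int = 0
    tpm_used: int = 0
    rpd_used: int = 0
    tpd_used: int = 0
    fail_count: int = 0
    last_fail_ts: float = 0.0

    def record_request(self, tokens: int) -> None:
        # Hypothetical signature: one request consuming `tokens` tokens.
        self.rpm_used += 1
        self.rpd_used += 1
        self.tpm_used += tokens
        self.tpd_used += tokens


@dataclass
class Slot:
    """Smallest routing unit: spec + key + its own usage bucket."""
    spec: ModelSpec
    api_key: str
    usage: UsageBucket = field(default_factory=UsageBucket)


spec = ModelSpec("groq", "llama-3.3-70b-versatile", rpm=30, tpm=12_000,
                 rpd=1_000, tpd=100_000, groups=("chat", "merge"))
slot_a = Slot(spec, "gsk_key1")
slot_b = Slot(spec, "gsk_key2")  # same model, different key: separate bucket
slot_a.usage.record_request(tokens=2_500)
print(slot_a.usage.rpm_used, slot_b.usage.rpm_used)  # 1 0
```

Sharing one frozen `ModelSpec` across keys while giving each `Slot` its own bucket is what keeps usage independent per (provider, model, key).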

§ 05 Part 5: Five providers × 15 models, categorized and ranked

The 15 chat-capable models, plus two 8B models used only by the summarizer, split into 4 use-case groups (chat / merge / summarizer / vision). Within each group, models are ranked by capability for that task: chat by model size, merge by the maximum input it can fit (TPM-dominated), summarizer by output speed, vision by multimodal quality. Rate limits below are per slot; multiply by key count for the total contribution.

chat (main conversation) — sorted by model size, descending

#   Provider/Model                                    Size                 RPM  TPM   RPD   TPD
────────────────────────────────────────────────────────────────────────────────────────────
1   sambanova/DeepSeek-V3.2                           671B MoE / 37B act   20   100K  20    200K
2   sambanova/Llama-4-Maverick-17B-128E-Instruct      400B MoE / 17B act   20   100K  20    200K
3   cerebras/qwen-3-235b-a22b-instruct-2507           235B MoE / 22B act   30   60K   14.4K 1M
4   openrouter/nvidia/nemotron-3-super-120b:free      120B MoE             20   10K   50    2M
5   openrouter/openai/gpt-oss-120b:free               120B                 20   10K   50    2M
6   groq/openai/gpt-oss-120b                          120B                 30   8K    1K    200K
7   groq/meta-llama/llama-4-scout-17b-16e-instruct    109B MoE / 17B act   30   30K   1K    500K
8   openrouter/qwen/qwen3-next-80b-a3b:free           80B MoE / 3B act     20   10K   50    2M
9   sambanova/Meta-Llama-3.3-70B-Instruct             70B                  20   100K  20    200K
10  groq/llama-3.3-70b-versatile                      70B                  30   12K   1K    100K
11  openrouter/meta-llama/llama-3.3-70b:free          70B                  20   10K   50    2M
12  groq/qwen/qwen3-32b                               32B                  60   6K    1K    500K
13  groq/openai/gpt-oss-20b                           20B                  30   8K    1K    200K
14  gemini/gemini-2.5-flash                           Small but tuned      10   250K  250   50M
15  gemini/gemini-2.5-flash-lite                      Smallest             15   250K  1K    50M

merge (combined output) — sorted by max single-call merge size

Merge calls often pack 5-15K tokens at once (multiple sub-threads combined). TPM dominates — models with low TPM saturate the per-minute quota in a single merge.

#  Provider/Model                                  Max single merge   RPM  TPM   RPD   TPD
──────────────────────────────────────────────────────────────────────────────────────────
1  gemini/gemini-2.5-flash                         ~250K tokens       10   250K  250   50M
2  sambanova/Meta-Llama-3.3-70B-Instruct           ~100K tokens       20   100K  20    200K
3  sambanova/Llama-4-Maverick-17B-128E             ~100K tokens       20   100K  20    200K
4  sambanova/DeepSeek-V3.2                         ~100K tokens       20   100K  20    200K
5  cerebras/qwen-3-235b-a22b-instruct-2507         ~60K tokens        30   60K   14.4K 1M
6  groq/meta-llama/llama-4-scout-17b-16e           ~30K tokens        30   30K   1K    500K
7  groq/llama-3.3-70b-versatile                    ~12K tokens (full) 30   12K   1K    100K

summarizer (summaries / classification / formatting) — sorted by speed

Summarizer is for lightweight internal tasks (compact summaries, intent classification, JSON formatting). Inputs and outputs are short but calls are frequent — speed matters more than size.

#  Provider/Model                            Speed (tok/s)        RPM  TPM   RPD   TPD
───────────────────────────────────────────────────────────────────────────────────────
1  cerebras/llama3.1-8b                      ~3,000 (fastest)     30   60K   14.4K 1M
2  gemini/gemini-2.5-flash-lite              ~500, very high TPM  15   250K  1K    50M
3  groq/llama-3.1-8b-instant                 ~750                 30   6K    14.4K 500K

vision (image understanding) — sorted by multimodal quality

#  Provider/Model                                  Multimodal note           RPM  TPM   RPD   TPD
──────────────────────────────────────────────────────────────────────────────────────────────────
1  gemini/gemini-2.5-flash                         Native Google, best       10   250K  250   50M
2  groq/meta-llama/llama-4-scout-17b-16e           Llama 4 multimodal, fast  30   30K   1K    500K

Key complementarity: Gemini's two keys contribute 500K TPM per model (1M across both models) and 200M TPD (85% of the global daily budget), making Gemini the backbone of merge. SambaNova's three 100K-TPM models (including DeepSeek-V3.2) back up merge and serve the 70B, 400B, and 671B chat models. Cerebras pairs a 235B MoE with 60K TPM and very fast inference, and its 3 keys × 14.4K RPD form the chat-RPD backbone (about 77% of chat RPD). OpenRouter brings a 120B reasoning model (nvidia/nemotron-3-super-120b) and the long-context qwen3-next-80b. Groq is fast and ties Cerebras for the highest summarizer RPD (14.4K per slot), but has the smallest per-model TPM, which makes it best suited to high-frequency small requests.

§ 06 Part 6: Scoring mechanism

Each slot's availability score is computed in real-time by its UsageBucket:

import time

def score(self, spec: ModelSpec) -> float:
    """Availability of this slot: 0.0 = exhausted, 1.0 = full capacity."""
    # Remaining ratio per dimension
    rpm_r = (spec.rpm - self.rpm_used) / spec.rpm
    tpm_r = (spec.tpm - self.tpm_used) / spec.tpm
    rpd_r = (spec.rpd - self.rpd_used) / spec.rpd

    # Minimum (bucket effect): the tightest dimension determines availability.
    # Clamp at 0 so an overshot counter never produces a negative score.
    s = max(0.0, min(rpm_r, tpm_r, rpd_r))

    # Penalty for recent failures (30-second half-life)
    if self._fail_count > 0:
        # Assumes _last_fail_ts was recorded with time.monotonic()
        elapsed = time.monotonic() - self._last_fail_ts
        penalty = 0.5 ** (elapsed / 30)
        s *= (1 - penalty)

    return s

Scoring granularity is provider + model + key. The same Groq key's llama-70b and qwen3-32b have independent usage buckets, because Groq's rate limits are per-model per-key.
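
A worked example makes the scoring behavior concrete. The standalone function below restates the same formula with the elapsed time passed in explicitly so the numbers are reproducible; `availability` and its parameters are illustrative names, and the usage figures are made up for the example.

```python
def availability(rpm, tpm, rpd, rpm_used, tpm_used, rpd_used,
                 seconds_since_fail=None, half_life=30.0):
    """Min remaining ratio across dimensions, scaled down after a recent failure."""
    remaining = min((rpm - rpm_used) / rpm,
                    (tpm - tpm_used) / tpm,
                    (rpd - rpd_used) / rpd)
    s = max(0.0, remaining)
    if seconds_since_fail is not None:
        # Penalty starts at 1.0 right after a failure and halves every half_life seconds.
        s *= 1 - 0.5 ** (seconds_since_fail / half_life)
    return s


# Per-slot limits of groq/llama-3.3-70b-versatile: 30 RPM, 12K TPM, 1K RPD.
usage = dict(rpm_used=6, tpm_used=9_000, rpd_used=100)

healthy = availability(30, 12_000, 1_000, **usage)
just_failed = availability(30, 12_000, 1_000, **usage, seconds_since_fail=0)
after_30s = availability(30, 12_000, 1_000, **usage, seconds_since_fail=30)
after_60s = availability(30, 12_000, 1_000, **usage, seconds_since_fail=60)

print(healthy)      # 0.25   (ratios are 0.8, 0.25, 0.9; TPM is the tightest)
print(just_failed)  # 0.0    (full penalty immediately after a failure)
print(after_30s)    # 0.125  (half the penalty has decayed)
print(after_60s)    # 0.1875 (three quarters restored)
```

Note how a slot with plenty of RPM and RPD left still scores low once TPM is mostly consumed, which is the bucket effect the minimum is meant to capture.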

§ 07 Part 7: Selection and fallback flow

The complete request routing flow:

router.completion(group="chat", messages=...)
│
├── Step 1: Get all slots from the chat group
│   Sort by score descending, pick highest for the request
│   (add small random jitter to avoid thundering herd)
│
├── Success → return result
│
├── Failure (429/5xx) → mark failure, try next slot
│   ... all chat slots failed ...
│
├── Enter fallback chain
│   chat → summarizer
│   merge → chat → summarizer
│   summarizer → chat
│
└── All exhausted → pick soonest-recovery slot
    slot A: rpm full, 45 seconds until reset
    slot B: rpd full, 8 hours until reset
    → pick slot A
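
The flow above can be sketched in a few lines. This is a simplified model under stated assumptions: slots are plain dicts with precomputed `score` and `reset_in` (seconds until the tightest limit resets), `send` stands in for the real completion call, the 0.05 jitter width is an arbitrary choice, and setting the score to 0 is a stand-in for "mark failure". None of these names come from the actual SmartRouter code.

```python
import random

# Fallback chain as listed in the flow diagram.
FALLBACK = {
    "chat": ["summarizer"],
    "merge": ["chat", "summarizer"],
    "summarizer": ["chat"],
}


def candidates(slots, group, rng=random):
    """Slots in `group` with score > 0, best first; small jitter spreads ties."""
    live = [s for s in slots if group in s["groups"] and s["score"] > 0]
    return sorted(live, key=lambda s: s["score"] + rng.uniform(0, 0.05), reverse=True)


def route(slots, group, send, rng=random):
    """Try the requested group, then its fallback groups; `send` returns True on success."""
    for g in [group, *FALLBACK.get(group, [])]:
        for slot in candidates(slots, g, rng):
            if send(slot):
                return slot
            slot["score"] = 0.0  # stand-in for marking the failure
    # Everything exhausted: return the slot whose limit resets soonest.
    return min(slots, key=lambda s: s["reset_in"])


slots = [
    {"name": "groq/llama-3.3-70b:key1", "groups": {"chat", "merge"}, "score": 0.9, "reset_in": 45},
    {"name": "sambanova/DeepSeek-V3.2:key1", "groups": {"chat", "merge"}, "score": 0.4, "reset_in": 8 * 3600},
    {"name": "cerebras/llama3.1-8b:key1", "groups": {"summarizer"}, "score": 0.7, "reset_in": 60},
]

attempts = []

def flaky_send(slot):
    attempts.append(slot["name"])
    return slot["name"].startswith("cerebras")  # pretend only Cerebras answers

chosen = route(slots, "chat", flaky_send)
print(attempts)        # both chat slots are tried first, then the summarizer slot
print(chosen["name"])  # cerebras/llama3.1-8b:key1

# With every slot failing, the soonest-recovery slot (45 s) is returned.
last_resort = route(slots, "chat", lambda slot: False)
print(last_resort["name"])  # groq/llama-3.3-70b:key1
```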

§ 08 Part 8: Time-window auto-reset

UsageBucket counters use two windows. The minute window is a fixed 60-second window timed with time.monotonic() and reset once 60 seconds have passed since the last reset. The day window is aligned to the natural date boundary in spec.reset_tz (Gemini=America/Los_Angeles, others=UTC) and zeroes on rollover, so it matches the provider's actual 00:00 reset rather than drifting from process start.

  • rpm_used / tpm_used: auto-reset when ≥60 seconds since last reset
  • rpd_used / tpd_used: zero out when the provider's timezone rolls to a new date
  • Checked automatically before every record_request() and score() call
  • No timers or background threads needed — lazy reset, zero overhead
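
The lazy-reset idea in the list above can be illustrated with a small class. It is a sketch under assumptions: the class name `Window` is invented, the clock values are passed in explicitly to keep the example deterministic (real code would presumably call `time.monotonic()` and `datetime.now(timezone.utc)`), and a fixed UTC-8 offset stands in for America/Los_Angeles, which ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone


class Window:
    """Minute and day counters that reset lazily, on access."""

    def __init__(self, tz, now_mono, now_utc):
        self.tz = tz
        self.rpm_used = 0
        self.rpd_used = 0
        self._minute_start = now_mono
        self._day = now_utc.astimezone(tz).date()

    def _maybe_reset(self, now_mono, now_utc):
        # Minute window: reset once 60 seconds have passed since the last reset.
        if now_mono - self._minute_start >= 60:
            self.rpm_used = 0
            self._minute_start = now_mono
        # Day window: reset when the date changes in the provider's timezone.
        today = now_utc.astimezone(self.tz).date()
        if today != self._day:
            self.rpd_used = 0
            self._day = today

    def record_request(self, now_mono, now_utc):
        self._maybe_reset(now_mono, now_utc)
        self.rpm_used += 1
        self.rpd_used += 1


pt = timezone(timedelta(hours=-8))  # fixed-offset stand-in for America/Los_Angeles
t0 = datetime(2026, 4, 16, 7, 30, tzinfo=timezone.utc)  # 23:30 on Apr 15 at UTC-8

gemini = Window(pt, now_mono=0.0, now_utc=t0)
others = Window(timezone.utc, now_mono=0.0, now_utc=t0)

for mono, delta in [(0.0, 0), (10.0, 10), (61.0, 61), (2000.0, 31 * 60)]:
    now = t0 + timedelta(seconds=delta)
    gemini.record_request(mono, now)
    others.record_request(mono, now)

print(gemini.rpm_used, gemini.rpd_used)  # 1 1  (midnight passed at UTC-8)
print(others.rpm_used, others.rpd_used)  # 1 4  (still the same UTC date)
```

Because the check runs inside `record_request`, no timer or background thread is involved; a bucket that is never touched simply resets the next time it is read.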

§ 09 Part 9: Proactive vs reactive

This is the core difference between SmartRouter and traditional routing:

Old approach (LiteLLM Router, reactive):
  Send request → get 429 → retry next → 429 again → retry again
  Each 429 wastes 200-500ms round-trip
  User experience: occasional noticeable stutter

New approach (SmartRouter, proactive):
  Score all slots → pick best → send request
  Slots with score=0 are never selected, 429 probability drops sharply
  User experience: nearly invisible routing switches

§ 10 Part 10: Deployment configuration

Enabling any of the supported providers requires only its environment variable; SmartRouter builds slots from whatever keys it finds at startup:

# backend/.env — each value is a JSON array, supports multi-key stacking
GROQ_API_KEYS=["gsk_key1", "gsk_key2"]
CEREBRAS_API_KEYS=["csk_key1", "csk_key2"]
SAMBANOVA_API_KEYS=["sk_key1", "sk_key2"]
GEMINI_API_KEYS=["AIza_key1"]
OPENROUTER_API_KEYS=["sk-or-v1-key1", "sk-or-v1-key2"]

# Unconfigured providers produce zero slots, no impact on operation
# Restart after adding keys to take effect

GitHub Actions deployment automatically syncs keys to the server via Secrets — no manual SSH needed.

§ 11 Part 11: Health check

GET /health/providers/keys validates each (provider, key) against /v1/models and detects catalog drift without burning any LLM quota — the daily CI runs this endpoint. GET /health/providers/full actually fires a completion against every slot (consumes quota; used only for manual deep checks):

GET /health/providers/keys
{
  "total": 13,
  "ok": 12,
  "failed": 1,
  "results": [
    {"provider": "groq", "key": "gsk_abc1...", "ok": true, "models_seen": 6},
    {"provider": "cerebras", "key": "csk_xyz...", "ok": true, "models_seen": 2},
    {"provider": "sambanova", "key": "sk_...", "ok": false, "error": "401 Unauthorized"},
    ...
  ]
}
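
A daily CI job could turn that response into a pass/fail signal roughly as follows. How the real CI consumes the endpoint is not described here, so the helper `failed_keys` and the exit-code convention are assumptions; the sample is trimmed to the three entries shown above.

```python
def failed_keys(report: dict) -> list[str]:
    """Return one human-readable line per key that failed validation."""
    return [
        f'{r["provider"]} {r["key"]}: {r.get("error", "unknown error")}'
        for r in report["results"]
        if not r["ok"]
    ]


# Trimmed sample in the shape of the response above (3 of the 13 entries).
report = {
    "total": 3,
    "ok": 2,
    "failed": 1,
    "results": [
        {"provider": "groq", "key": "gsk_abc1...", "ok": True, "models_seen": 6},
        {"provider": "cerebras", "key": "csk_xyz...", "ok": True, "models_seen": 2},
        {"provider": "sambanova", "key": "sk_...", "ok": False, "error": "401 Unauthorized"},
    ],
}

problems = failed_keys(report)
for line in problems:
    print(line)  # sambanova sk_...: 401 Unauthorized

exit_code = 1 if problems else 0  # a CI step could exit with this code
```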

§ 12 Part 12: Actual capacity estimate

Current production config: 2 Groq + 3 Cerebras + 3 SambaNova + 2 Gemini + 3 OpenRouter = 13 keys, 43 slots total (weighted by per-provider model count: 2×6 + 3×2 + 3×3 + 2×2 + 3×4).

  • Chat group: 38 slots, aggregate ~920 RPM, ~2.33M TPM, ~56.5K RPD, ~232M TPD
  • Merge group: 18 slots, aggregate ~410 RPM, ~1.66M TPM, ~47.9K RPD, ~106M TPD
  • Summarizer group: 7 slots, aggregate ~180 RPM, ~692K TPM, ~74K RPD, ~104M TPD
  • Vision group: 4 slots, aggregate ~80 RPM, ~560K TPM, ~2.5K RPD, ~101M TPD
  • Global peak TPM: ~2.52M (≈ 42,000 tok/s sustained, roughly 1.5× a fully loaded 8×H100 DGX / ~12 saturated H100s)
  • Global daily quota: ~236M tokens/day; at 2.5K tokens/message, supports ~94K messages/day
  • Equivalent DAU: ~9,000-15,000 active users (6 conversation turns per user per day)
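
The chat-group aggregates in the list above follow directly from the per-slot limits in Part 5 multiplied by key counts. The short script below reproduces that arithmetic; the `CHAT` structure and `aggregate` helper are illustrative names, while the limits and key counts are the ones stated in this article.

```python
# provider -> (key count, [(rpm, tpm, rpd, tpd) per chat model])
CHAT = {
    "groq": (2, [
        (30, 8_000, 1_000, 200_000),    # gpt-oss-120b
        (30, 30_000, 1_000, 500_000),   # llama-4-scout
        (30, 12_000, 1_000, 100_000),   # llama-3.3-70b-versatile
        (60, 6_000, 1_000, 500_000),    # qwen3-32b
        (30, 8_000, 1_000, 200_000),    # gpt-oss-20b
    ]),
    "cerebras": (3, [(30, 60_000, 14_400, 1_000_000)]),
    "sambanova": (3, [(20, 100_000, 20, 200_000)] * 3),
    "gemini": (2, [
        (10, 250_000, 250, 50_000_000),    # gemini-2.5-flash
        (15, 250_000, 1_000, 50_000_000),  # gemini-2.5-flash-lite
    ]),
    "openrouter": (3, [(20, 10_000, 50, 2_000_000)] * 4),
}


def aggregate(pool):
    """Sum per-slot limits over every (model, key) pair."""
    slots, totals = 0, [0, 0, 0, 0]
    for keys, models in pool.values():
        slots += keys * len(models)
        for limits in models:
            for i, value in enumerate(limits):
                totals[i] += value * keys
    return slots, tuple(totals)


slots, (rpm, tpm, rpd, tpd) = aggregate(CHAT)
print(slots, rpm, tpm, rpd, tpd)  # 38 920 2328000 56480 231800000

cerebras_share = 3 * 14_400 / rpd      # about 0.765 of chat RPD
dau_rpd_bound = rpd // 6               # 9413 users at 6 turns per day
dau_token_bound = 236_000_000 // 2_500 // 6  # 15733 users from the global token budget
print(round(cerebras_share, 3), dau_rpd_bound, dau_token_bound)
```

The two bounds bracket the DAU range quoted above: requests per day limit the low end, the daily token budget the high end.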

Compared to the original Groq-only setup (300-600 DAU), five-provider stacking lifts capacity by ~25x. The bottleneck is chat-group RPD (56.5K/day, of which Cerebras's three slots contribute 77%): OpenRouter's free tier is only 50 RPD per model per key (4 models × 50 = 200 RPD/key), SambaNova is 20 RPD per model per key, and Gemini 2.5 Flash is 250 RPD per key. Highest-ROI next step: add $10 credit on OpenRouter, which lifts each of the 12 slots from 50 to 1000 RPD, a one-time cost for a permanent 20× boost.