How Many Users Can This Free-Tier Stack Handle? Capacity Limits and Optimization Paths

Deeppin runs entirely on free tiers: 5 LLM providers (Groq + Cerebras + SambaNova + Gemini + OpenRouter, stacked via SmartRouter), Oracle ARM, Supabase, Vercel. A component-by-component analysis of actual capacity limits, the current bottleneck, and the theoretical path to break through each one.

2026-04-15 · 18 min read · capacity · cost-optimization · architecture

Deeppin's infrastructure is entirely free: five free-tier LLM providers, Oracle Cloud Free Tier (ARM instance), Supabase free tier, Vercel Hobby. It costs nothing to run this AI application, so where are the limits?

§ 01Part 1 — LLM API (five providers) — capacity now ample, bottleneck dimension flipped

SmartRouter stacks 5 free providers (Groq, Cerebras, SambaNova, Gemini, OpenRouter). Production config: 13 keys, 43 slots total (2 Groq×6 + 3 Cerebras×2 + 3 SambaNova×3 + 2 Gemini×2 + 3 OpenRouter×4):

  • Global peak TPM: ~2.52M (≈ 42,000 tok/s sustained, roughly 1.5× a fully loaded 8×H100 DGX / ~12 saturated H100s)
  • Global daily RPD ceiling: ~128,000 requests (five providers combined)
  • Global daily token quota: ~236M tokens/day
  • At ~2.5K tokens per message, supports ~94,000 messages/day
  • Complete conversations per day: 9,400–18,800 (assuming 5–10 turns each)
  • Equivalent DAU: ~9,000–15,000 active users (6 turns/user/day)
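
The figures in this list follow from a few divisions. A minimal sketch of that arithmetic, using only numbers quoted in this section; the constant names are illustrative, and treating one turn as one ~2.5K-token message is an assumption:

```python
# Back-of-envelope capacity math from the figures quoted in this section.
DAILY_TOKEN_QUOTA = 236_000_000   # tokens/day across all five providers
PEAK_TPM = 2_520_000              # global peak tokens per minute
TOKENS_PER_MESSAGE = 2_500        # assumed average per message
CHAT_RPD = 56_500                 # chat-group requests/day quoted in this section
TURNS_PER_USER_PER_DAY = 6

tokens_per_second = PEAK_TPM // 60
messages_per_day = DAILY_TOKEN_QUOTA // TOKENS_PER_MESSAGE

# Conversations per day, assuming 5-10 turns each.
conversations_low = messages_per_day // 10
conversations_high = messages_per_day // 5

# Two different ceilings on daily active users.
dau_token_bound = messages_per_day // TURNS_PER_USER_PER_DAY
dau_request_bound = CHAT_RPD // TURNS_PER_USER_PER_DAY

print(tokens_per_second, messages_per_day)
print(conversations_low, conversations_high)
print(dau_request_bound, dau_token_bound)
```

With these inputs the token quota allows roughly 15,700 daily users and the chat-group request cap roughly 9,400, which brackets the quoted 9,000–15,000 range.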

The bottleneck is chat-group RPD — ~56.5K/day total, of which Cerebras's two 235B MoE slots contribute 77%. OpenRouter is 50 RPD per model per key (4 models × 50 = 200 RPD/key), SambaNova is 20 RPD per model per key, Gemini 2.5 Flash is 250 RPD per key, Flash-Lite is 1K RPD.
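
Per-slot daily caps like these suggest a quota-aware picker. SmartRouter's actual selection logic is not described in this article, so the following is only a hypothetical sketch of one plausible policy (send each request to the slot with the most remaining daily requests). The `Slot` class, `pick_slot` function, and model names are all invented for illustration; the limits are the per-key figures quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """Hypothetical (provider, key, model) slot with its own daily request cap."""
    provider: str
    model: str
    rpd_limit: int
    used_today: int = 0

    @property
    def remaining(self) -> int:
        return self.rpd_limit - self.used_today


def pick_slot(slots: List[Slot]) -> Optional[Slot]:
    """Pick the slot with the most remaining daily requests and count the request.

    Returns None when every slot has exhausted its daily quota.
    """
    available = [s for s in slots if s.remaining > 0]
    if not available:
        return None
    chosen = max(available, key=lambda s: s.remaining)
    chosen.used_today += 1
    return chosen


# Illustrative limits from the per-key figures above; model names are placeholders.
slots = [
    Slot("openrouter", "model-a", rpd_limit=50),
    Slot("sambanova", "model-b", rpd_limit=20),
    Slot("gemini", "gemini-2.5-flash", rpd_limit=250),
]
first = pick_slot(slots)
print(first.provider, first.remaining)
```

A real router would also need to reset counters daily and respect per-minute token limits; both are omitted here.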

Optimization path: (1) add $10 of credit on OpenRouter, which lifts each of its 12 slots from 50 to 1,000 RPD (20×, a one-time cost with a permanent effect) and is the highest-ROI move right now; (2) add a 4th Cerebras key to expand chat RPD, the tightest link, linearly; (3) add another Groq key to back up summarizer and vision bandwidth. End state: enable local inference (Ollama with a quantized model) as a last-resort fallback.

§ 02Part 2 — Compute (Oracle ARM) — surplus

Oracle Cloud Free Tier provides 4 ARM cores, 24GB RAM, permanently free. Actual Deeppin backend load:

  • FastAPI process: <5% single-core CPU (I/O-bound, mostly waiting for LLM provider responses)
  • Embedding model (bge-m3): ~570MB RAM, 50–200ms inference latency per request
  • SmartRouter: <100MB RAM (pure Python usage tracking, no separate process)
  • Total memory: ~2–3GB / 24GB (<15%)

Compute is surplus, not a bottleneck. The first hard limit on the free ARM instance would be network egress, and Oracle's ~10TB/month allowance is far more than this scale needs.

Optimization path: if higher embedding throughput is needed, run multiple embedding service instances on the same machine with load balancing. Compute can theoretically handle 10x the current user base.

§ 03Part 3 — Database (Supabase) — storage bottleneck

Supabase free tier: 500MB PostgreSQL storage, no compute limits (shared instance), 10,000 MAU for authentication.

  • Per message: ~500 bytes text + metadata ≈ 1KB
  • 10-turn conversation = 10KB; 1,000 conversations = 10MB
  • Vector data: each 1024-dim float32 vector = 4KB; 10K chunks = 40MB
  • 500MB holds ~45,000 conversations + 5,000 document vectors
  • Corresponds to ~7,500 active users (6 conversation histories each)
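
The storage estimate above can be reproduced with a short calculation. This is a rough sketch that counts only raw payload bytes; it assumes 1KB per message and binary units, and it ignores PostgreSQL row overhead and index size, so real usage will be somewhat higher.

```python
KB = 1024
MB = 1024 * KB

BYTES_PER_MESSAGE = 1 * KB        # ~500 bytes of text plus metadata, rounded up
MESSAGES_PER_CONVERSATION = 10    # a 10-turn conversation, as above
VECTOR_DIMS = 1024
BYTES_PER_FLOAT32 = 4


def storage_mb(conversations: int, vectors: int) -> float:
    """Estimate raw storage in MB for conversations plus embedding vectors."""
    conversation_bytes = conversations * MESSAGES_PER_CONVERSATION * BYTES_PER_MESSAGE
    vector_bytes = vectors * VECTOR_DIMS * BYTES_PER_FLOAT32
    return (conversation_bytes + vector_bytes) / MB


used = storage_mb(conversations=45_000, vectors=5_000)
users = 45_000 // 6               # 6 conversation histories per user

print(f"{used:.0f} MB of 500 MB, about {users} users")
```

The result lands just under the 500MB quota, which is why the estimate stops at ~45,000 conversations.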

Optimization: archive conversations older than 90 days (retain summaries); compress vector table with float16 half-precision (halves storage); use Supabase pg_cron for periodic archiving. At 400MB used, upgrade to the $25/month plan for 8GB storage.

§ 04Part 4 — Local RAG (Embedding + pgvector) — hidden costs

Deeppin's RAG pipeline is fully self-hosted: embedding model runs on Oracle ARM, vectors stored in Supabase pgvector. Superficially 'free', but two hidden costs need analysis.

Embedding inference: CPU throughput is the limit

bge-m3 is a 570MB sentence-transformer running on ARM CPU (no GPU). Single embed latency is 50–200ms depending on text length. Uploading a 10-page PDF creates ~50 chunks — sequential embedding takes 5–10 seconds.

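# Both variants run inside an async upload handler;
# `model`, `chunks`, and `save` come from the surrounding code.

# Current: sequential (slow)
for chunk in chunks:
    vec = model.encode(chunk)    # ~100ms each, one forward pass per chunk
    await save(chunk, vec)
# 50 chunks × 100ms = 5 seconds

# Optimized: batch inference
vecs = model.encode(chunks, batch_size=16)  # ~1.5 seconds for all 50 chunks
for chunk, vec in zip(chunks, vecs):
    await save(chunk, vec)       # persist each chunk with its vector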

Query-time embedding (single message → vector) takes ~50ms and is not a bottleneck. The bottleneck is batch chunk embedding during document upload.

Optimization: enable batch_size=16 for 3–4x throughput; cache embedding results (same chunk content skips re-embedding); float16 quantization halves memory from 570MB to ~290MB with ~40% latency reduction.
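
The caching idea (identical chunk content skips re-embedding) could look roughly like the sketch below. It is not Deeppin's implementation: `EmbeddingCache` and `fake_embed_batch` are hypothetical names, the cache is an in-memory dict keyed by a SHA-256 of the chunk text, and the stand-in embedder exists only so the example runs without a model. In practice the callable would wrap something like `model.encode(chunks, batch_size=16)`.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class EmbeddingCache:
    """Content-addressed cache: each distinct chunk text is embedded at most once."""

    def __init__(self, embed_batch: Callable[[Sequence[str]], List[Vector]]):
        self._embed_batch = embed_batch
        self._store: Dict[str, Vector] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, chunks: Sequence[str]) -> List[Vector]:
        keys = [self._key(c) for c in chunks]
        # Collect unique chunks that are not cached yet, preserving order.
        missing: Dict[str, str] = {}
        for key, chunk in zip(keys, chunks):
            if key not in self._store and key not in missing:
                missing[key] = chunk
        if missing:
            # One batched call covers every uncached chunk.
            vectors = self._embed_batch(list(missing.values()))
            for key, vec in zip(missing, vectors):
                self._store[key] = vec
        return [self._store[key] for key in keys]


calls: List[List[str]] = []


def fake_embed_batch(texts: Sequence[str]) -> List[Vector]:
    """Stand-in embedder that records what it was asked to embed."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


cache = EmbeddingCache(fake_embed_batch)
first = cache.embed(["alpha", "beta", "alpha"])
second = cache.embed(["beta", "gamma"])
print(calls)
```

Only "gamma" reaches the embedder on the second call. A persistent version would keep the hash in a database column next to the stored vector rather than in process memory.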

Vector storage: shares the 500MB quota with relational data

pgvector data shares Supabase's 500MB storage with regular relational data. Each 1024-dim float32 vector = 4KB. A 10-page PDF (~50 chunks) = 200KB vector storage. 100 documents = 20MB — 4% of total quota.

This scale is manageable, but conversation_memories (1–3 memory vectors extracted per AI reply) grows linearly over time: 1,000 conversations add ~2,000 memory vectors = 8MB.

Optimization: set TTL on conversation_memories, auto-purge vectors not accessed in 90 days; archive low-activity session chunks to Supabase Storage (free 1GB); pgvector 0.7+ supports halfvec type, halving vector storage.
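
To make the half-precision saving concrete, the sketch below packs one synthetic 1024-dim vector as float32 and as float16 using only the standard library. It illustrates raw payload size and round-trip error, not pgvector's on-disk layout, and the test vector is made up; the effect on retrieval quality would need to be measured on real embeddings.

```python
import struct

DIMS = 1024
# Synthetic values in [-1, 1); real embeddings will differ.
vector = [((i % 200) - 100) / 100.0 for i in range(DIMS)]

as_float32 = struct.pack(f"<{DIMS}f", *vector)   # 4 bytes per dimension
as_float16 = struct.pack(f"<{DIMS}e", *vector)   # 2 bytes per dimension

# Round-trip through float16 to see how much precision is lost.
restored = struct.unpack(f"<{DIMS}e", as_float16)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(len(as_float32), len(as_float16), max_error)
```

The payload drops from 4,096 to 2,048 bytes per vector, while the worst-case rounding error for values in this range stays below 0.001.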

RAG combined limits

  • Concurrent embedding requests: embedding runs in a single thread on the ARM CPU, so effective concurrency is 1 and additional requests queue
  • Max simultaneous uploads: ~2–3 (async, each taking 5–10 seconds)
  • Vector retrieval latency: 10–50ms (HNSW index, tens-of-thousands scale) — not a bottleneck
  • Storage capacity: ~500 documents (10 pages each) or ~25,000 conversation memories
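
One way to get the queueing behaviour described above without stalling the web server is to run the blocking encode call in a worker thread behind a semaphore. This is a sketch under assumptions: `encode_blocking` is a stand-in for the real model call, `EMBED_SLOTS = 1` mirrors the effective concurrency of 1, and `asyncio.to_thread` requires Python 3.9+.

```python
import asyncio
import time
from typing import List

EMBED_SLOTS = 1  # assumption: one embedding job at a time


def encode_blocking(chunks: List[str]) -> List[List[float]]:
    """Stand-in for a CPU-bound call such as model.encode(chunks, batch_size=16)."""
    time.sleep(0.05)
    return [[float(len(c))] for c in chunks]


async def embed_upload(chunks: List[str], gate: asyncio.Semaphore) -> List[List[float]]:
    # Uploads wait here for their turn; the event loop stays free to serve chat traffic.
    async with gate:
        return await asyncio.to_thread(encode_blocking, chunks)


async def main() -> List[List[List[float]]]:
    gate = asyncio.Semaphore(EMBED_SLOTS)
    uploads = [["a", "bb"], ["ccc"], ["dddd", "e"]]
    # Three uploads arrive together; they are embedded one after another.
    return await asyncio.gather(*(embed_upload(u, gate) for u in uploads))


results = asyncio.run(main())
print(results)
```

Raising `EMBED_SLOTS` only helps if the machine has spare cores and the model call releases the GIL, which depends on the inference backend.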

§ 05Part 5 — Frontend hosting (Vercel) — not a bottleneck

Vercel Hobby limits: 100GB bandwidth/month, no concurrency cap (serverless auto-scales). For Next.js static assets (<1MB/page), 100GB bandwidth supports ~100,000 page loads — far beyond current-stage needs.

Optimization: Vercel serves static assets from a global CDN, so little tuning is needed. If the backend API moves to Vercel, leverage Fluid Compute for cold-start optimization.

§ 06Part 6 — Combined capacity summary

Component              Free Limit                Utilization   Bottleneck?
LLM API (5 Providers)  ~9,000-15,000 DAU         Low           ★★  chat RPD
Supabase DB            ~7,500 registered         Low           ★★★ Storage-tight
Oracle ARM (CPU)       ~2-3 concurrent uploads   Very low      ★   On upload
Oracle ARM (memory)    ~6,000 connections        Very low      ·   Surplus
Vercel                 ~100K page loads/mo       Very low      ·   Surplus

At the current stage (early user validation), this stack is entirely sufficient. SmartRouter's five-provider stacking raised the LLM bottleneck from 300 DAU (single provider) to 9,000+ DAU — LLM is no longer the tightest link. Supabase's 500MB storage cap (~7,500 registered users) is now the next bound.

§ 07Part 7 — Bottleneck breakthrough priority

  • Priority 1: Add $10 credit on OpenRouter — the 12 slots jump from 50 to 1000 RPD each (20×, one-time cost, permanent)
  • Priority 2: When Supabase reaches 400MB, upgrade to $25/month for 8GB storage
  • Priority 3: Add a 4th Cerebras key to expand chat RPD (Cerebras contributes 77% of chat RPD; linear scaling, zero cost)
  • Priority 4: Enable batch embedding + result caching — 3–4x upload throughput
  • Priority 5: Implement LLM response caching (similar questions hit cache, saving 60–80% of requests)
  • End state: self-hosted quantized models (llama.cpp) + paid API hybrid routing
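
Priority 5 (similar questions hit a cache) can be sketched as a lookup over stored query embeddings. Everything here is hypothetical: the `ResponseCache` class, the 0.95 cosine threshold, and the linear scan are illustrative choices, and the two-dimensional vectors stand in for real bge-m3 embeddings. The achievable hit rate depends heavily on the threshold and on how repetitive real traffic is.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ResponseCache:
    """Return a stored answer when a new query is close enough to an old one."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def put(self, query_vec: Vector, answer: str) -> None:
        self._entries.append((query_vec, answer))

    def get(self, query_vec: Vector) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = ResponseCache()
cache.put([1.0, 0.0], "cached answer")
hit = cache.get([0.99, 0.05])    # nearly the same direction as the stored query
miss = cache.get([0.0, 1.0])     # orthogonal to the stored query
print(hit, miss)
```

At production scale the linear scan would be replaced by the existing pgvector index, and cached answers would need an expiry policy so stale responses are not served indefinitely.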

For an AI startup, this zero-cost architecture keeps capital where it matters most — product development and user growth — rather than infrastructure bills. By the time scale demands spending, the system is validated and the investment is justified.