Article · architecture

What Happens After You Send a Message: Full Data Path

From the moment you click send to the first token appearing on screen — what Deeppin does at every step. Frontend, network, backend, context assembly, RAG, LLM, SSE, persistence, and the boundary between parallel and sequential work.

2026-04-1519 min readarchitecturedata-pathSSE

From the moment you click send to the AI's reply appearing on screen, a message travels through multiple stages — each with its own concurrency structure. Mapping this path completely makes many 'why is it designed this way?' questions answer themselves.

§ 01Part 1 — Frontend: pre-flight preparation (~0ms, synchronous)

Before any network request leaves the browser, the frontend does three things:

Insert the message into Zustand store, immediately render to UI (optimistic update — user sees the message as sent)
Mark current thread as isStreaming=true, disable input box
Create AbortController, bind to this request (for user cancellation)

const handleSend = async (content: string) => {
  addMessage(threadId, { role: "user", content });  // optimistic
  setStreaming(threadId, true);
  
  const abort = new AbortController();
  abortRefs.current[threadId] = abort;
  
  const res = await fetch(`/api/threads/${threadId}/chat`, {
    method: "POST",
    body: JSON.stringify({ message: content }),
    signal: abort.signal,
  });
  
  await consumeStream(res.body!, threadId);
};

§ 02Part 2 — Network: Vercel → Oracle (~50–150ms RTT)

The request goes from browser to Vercel edge, then through Nginx reverse proxy to FastAPI. Critical configuration: proxy_buffering off so SSE tokens aren't accumulated, keep-alive to avoid TCP re-establishment, proxy_read_timeout 300s for slow LLM responses.

§ 03Part 3 — Backend entry: FastAPI routing (~1ms)

FastAPI verifies the JWT, extracts thread_id and message, and immediately returns a StreamingResponse. This object wraps an async generator — FastAPI continuously pulls from the generator and writes to the HTTP response body.

@router.post("/threads/{thread_id}/chat")
async def chat(thread_id: str, body: ChatRequest, user=Depends(verify_jwt)):
    return StreamingResponse(
        chat_stream(thread_id, body.message, user.id),
        media_type="text/event-stream",
        headers={"X-Accel-Buffering": "no"},
    )

§ 04Part 4 — Context assembly (sequential, 10–50ms)

The first thing the generator does — and the logically most complex step. build_context assembles the message list based on where the current thread sits in the tree:

async def build_context(thread_id: str) -> list[dict]:
    thread = await get_thread(thread_id)
    
    if thread.parent_id is None:
        return await build_main_context(thread_id)  # recent 10 messages + summary prefix
    
    ancestors = await get_ancestor_chain(thread_id)
    budgets = compute_budgets(len(ancestors))  # [300, 200, 100, 50...]
    
    context = []
    for i, ancestor in enumerate(ancestors):
        # Cache hit: no LLM call. Cache miss: compute in real-time.
        summary = await get_or_create_summary(ancestor.id, budgets[i])
        context.append(system_msg(summary))
    
    context.append(anchor_msg(thread.anchor_text))
    context.extend(await get_recent_messages(thread_id))
    return context

Ancestor summaries are almost always cache hits (maintained at write time), so build_context is pure DB reads — no extra LLM calls.

§ 05Part 5 — Parallel phase: RAG retrieval + query detection (~50–200ms)

After context assembly, RAG retrieval and search query detection run concurrently — they're independent, no reason to serialize:

rag_chunks, needs_search = await asyncio.gather(
    retrieve_rag(thread_id, message),
    should_search(message),
)

async def retrieve_rag(thread_id, query):
    query_vec = embedding_model.encode(query)   # ~50ms
    return await pgvector_search(thread_id, query_vec, top_k=5)  # ~20ms

if needs_search:
    search_results = await searxng_search(message)
    context.append(format_search_context(search_results))

Retrieved RAG chunks are injected into a system message in the context, after ancestor summaries but before the user's message. If no relevant chunks are found (similarity below threshold), they're silently omitted — no effect on normal conversation.

§ 06Part 6 — LLM call: LiteLLM Router → Groq (50–500ms to first token)

Context is assembled. LiteLLM Router selects the deployment with the most remaining quota (usage-based-routing) and sends a stream=True request to Groq:

async for chunk in await router.acompletion(
    model="chat", messages=context, stream=True, max_tokens=2048
):
    token = chunk.choices[0].delta.content
    if token:
        tokens_buffer.append(token)
        yield f"data: {json.dumps({'type':'token','text':token})}\n\n"

Each token flows: Groq → LiteLLM → FastAPI generator → Nginx (no buffering) → browser ReadableStream → Zustand appendToken → React re-render. End-to-end per-token latency: ~5–20ms.

§ 07Part 7 — Stream end: parallel persistence (non-blocking)

After the generator yields [DONE], two concurrent DB writes and two background tasks fire — none of them block the response from closing:

yield f"data: {json.dumps({'type':'done'})}\n\n"

# Concurrent DB writes
await asyncio.gather(
    save_user_message(thread_id, user_msg),
    save_assistant_message(thread_id, full_reply),
)

# Fire-and-forget background tasks
asyncio.create_task(update_summary_cache(thread_id))
asyncio.create_task(extract_memory(thread_id, full_reply))

Note: user message and AI reply are persisted after the stream ends, not at send time. This ensures only complete message pairs are stored — no half-pairs from interrupted streams.

§ 08Part 8 — Full timeline

Fig. 1·message-datapath

§ 09Part 9 — Call count by scenario

The number of external calls varies significantly by scenario:

Scenario                       DB reads  DB writes  LLM calls  embeds  SearXNG
────────────────────────────────────────────────────────────────────────────────
Main thread, no attachments      3-4        2          1         1        0
Main thread, with attachments    3-4        2          1         2        0
Main thread, web search          3-4        2         1-2        2        1
Sub-thread (depth 1)             4-5        2          1         1        0
Sub-thread (depth 3)             6-8        2          1         1        0
────────────────────────────────────────────────────────────────────────────────
※ LLM calls: 1 main conversation + optional 1 for query detection (summarizer tier)
※ embeds: 1 to vectorize the query; +1 if web search triggers its own embed
※ background tasks (summary update, memory extraction) not on the critical path

iThe critical path bottlenecks are Groq's first-token latency (50–500ms) and network RTT (50–150ms). DB reads are typically the fastest step — Supabase shared instance runs ~20–80ms per query.

§ 10Part 10 — The parallel boundaries that matter

Two places in this pipeline have the most carefully considered concurrency boundaries:

RAG retrieval and query detection are concurrent: both prepare material for the LLM, have no dependency on each other — saves 50–200ms
DB writes happen after stream end, and both INSERTs are concurrent: they don't occupy any of the LLM streaming window
Summary cache update is a fire-and-forget background task: never affects the current request's response time

Every serial step has a reason (dependency). Every parallel step has a reason (no dependency + measurable gain). This explicit awareness of parallelism boundaries is the core mental model for writing high-performance async services.

§ 11Part 11 — Full component chain

The diagram below maps every component and module in the Deeppin codebase that participates in this path — from the moment the user hits send to post-stream persistence. Each box is a real file or function. Orange = concurrent execution. Indigo (accent) = critical path bottleneck.

Fig. 2·component-chain