What Happens After You Send a Message: Full Data Path
From the moment you click send to the first token appearing on screen — what Deeppin does at every step. Frontend, network, backend, context assembly, RAG, LLM, SSE, persistence, and the boundary between parallel and sequential work.
From the moment you click send to the AI's reply appearing on screen, a message travels through multiple stages — each with its own concurrency structure. Mapping this path completely makes many 'why is it designed this way?' questions answer themselves.
§ 01Part 1 — Frontend: pre-flight preparation (~0ms, synchronous)
Before any network request leaves the browser, the frontend does three things:
- Insert the message into Zustand store, immediately render to UI (optimistic update — user sees the message as sent)
- Mark current thread as isStreaming=true, disable input box
- Create AbortController, bind to this request (for user cancellation)
const handleSend = async (content: string) => {
addMessage(threadId, { role: "user", content }); // optimistic
setStreaming(threadId, true);
const abort = new AbortController();
abortRefs.current[threadId] = abort;
const res = await fetch(`/api/threads/${threadId}/chat`, {
method: "POST",
body: JSON.stringify({ message: content }),
signal: abort.signal,
});
await consumeStream(res.body!, threadId);
};§ 02Part 2 — Network: Vercel → Oracle (~50–150ms RTT)
The request goes from browser to Vercel edge, then through Nginx reverse proxy to FastAPI. Critical configuration: proxy_buffering off so SSE tokens aren't accumulated, keep-alive to avoid TCP re-establishment, proxy_read_timeout 300s for slow LLM responses.
§ 03Part 3 — Backend entry: FastAPI routing (~1ms)
FastAPI verifies the JWT, extracts thread_id and message, and immediately returns a StreamingResponse. This object wraps an async generator — FastAPI continuously pulls from the generator and writes to the HTTP response body.
@router.post("/threads/{thread_id}/chat")
async def chat(thread_id: str, body: ChatRequest, user=Depends(verify_jwt)):
return StreamingResponse(
chat_stream(thread_id, body.message, user.id),
media_type="text/event-stream",
headers={"X-Accel-Buffering": "no"},
)§ 04Part 4 — Context assembly (sequential, 10–50ms)
The first thing the generator does — and the logically most complex step. build_context assembles the message list based on where the current thread sits in the tree:
async def build_context(thread_id: str) -> list[dict]:
thread = await get_thread(thread_id)
if thread.parent_id is None:
return await build_main_context(thread_id) # recent 10 messages + summary prefix
ancestors = await get_ancestor_chain(thread_id)
budgets = compute_budgets(len(ancestors)) # [300, 200, 100, 50...]
context = []
for i, ancestor in enumerate(ancestors):
# Cache hit: no LLM call. Cache miss: compute in real-time.
summary = await get_or_create_summary(ancestor.id, budgets[i])
context.append(system_msg(summary))
context.append(anchor_msg(thread.anchor_text))
context.extend(await get_recent_messages(thread_id))
return contextAncestor summaries are almost always cache hits (maintained at write time), so build_context is pure DB reads — no extra LLM calls.
§ 05Part 5 — Parallel phase: RAG retrieval + query detection (~50–200ms)
After context assembly, RAG retrieval and search query detection run concurrently — they're independent, no reason to serialize:
rag_chunks, needs_search = await asyncio.gather(
retrieve_rag(thread_id, message),
should_search(message),
)
async def retrieve_rag(thread_id, query):
query_vec = embedding_model.encode(query) # ~50ms
return await pgvector_search(thread_id, query_vec, top_k=5) # ~20ms
if needs_search:
search_results = await searxng_search(message)
context.append(format_search_context(search_results))Retrieved RAG chunks are injected into a system message in the context, after ancestor summaries but before the user's message. If no relevant chunks are found (similarity below threshold), they're silently omitted — no effect on normal conversation.
§ 06Part 6 — LLM call: LiteLLM Router → Groq (50–500ms to first token)
Context is assembled. LiteLLM Router selects the deployment with the most remaining quota (usage-based-routing) and sends a stream=True request to Groq:
async for chunk in await router.acompletion(
model="chat", messages=context, stream=True, max_tokens=2048
):
token = chunk.choices[0].delta.content
if token:
tokens_buffer.append(token)
yield f"data: {json.dumps({'type':'token','text':token})}\n\n"Each token flows: Groq → LiteLLM → FastAPI generator → Nginx (no buffering) → browser ReadableStream → Zustand appendToken → React re-render. End-to-end per-token latency: ~5–20ms.
§ 07Part 7 — Stream end: parallel persistence (non-blocking)
After the generator yields [DONE], two concurrent DB writes and two background tasks fire — none of them block the response from closing:
yield f"data: {json.dumps({'type':'done'})}\n\n"
# Concurrent DB writes
await asyncio.gather(
save_user_message(thread_id, user_msg),
save_assistant_message(thread_id, full_reply),
)
# Fire-and-forget background tasks
asyncio.create_task(update_summary_cache(thread_id))
asyncio.create_task(extract_memory(thread_id, full_reply))Note: user message and AI reply are persisted after the stream ends, not at send time. This ensures only complete message pairs are stored — no half-pairs from interrupted streams.
§ 08Part 8 — Full timeline
§ 09Part 9 — Call count by scenario
The number of external calls varies significantly by scenario:
Scenario DB reads DB writes LLM calls embeds SearXNG ──────────────────────────────────────────────────────────────────────────────── Main thread, no attachments 3-4 2 1 1 0 Main thread, with attachments 3-4 2 1 2 0 Main thread, web search 3-4 2 1-2 2 1 Sub-thread (depth 1) 4-5 2 1 1 0 Sub-thread (depth 3) 6-8 2 1 1 0 ──────────────────────────────────────────────────────────────────────────────── ※ LLM calls: 1 main conversation + optional 1 for query detection (summarizer tier) ※ embeds: 1 to vectorize the query; +1 if web search triggers its own embed ※ background tasks (summary update, memory extraction) not on the critical path
§ 10Part 10 — The parallel boundaries that matter
Two places in this pipeline have the most carefully considered concurrency boundaries:
- RAG retrieval and query detection are concurrent: both prepare material for the LLM, have no dependency on each other — saves 50–200ms
- DB writes happen after stream end, and both INSERTs are concurrent: they don't occupy any of the LLM streaming window
- Summary cache update is a fire-and-forget background task: never affects the current request's response time
Every serial step has a reason (dependency). Every parallel step has a reason (no dependency + measurable gain). This explicit awareness of parallelism boundaries is the core mental model for writing high-performance async services.
§ 11Part 11 — Full component chain
The diagram below maps every component and module in the Deeppin codebase that participates in this path — from the moment the user hits send to post-stream persistence. Each box is a real file or function. Orange = concurrent execution. Indigo (accent) = critical path bottleneck.