Response Optimizations That Matter for User Experience
From UUID pre-generation and SSE streaming to Nginx configuration and frontend rendering — every engineering detail that affects perceived speed.
Perceived speed ≠ actual latency. A system that starts showing content after 2 seconds feels slower than one that starts streaming character-by-character after 0.5 seconds, even if total completion time is the same. Deeppin's optimization focus is Time to First Token and eliminating the sense of waiting.
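To make the distinction concrete, Time to First Token can be tracked separately from total completion time. The following is a minimal sketch, not Deeppin's actual instrumentation: `StreamTimer` and `StreamTimings` are hypothetical names, and the clock is injectable so the logic can be checked deterministically.

```typescript
// Hypothetical helper: records when the first non-empty token arrives
// and how long the whole stream took.
interface StreamTimings {
  ttftMs: number | null; // null if no token was ever received
  totalMs: number;
}

class StreamTimer {
  private now: () => number;
  private startedAt: number;
  private firstTokenAt: number | null = null;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
    this.startedAt = now();
  }

  // Call once per received token; only the first non-empty one is timed.
  onToken(token: string): void {
    if (this.firstTokenAt === null && token.length > 0) {
      this.firstTokenAt = this.now();
    }
  }

  // Call when the stream ends.
  finish(): StreamTimings {
    const end = this.now();
    return {
      ttftMs: this.firstTokenAt === null ? null : this.firstTokenAt - this.startedAt,
      totalMs: end - this.startedAt,
    };
  }
}
```

Two systems can report the same `totalMs` while differing widely in `ttftMs`, and the second number is the one users feel.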
§ 01 Part 1 — UUID pre-generation: zero-wait new conversations
Traditional flow: click 'New Chat' → request creates session → wait for DB write → navigate. That's 200–600ms of perceived waiting.
Deeppin's approach: generate a UUID client-side immediately after login. On click, navigate instantly using that UUID. The chat page creates the DB record lazily on initialization.
// Pre-generate a session ID ahead of time (called right after login).
const prewarm = () => { prewarmedRef.current = crypto.randomUUID(); };

const handleNewChat = () => {
  // Use the pre-generated ID; fall back to a fresh one if none is ready.
  const id = prewarmedRef.current ?? crypto.randomUUID();
  prewarmedRef.current = null;
  router.push(`/chat/${id}`); // immediate navigation
  prewarm();                  // pre-generate for next time
};

The 200–600 ms wait becomes effectively 0 ms of perceived latency.
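The lazy record creation on the chat page has to be idempotent, because the page may initialize more than once for the same ID (a reload, or a double-invoked effect). The article does not show this part, so the following is only a sketch under assumptions: `SessionStore`, `ensureSession`, and the `inFlight` map are hypothetical names, and the real persistence layer may look quite different.

```typescript
// Hypothetical storage interface; stands in for whatever API creates the DB record.
interface SessionStore {
  exists(id: string): Promise<boolean>;
  create(id: string): Promise<void>;
}

// Creations currently in progress, keyed by session ID,
// so concurrent callers share a single write.
const inFlight = new Map<string, Promise<void>>();

// Create the session record only if it does not exist yet.
async function ensureSession(id: string, store: SessionStore): Promise<void> {
  const pending = inFlight.get(id);
  if (pending) return pending;

  const task = (async () => {
    if (!(await store.exists(id))) {
      await store.create(id);
    }
  })().finally(() => inFlight.delete(id));

  inFlight.set(id, task);
  return task;
}
```

Because the client chose the UUID, the chat page can call `ensureSession(id, store)` on initialization without blocking navigation; calling it again for the same ID does not create a second record.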
§ 02 Part 2 — Initial message passing
When a user types a message on the home page and clicks send, the app needs to navigate to the chat page and send that message from there. The message is passed across pages via sessionStorage:
// Home page: save message then navigate
sessionStorage.setItem("deeppin:pending-msg", message.trim());
router.push(`/chat/${id}`);

// Chat page: read on initialization
const pending = sessionStorage.getItem("deeppin:pending-msg");
if (pending) {
  sessionStorage.removeItem("deeppin:pending-msg");
  await sendMessage(pending);
}

§ 03 Part 3 — SSE streaming
LLM responses stream token-by-token via SSE rather than waiting for full generation. Users see content appear while the LLM is still generating.
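On the wire, each SSE event is a `data:` line terminated by a blank line, and a network chunk can end in the middle of an event. The sketch below illustrates that framing only; `formatSseEvent` and `SseParser` are hypothetical names, and the parser handles just the `data:` field with `\n` line endings (the full SSE format has more fields and also allows `\r\n`).

```typescript
// Encode one event as "data: <json>\n\n".
function formatSseEvent(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Hypothetical incremental parser: incomplete text is buffered
// until the blank line that terminates the event arrives.
class SseParser {
  private buffer = "";

  // Feed one network chunk; returns the data of every event completed so far.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const events: string[] = [];
    let end = this.buffer.indexOf("\n\n");
    while (end !== -1) {
      const frame = this.buffer.slice(0, end);
      this.buffer = this.buffer.slice(end + 2);
      const data = frame
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, "")) // drop "data:" and one optional space
        .join("\n");
      if (data.length > 0) events.push(data);
      end = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}
```

The browser's EventSource performs this parsing natively, so application code normally never sees the raw frames; the sketch is only meant to show why each token can be delivered as soon as its blank line is flushed.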
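import json
from fastapi.responses import StreamingResponse

# Inside the route handler: FastAPI async generator
async def stream_response():
    # `router` is the app's LLM client; the streaming call must return an async iterator here
    async for chunk in router.completion(**params, stream=True):
        token = chunk.choices[0].delta.content or ""
        if token:
            yield f"data: {json.dumps({'type':'token','text':token})}\n\n"
    # Sentinel event that tells the client the response is complete
    yield "data: [DONE]\n\n"

return StreamingResponse(stream_response(), media_type="text/event-stream")

The frontend receives tokens via EventSource and appends each one to the current message: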
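const source = new EventSource(`/api/threads/${threadId}/chat`);
source.onmessage = (e) => {
  // The final "[DONE]" sentinel is not JSON, so handle it before parsing.
  if (e.data === "[DONE]") {
    source.close(); // otherwise EventSource reconnects automatically
    return;
  }
  const data = JSON.parse(e.data);
  if (data.type === "token") {
    setCurrentMessage(prev => prev + data.text);
  }
};

§ 04 Part 4 — Nginx: buffering must be disabled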
This is the most common mistake. Nginx buffers proxy responses by default, accumulating chunks before forwarding. For SSE, this means tokens batch up and arrive all at once — destroying the streaming effect.
location / {
    proxy_pass http://localhost:8000;
    proxy_buffering off;        # critical
    proxy_cache off;            # critical
    proxy_read_timeout 300s;    # LLM can be slow
    proxy_http_version 1.1;
    proxy_set_header Connection "";
}

§ 05 Part 5 — Per-thread stream state in Zustand
Deeppin supports concurrent streaming across multiple threads (main thread and several pins simultaneously). Each thread has isolated stream state keyed by threadId in Zustand, so concurrent streams never interfere:
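// useStreamStore.ts
import { create } from "zustand";

interface StreamState {
  isStreaming: boolean;
  content: string;
  error: string | null;
}

interface StreamStore {
  streams: Record<string, StreamState>;
  appendToken: (threadId: string, token: string) => void;
  setStreaming: (threadId: string, value: boolean) => void;
}

// Defaults for a thread that has no stream state yet
const emptyStream: StreamState = { isStreaming: false, content: "", error: null };

export const useStreamStore = create<StreamStore>()((set) => ({
  streams: {},
  // Isolated by threadId: updating one thread never touches another
  appendToken: (threadId, token) =>
    set((state) => {
      const prev = state.streams[threadId] ?? emptyStream;
      return {
        streams: { ...state.streams, [threadId]: { ...prev, content: prev.content + token } },
      };
    }),
  setStreaming: (threadId, value) =>
    set((state) => ({
      streams: {
        ...state.streams,
        [threadId]: { ...(state.streams[threadId] ?? emptyStream), isStreaming: value },
      },
    })),
}));

§ 06 Part 6 — Streaming Markdown rendering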
Markdown markers such as **bold** look malformed mid-stream, because the opening ** arrives before its closing pair. The solution: show raw text while the response is streaming, and offer a toggle to rendered Markdown once it completes. From then on, users can switch between the two views at any time.
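The problem and the view-mode rule can both be expressed as small pure functions. This is a sketch rather than Deeppin's actual code: `hasUnclosedBold`, `resolveViewMode`, and `ViewMode` are hypothetical names, the marker check is a simple heuristic that ignores code spans, and the rule assumes raw text is always shown while a stream is active.

```typescript
type ViewMode = "raw" | "rendered";

// True when the text contains an odd number of "**" markers, i.e. a bold
// span has been opened but its closing pair has not arrived yet.
function hasUnclosedBold(text: string): boolean {
  const markers = text.match(/\*\*/g) ?? [];
  return markers.length % 2 === 1;
}

// Raw text while streaming; the user's preferred mode once the stream is done.
function resolveViewMode(isStreaming: boolean, preferred: ViewMode): ViewMode {
  return isStreaming ? "raw" : preferred;
}
```

For example, `hasUnclosedBold("This is **bol")` is true mid-stream and becomes false once the full `"This is **bold**"` has arrived, which is why rendering is deferred until the stream ends.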