Deeppin System Components: What Every Part Does and How They Work Together
From Nginx to Supabase, from SmartRouter to bge-m3 — a component-by-component breakdown of what each part is, why it exists, when it's invoked, and how data flows through it.
Deeppin uses over a dozen components working together to deliver the "pin-and-explore" deep thinking experience. This article introduces each component's responsibility, when it's invoked, and how it relates to other parts. We start with the big picture, then go layer by layer.
§ 01Architecture overview
User Browser
|
|-- HTTPS --> Vercel (Next.js frontend)
| |-- components/ UI rendering
| |-- stores/ Zustand state
| +-- lib/sse.ts SSE client
|
+-- HTTPS --> Oracle Cloud (Docker Compose)
|-- Nginx Reverse proxy + TLS
|-- FastAPI Backend main process
| |-- routers/ 9 route modules
| |-- services/ 7 service modules
| +-- db/ Supabase connector
+-- SearXNG Search engineBelow, we walk through each component in the order data flows through them — from the outside in.
§ 02Part 1 — Infrastructure layer
1.1 Nginx — Reverse proxy
Nginx is the first door every request passes through to reach the backend. It handles TLS termination (Let's Encrypt certificates), HTTP-to-HTTPS redirection, and forwarding requests to FastAPI.
The most critical configuration for Deeppin is the three SSE-related directives:
proxy_buffering off; # Disable buffering — otherwise SSE streams batch up proxy_cache off; # Disable caching proxy_read_timeout 300s; # LLM generation can be slow
When it works: every single request reaching the backend passes through Nginx. It's the always-on gateway.
1.2 Docker Compose — Container orchestration
Three services (backend, searxng, nginx) are managed by Docker Compose. Startup order is guaranteed through a healthcheck chain:
backend + searxng start in parallel
|
v
backend healthcheck passes
(/health aggregates searxng + supabase + embedding + groq checks)
|
v
nginx starts, begins accepting trafficThe backend healthcheck runs every 15 seconds with a 45-second start_period (to give the embedding model time to load). Nginx sets depends_on: backend: condition: service_healthy, ensuring users never hit a half-initialized service.
1.3 Supabase — Database + Auth
Supabase provides two core capabilities:
- PostgreSQL database: stores sessions, threads, messages, thread_summaries, attachment_chunks, and conversation_memories — six tables
- Auth: user signup/login, JWT issuance and verification, RLS (Row Level Security) for per-user data isolation
The backend accesses Supabase with two keys: service_role_key (admin privileges, for JWT verification and backend operations) and anon_key (user-scoped, works with RLS). The frontend only holds the anon_key.
When it works: nearly every API request reads from or writes to Supabase — creating sessions, saving messages, reading history, storing and searching vectors.
1.4 SearXNG — Meta search engine
SearXNG is a self-hosted search engine aggregator. It sends the user's query to Google, Bing, DuckDuckGo, and other engines simultaneously, then deduplicates and returns the results.
Deeppin uses it for the "web search" feature: when a question needs real-time information (news, prices, weather, etc.), the backend searches via SearXNG first, then feeds the results to an LLM for synthesis.
# Backend calls via JSON API GET http://searxng:8080/search?q=query&format=json
When it works: only when classify_search_intent() determines the user's question needs live information. Regular conversations never trigger it.
§ 03Part 2 — Backend route layer (routers/)
FastAPI's 9 router modules each handle one category of API endpoints:
2.1 health.py — Health checks
Exposes /health (aggregated dependency status) and /health/providers (individually verifies every LLM provider+key combination). Docker healthchecks and CI smoke tests both depend on it.
When called: Docker automatically calls /health every 15 seconds; the smoke test calls it post-deployment; also useful for manual system status checks.
2.2 sessions.py — Session management
CRUD operations: create session, list a user's sessions, get session details (with thread tree), delete session. Also supports bulk-fetching all messages under a session (used for merge output).
When called: when the user creates a new conversation, opens a past one, or deletes one.
2.3 threads.py — Thread management
Creates sub-threads (pins), fetches thread details and message history, generates follow-up suggestions. Creating a sub-thread asynchronously triggers LLM-generated title and suggested questions.
When called: when a user selects text and pins it to create a sub-thread; when opening a sub-thread to view history.
2.4 stream.py — SSE streaming endpoint
The core endpoint: POST /api/threads/:id/chat. Receives the user message and returns an SSE stream. It bridges the frontend to stream_manager.
When called: every time the user sends a message, whether in the main thread or a sub-thread.
2.5 search.py — Web search
An SSE streaming endpoint. First queries SearXNG, then has the LLM answer the user's question based on search results. Supports both automatic intent detection and manual trigger.
When called: auto-triggered when a question is classified as needing real-time information, or when the user explicitly clicks the search button.
2.6 merge.py — Merge output
Collects the main thread and all sub-thread conversations, then has the LLM merge them into structured output (free summary / bullet points / structured analysis / raw transcript). Streamed.
When called: when the user clicks the "Merge Output" button.
2.7 attachments.py — File upload
Receives user-uploaded files and hands them to attachment_processor (text extraction → chunking → embedding → DB storage).
When called: when the user uploads a file (PDF, Word, code files, etc.) during a conversation.
2.8 relevance.py — Relevance assessment
Before merge output, the LLM evaluates each sub-thread's relevance to the main thread, deciding which sub-threads should be selected by default for merging.
When called: automatically invoked once when the user opens the merge panel, before rendering.
2.9 users.py — User configuration
Gets and updates user metadata (preferences, settings). Built on Supabase Auth's user_metadata field.
When called: when the user modifies their personal settings.
§ 04Part 3 — Backend service layer (services/)
The route layer handles requests and responses; the service layer handles the actual business logic. The 7 service modules are the backend's core.
3.1 llm_client.py — SmartRouter
The unified entry point for every LLM call in the system. Houses the SmartRouter, which manages 15 models and multiple API keys across 5 providers (Groq, Cerebras, SambaNova, Gemini, OpenRouter).
Caller (stream_manager / search / merge)
|
v
SmartRouter._pick_slot(group)
|
|-- Score all slots by availability
|-- score = min(RPM_remaining%, TPM_remaining%, RPD_remaining%)
|-- Recently failed slots get extra penalty (30s half-life)
+-- All exhausted -> pick slot with soonest recovery
|
v
litellm.acompletion(model, messages, api_key)
|
|-- Success -> record_success()
+-- Failure -> record_failure() -> retry next slot
+-- All slots in group failed -> fallback chainModels are grouped into 4 tiers: chat (main conversations), merge (merge output), summarizer (summaries/classification), vision (image understanding). Fallback chain: chat->summarizer, merge->chat->summarizer.
When it works: every single LLM call goes through SmartRouter — main chat, summaries, merge, search intent classification, sub-thread title generation, relevance assessment.
3.2 stream_manager.py — SSE stream manager
The most complex service in the entire system. It orchestrates the complete flow for one conversation turn:
User message arrives
|
|-- 1. yield ping (prevent connection timeout)
|-- 2. Save user message to DB
|-- 3. Fetch thread metadata (depth / session_id / is_first_round)
|-- 4. Build context (context_builder) + RAG injection (memory_service)
|-- 5. Detect search intent (classify_search_intent)
| |-- Needs search -> yield search event, use search_service
| +-- No search -> continue
|-- 6. Call LLM streaming (chat_stream)
| |-- yield tokens to frontend in real time
| +-- Strip META block in real time (summary + title)
|-- 7. Save assistant message to DB
|-- 8. yield done
+-- 9. Background tasks (_track lifecycle)
|-- Write summary from META (fallback: merge_summary)
|-- Write title on first main-thread round
+-- Write conversation_memory embedding every N roundsWhen it works: every message the user sends runs through this complete pipeline.
3.3 context_builder.py — Context construction
Assembles the messages array for each LLM call. The core strategy is "deeper = more compressed":
# Token budgets by depth _BUDGETS_BY_DEPTH = [800, 500, 300, 150] # Main thread: summary (if >10 messages) + RAG + last 10 messages # Sub-thread: ancestor summary chain + anchor text + RAG + current thread history
When it works: build_context() is called before every LLM invocation. Main threads and sub-threads follow different construction logic.
3.4 memory_service.py — Dual-track RAG memory
Manages two parallel RAG retrieval tracks:
- attachment_chunks: vector chunks from user-uploaded files. When the user asks "what does paragraph three say?", it precisely recalls that chunk
- conversation_memories: vectorized summaries of each conversation turn. When the user asks "what did we discuss earlier?", it recalls relevant history
When it works: context_builder calls retrieve_rag_context() to search both tracks concurrently when building context; stream_manager calls store_conversation_memory() after each turn to store new memories.
3.5 embedding_service.py — Vector embedding
Built on sentence-transformers with BAAI/bge-m3 (1024 dimensions, Chinese + English support). Singleton pattern — model loads on first call (~570MB), then reuses. All encode operations run in a thread-pool executor via run_in_executor to avoid blocking the asyncio event loop.
When it works: embedding file chunks after upload, embedding conversation memories after each turn, embedding query text for RAG retrieval.
3.6 search_service.py — Search service
Wraps SearXNG calls: sends search requests, filters low-quality results, strips HTML tags. Uses a persistent httpx client for connection pool reuse. 5-second timeout; returns an empty list on failure (caller degrades to plain AI response).
When it works: only in web-search scenarios, called by stream_manager or the search router.
3.7 attachment_processor.py — Attachment processing
The complete file upload pipeline:
Uploaded bytes | |-- Text extraction (Kreuzberg: supports PDF/DOCX/PPTX/30+ formats) | +-- Fallback: direct UTF-8 decode (txt/md/csv/json etc.) |-- Short text (<3000 chars) -> inline as message context, skip RAG |-- Long text -> semantic chunking (cut when cosine similarity < 0.75) |-- Batch embedding (embed_texts processes all chunks in one call) +-- Store in attachment_chunks table
When it works: when a user uploads a file. Raw bytes are released after processing — nothing is written to disk.
§ 05Part 4 — Frontend layer
4.1 Component architecture
app/
|-- page.tsx Home (input + new chat)
|-- chat/[sessionId]/ Main chat page
+-- login/ Login page
components/
|-- MainThread/
| |-- MessageList.tsx Message list (scroll, stream append)
| |-- MessageBubble.tsx Single message (supports text selection + pin)
| +-- InputBar.tsx Bottom input (follows active thread)
|-- SubThread/
| |-- SideColumn.tsx Side panel container (left/right)
| |-- ThreadCard.tsx Individual sub-thread card
| +-- PinRoll.tsx Pin scroll list
|-- Layout/
| |-- ThreadNav.tsx Thread navigation (breadcrumbs)
| +-- ThreadTree.tsx Thread tree view
|-- PinMenu.tsx Floating toolbar after text selection
|-- PinStartDialog.tsx Pin confirmation dialog
|-- MergeOutput.tsx Merge output panel
|-- MergeTreeCanvas.tsx Thread tree visualization for merge
|-- SessionDrawer.tsx History session drawer
|-- MarkdownContent.tsx Markdown renderer
|-- ThemeToggle.tsx Theme toggle
+-- Mobile/
+-- MobileChatLayout.tsx Mobile layout4.2 State management (Zustand)
Three stores, each managing one dimension:
- useThreadStore — thread tree structure, active thread, message content, streaming state. The core store managing all conversation data
- useLangStore — language toggle (zh/en), persisted to localStorage
- useThemeStore — theme toggle (light/dark), persisted to localStorage
4.3 lib/ — Utility library
- api.ts — Backend API call wrapper, unified auth handling, error handling, retry logic
- sse.ts — SSE client, manages streaming connection setup, token reception, error handling, auto-redirect to login on 401
- supabase.ts — Supabase client initialization (browser-side)
- i18n.ts — Internationalization strings (zh/en toggle)
§ 06Part 5 — Data storage layer
Six core tables in Supabase PostgreSQL:
sessions | 1:N v threads (parent_thread_id self-reference -> infinite nesting tree) | 1:N v messages thread_summaries (1:1 with threads) attachment_chunks (session-level file vector chunks) conversation_memories (session-level conversation memory vectors)
- sessions — conversation sessions, linked to user_id
- threads — thread tree; parent_thread_id=null means main thread, otherwise sub-thread. Stores anchor text, position offsets, depth
- messages — message records, role=user/assistant
- thread_summaries — thread summary cache, indexed by token_budget
- attachment_chunks — file vector chunks with embedding column (pgvector)
- conversation_memories — conversation memory vectors, one per round
§ 07Part 6 — External AI services
6.1 LLM provider pool
SmartRouter manages 5 providers, all on free tiers:
- Groq — Primary provider, 5 models (llama-3.3-70b, llama-4-scout, qwen3-32b, gpt-oss-120b, llama-3.1-8b-instant), fast inference
- Cerebras — 2 models (qwen-3-235b, llama3.1-8b), 60K TPM, great for summarizer and long-context tasks
- SambaNova — 2 models (Llama-3.3-70B, Llama-4-Maverick), 100K TPM, high throughput
- Gemini — 2 models (gemini-2.5-flash/flash-lite), 250K TPM, daily quota resets at Pacific Time 00:00
- OpenRouter — 4 :free models (nemotron-super-120b, gpt-oss-120b, llama-3.3-70b, qwen3-next-80b), used as backup slots
All providers are called through LiteLLM's unified format (provider/model_id). SmartRouter scores each slot based on real-time usage and picks the best one.
6.2 bge-m3 embedding model
BAAI/bge-m3 is self-hosted on the backend server (no external API dependency). 1024 dimensions, supports Chinese and English, max input 8192 tokens. ~570MB, downloaded from HuggingFace on first startup and cached.
It handles all vectorization: file chunk embedding, conversation memory embedding, RAG query embedding. Since it's a local model, there's no rate limit — it never becomes a bottleneck.
§ 08Part 7 — CI/CD and operations
- GitHub Actions — Three-stage pipeline: unit tests -> deploy (SSH + Docker Compose + healthcheck + smoke test) -> integration tests
- smoke_test.sh — 9 curl checks: HTTPS reachable, status OK, all components healthy, embedding dimensions correct, auth rejection works
- Integration tests (test_api.py) — Hit the real live API from GitHub runners: health checks, auth verification, session lifecycle, provider verification
- Let's Encrypt — Auto-renewing TLS certificates, mounted into the nginx container
§ 09Part 8 — Full request call chain
When a user sends a message in a sub-thread, every component involved:
Browser InputBar
-> lib/sse.ts (establish SSE connection)
-> Nginx (TLS termination + forwarding)
-> stream.py (route entry)
-> stream_manager.py (orchestration)
|-- Supabase: save user message
|-- context_builder.py: build context
| |-- Supabase: read ancestor chain + summaries + history
| +-- memory_service.py: RAG retrieval
| |-- embedding_service.py: vectorize query (bge-m3)
| +-- Supabase: pgvector similarity search
|-- llm_client.py -> SmartRouter
| -> LiteLLM -> Groq/Cerebras/SambaNova/Gemini/OpenRouter
|-- Supabase: save assistant message
+-- Background tasks:
|-- Supabase: write summary
+-- embedding_service -> Supabase: write conversation memoryOne message, 12 components working together, all completing within 2–5 seconds. What the user sees is an AI reply streaming in character by character.