Article · architecture

Deeppin System Components: What Every Part Does and How They Work Together

From Nginx to Supabase, from SmartRouter to bge-m3 — a component-by-component breakdown of what each part is, why it exists, when it's invoked, and how data flows through it.

2026-04-1642 min readarchitecturecomponentsoverview

Deeppin uses over a dozen components working together to deliver the "pin-and-explore" deep thinking experience. This article introduces each component's responsibility, when it's invoked, and how it relates to other parts. We start with the big picture, then go layer by layer.

§ 01Architecture overview

Fig. 1·system-connectivity

User Browser
  |
  |-- HTTPS --> Vercel (Next.js frontend)
  |              |-- components/    UI rendering
  |              |-- stores/        Zustand state
  |              +-- lib/sse.ts     SSE client
  |
  +-- HTTPS --> Oracle Cloud (Docker Compose)
                 |-- Nginx          Reverse proxy + TLS
                 |-- FastAPI         Backend main process
                 |    |-- routers/       9 route modules
                 |    |-- services/      7 service modules
                 |    +-- db/            Supabase connector
                 +-- SearXNG         Search engine

Below, we walk through each component in the order data flows through them — from the outside in.

§ 02Part 1 — Infrastructure layer

1.1 Nginx — Reverse proxy

Nginx is the first door every request passes through to reach the backend. It handles TLS termination (Let's Encrypt certificates), HTTP-to-HTTPS redirection, and forwarding requests to FastAPI.

The most critical configuration for Deeppin is the three SSE-related directives:

proxy_buffering off;     # Disable buffering — otherwise SSE streams batch up
proxy_cache off;         # Disable caching
proxy_read_timeout 300s; # LLM generation can be slow

When it works: every single request reaching the backend passes through Nginx. It's the always-on gateway.

1.2 Docker Compose — Container orchestration

Three services (backend, searxng, nginx) are managed by Docker Compose. Startup order is guaranteed through a healthcheck chain:

backend + searxng start in parallel
        |
        v
backend healthcheck passes
(/health aggregates searxng + supabase + embedding + groq checks)
        |
        v
  nginx starts, begins accepting traffic

The backend healthcheck runs every 15 seconds with a 45-second start_period (to give the embedding model time to load). Nginx sets depends_on: backend: condition: service_healthy, ensuring users never hit a half-initialized service.

1.3 Supabase — Database + Auth

Supabase provides two core capabilities:

PostgreSQL database: stores sessions, threads, messages, thread_summaries, attachment_chunks, and conversation_memories — six tables
Auth: user signup/login, JWT issuance and verification, RLS (Row Level Security) for per-user data isolation

The backend accesses Supabase with two keys: service_role_key (admin privileges, for JWT verification and backend operations) and anon_key (user-scoped, works with RLS). The frontend only holds the anon_key.

When it works: nearly every API request reads from or writes to Supabase — creating sessions, saving messages, reading history, storing and searching vectors.

1.4 SearXNG — Meta search engine

SearXNG is a self-hosted search engine aggregator. It sends the user's query to Google, Bing, DuckDuckGo, and other engines simultaneously, then deduplicates and returns the results.

Deeppin uses it for the "web search" feature: when a question needs real-time information (news, prices, weather, etc.), the backend searches via SearXNG first, then feeds the results to an LLM for synthesis.

# Backend calls via JSON API
GET http://searxng:8080/search?q=query&format=json

When it works: only when classify_search_intent() determines the user's question needs live information. Regular conversations never trigger it.

§ 03Part 2 — Backend route layer (routers/)

FastAPI's 9 router modules each handle one category of API endpoints:

2.1 health.py — Health checks

Exposes /health (aggregated dependency status) and /health/providers (individually verifies every LLM provider+key combination). Docker healthchecks and CI smoke tests both depend on it.

When called: Docker automatically calls /health every 15 seconds; the smoke test calls it post-deployment; also useful for manual system status checks.

2.2 sessions.py — Session management

CRUD operations: create session, list a user's sessions, get session details (with thread tree), delete session. Also supports bulk-fetching all messages under a session (used for merge output).

When called: when the user creates a new conversation, opens a past one, or deletes one.

2.3 threads.py — Thread management

Creates sub-threads (pins), fetches thread details and message history, generates follow-up suggestions. Creating a sub-thread asynchronously triggers LLM-generated title and suggested questions.

When called: when a user selects text and pins it to create a sub-thread; when opening a sub-thread to view history.

2.4 stream.py — SSE streaming endpoint

The core endpoint: POST /api/threads/:id/chat. Receives the user message and returns an SSE stream. It bridges the frontend to stream_manager.

When called: every time the user sends a message, whether in the main thread or a sub-thread.

2.5 search.py — Web search

An SSE streaming endpoint. First queries SearXNG, then has the LLM answer the user's question based on search results. Supports both automatic intent detection and manual trigger.

When called: auto-triggered when a question is classified as needing real-time information, or when the user explicitly clicks the search button.

2.6 merge.py — Merge output

Collects the main thread and all sub-thread conversations, then has the LLM merge them into structured output (free summary / bullet points / structured analysis / raw transcript). Streamed.

When called: when the user clicks the "Merge Output" button.

2.7 attachments.py — File upload

Receives user-uploaded files and hands them to attachment_processor (text extraction → chunking → embedding → DB storage).

When called: when the user uploads a file (PDF, Word, code files, etc.) during a conversation.

2.8 relevance.py — Relevance assessment

Before merge output, the LLM evaluates each sub-thread's relevance to the main thread, deciding which sub-threads should be selected by default for merging.

When called: automatically invoked once when the user opens the merge panel, before rendering.

2.9 users.py — User configuration

Gets and updates user metadata (preferences, settings). Built on Supabase Auth's user_metadata field.

When called: when the user modifies their personal settings.

§ 04Part 3 — Backend service layer (services/)

The route layer handles requests and responses; the service layer handles the actual business logic. The 7 service modules are the backend's core.

3.1 llm_client.py — SmartRouter

The unified entry point for every LLM call in the system. Houses the SmartRouter, which manages 15 models and multiple API keys across 5 providers (Groq, Cerebras, SambaNova, Gemini, OpenRouter).

Caller (stream_manager / search / merge)
        |
        v
  SmartRouter._pick_slot(group)
        |
        |-- Score all slots by availability
        |-- score = min(RPM_remaining%, TPM_remaining%, RPD_remaining%)
        |-- Recently failed slots get extra penalty (30s half-life)
        +-- All exhausted -> pick slot with soonest recovery
        |
        v
  litellm.acompletion(model, messages, api_key)
        |
        |-- Success -> record_success()
        +-- Failure -> record_failure() -> retry next slot
                        +-- All slots in group failed -> fallback chain

Models are grouped into 4 tiers: chat (main conversations), merge (merge output), summarizer (summaries/classification), vision (image understanding). Fallback chain: chat->summarizer, merge->chat->summarizer.

When it works: every single LLM call goes through SmartRouter — main chat, summaries, merge, search intent classification, sub-thread title generation, relevance assessment.

3.2 stream_manager.py — SSE stream manager

The most complex service in the entire system. It orchestrates the complete flow for one conversation turn:

User message arrives
  |
  |-- 1. yield ping (prevent connection timeout)
  |-- 2. Save user message to DB
  |-- 3. Fetch thread metadata (depth / session_id / is_first_round)
  |-- 4. Build context (context_builder) + RAG injection (memory_service)
  |-- 5. Detect search intent (classify_search_intent)
  |     |-- Needs search -> yield search event, use search_service
  |     +-- No search -> continue
  |-- 6. Call LLM streaming (chat_stream)
  |     |-- yield tokens to frontend in real time
  |     +-- Strip META block in real time (summary + title)
  |-- 7. Save assistant message to DB
  |-- 8. yield done
  +-- 9. Background tasks (_track lifecycle)
        |-- Write summary from META (fallback: merge_summary)
        |-- Write title on first main-thread round
        +-- Write conversation_memory embedding every N rounds

When it works: every message the user sends runs through this complete pipeline.

3.3 context_builder.py — Context construction

Assembles the messages array for each LLM call. The core strategy is "deeper = more compressed":

# Token budgets by depth
_BUDGETS_BY_DEPTH = [800, 500, 300, 150]

# Main thread: summary (if >10 messages) + RAG + last 10 messages
# Sub-thread: ancestor summary chain + anchor text + RAG + current thread history

When it works: build_context() is called before every LLM invocation. Main threads and sub-threads follow different construction logic.

3.4 memory_service.py — Dual-track RAG memory

Manages two parallel RAG retrieval tracks:

attachment_chunks: vector chunks from user-uploaded files. When the user asks "what does paragraph three say?", it precisely recalls that chunk
conversation_memories: vectorized summaries of each conversation turn. When the user asks "what did we discuss earlier?", it recalls relevant history

When it works: context_builder calls retrieve_rag_context() to search both tracks concurrently when building context; stream_manager calls store_conversation_memory() after each turn to store new memories.

3.5 embedding_service.py — Vector embedding

Built on sentence-transformers with BAAI/bge-m3 (1024 dimensions, Chinese + English support). Singleton pattern — model loads on first call (~570MB), then reuses. All encode operations run in a thread-pool executor via run_in_executor to avoid blocking the asyncio event loop.

When it works: embedding file chunks after upload, embedding conversation memories after each turn, embedding query text for RAG retrieval.

3.6 search_service.py — Search service

Wraps SearXNG calls: sends search requests, filters low-quality results, strips HTML tags. Uses a persistent httpx client for connection pool reuse. 5-second timeout; returns an empty list on failure (caller degrades to plain AI response).

When it works: only in web-search scenarios, called by stream_manager or the search router.

3.7 attachment_processor.py — Attachment processing

The complete file upload pipeline:

Uploaded bytes
  |
  |-- Text extraction (Kreuzberg: supports PDF/DOCX/PPTX/30+ formats)
  |    +-- Fallback: direct UTF-8 decode (txt/md/csv/json etc.)
  |-- Short text (<3000 chars) -> inline as message context, skip RAG
  |-- Long text -> semantic chunking (cut when cosine similarity < 0.75)
  |-- Batch embedding (embed_texts processes all chunks in one call)
  +-- Store in attachment_chunks table

When it works: when a user uploads a file. Raw bytes are released after processing — nothing is written to disk.

§ 05Part 4 — Frontend layer

4.1 Component architecture

app/
  |-- page.tsx              Home (input + new chat)
  |-- chat/[sessionId]/     Main chat page
  +-- login/                Login page

components/
  |-- MainThread/
  |    |-- MessageList.tsx    Message list (scroll, stream append)
  |    |-- MessageBubble.tsx  Single message (supports text selection + pin)
  |    +-- InputBar.tsx       Bottom input (follows active thread)
  |-- SubThread/
  |    |-- SideColumn.tsx     Side panel container (left/right)
  |    |-- ThreadCard.tsx     Individual sub-thread card
  |    +-- PinRoll.tsx        Pin scroll list
  |-- Layout/
  |    |-- ThreadNav.tsx      Thread navigation (breadcrumbs)
  |    +-- ThreadTree.tsx     Thread tree view
  |-- PinMenu.tsx             Floating toolbar after text selection
  |-- PinStartDialog.tsx      Pin confirmation dialog
  |-- MergeOutput.tsx         Merge output panel
  |-- MergeTreeCanvas.tsx     Thread tree visualization for merge
  |-- SessionDrawer.tsx       History session drawer
  |-- MarkdownContent.tsx     Markdown renderer
  |-- ThemeToggle.tsx         Theme toggle
  +-- Mobile/
       +-- MobileChatLayout.tsx  Mobile layout

4.2 State management (Zustand)

Three stores, each managing one dimension:

useThreadStore — thread tree structure, active thread, message content, streaming state. The core store managing all conversation data
useLangStore — language toggle (zh/en), persisted to localStorage
useThemeStore — theme toggle (light/dark), persisted to localStorage

4.3 lib/ — Utility library

api.ts — Backend API call wrapper, unified auth handling, error handling, retry logic
sse.ts — SSE client, manages streaming connection setup, token reception, error handling, auto-redirect to login on 401
supabase.ts — Supabase client initialization (browser-side)
i18n.ts — Internationalization strings (zh/en toggle)

§ 06Part 5 — Data storage layer

Six core tables in Supabase PostgreSQL:

sessions
  | 1:N
  v
threads (parent_thread_id self-reference -> infinite nesting tree)
  | 1:N
  v
messages

thread_summaries (1:1 with threads)

attachment_chunks (session-level file vector chunks)

conversation_memories (session-level conversation memory vectors)

sessions — conversation sessions, linked to user_id
threads — thread tree; parent_thread_id=null means main thread, otherwise sub-thread. Stores anchor text, position offsets, depth
messages — message records, role=user/assistant
thread_summaries — thread summary cache, indexed by token_budget
attachment_chunks — file vector chunks with embedding column (pgvector)
conversation_memories — conversation memory vectors, one per round

§ 07Part 6 — External AI services

6.1 LLM provider pool

SmartRouter manages 5 providers, all on free tiers:

Groq — Primary provider, 5 models (llama-3.3-70b, llama-4-scout, qwen3-32b, gpt-oss-120b, llama-3.1-8b-instant), fast inference
Cerebras — 2 models (qwen-3-235b, llama3.1-8b), 60K TPM, great for summarizer and long-context tasks
SambaNova — 2 models (Llama-3.3-70B, Llama-4-Maverick), 100K TPM, high throughput
Gemini — 2 models (gemini-2.5-flash/flash-lite), 250K TPM, daily quota resets at Pacific Time 00:00
OpenRouter — 4 :free models (nemotron-super-120b, gpt-oss-120b, llama-3.3-70b, qwen3-next-80b), used as backup slots

All providers are called through LiteLLM's unified format (provider/model_id). SmartRouter scores each slot based on real-time usage and picks the best one.

6.2 bge-m3 embedding model

BAAI/bge-m3 is self-hosted on the backend server (no external API dependency). 1024 dimensions, supports Chinese and English, max input 8192 tokens. ~570MB, downloaded from HuggingFace on first startup and cached.

It handles all vectorization: file chunk embedding, conversation memory embedding, RAG query embedding. Since it's a local model, there's no rate limit — it never becomes a bottleneck.

§ 08Part 7 — CI/CD and operations

GitHub Actions — Three-stage pipeline: unit tests -> deploy (SSH + Docker Compose + healthcheck + smoke test) -> integration tests
smoke_test.sh — 9 curl checks: HTTPS reachable, status OK, all components healthy, embedding dimensions correct, auth rejection works
Integration tests (test_api.py) — Hit the real live API from GitHub runners: health checks, auth verification, session lifecycle, provider verification
Let's Encrypt — Auto-renewing TLS certificates, mounted into the nginx container

§ 09Part 8 — Full request call chain

When a user sends a message in a sub-thread, every component involved:

Browser InputBar
  -> lib/sse.ts (establish SSE connection)
    -> Nginx (TLS termination + forwarding)
      -> stream.py (route entry)
        -> stream_manager.py (orchestration)
          |-- Supabase: save user message
          |-- context_builder.py: build context
          |    |-- Supabase: read ancestor chain + summaries + history
          |    +-- memory_service.py: RAG retrieval
          |         |-- embedding_service.py: vectorize query (bge-m3)
          |         +-- Supabase: pgvector similarity search
          |-- llm_client.py -> SmartRouter
          |    -> LiteLLM -> Groq/Cerebras/SambaNova/Gemini/OpenRouter
          |-- Supabase: save assistant message
          +-- Background tasks:
               |-- Supabase: write summary
               +-- embedding_service -> Supabase: write conversation memory

One message, 12 components working together, all completing within 2–5 seconds. What the user sees is an AI reply streaming in character by character.