Within-Thread Context: Sliding Window and Summary Prefix
How a single conversation thread maintains long-term memory without blowing the token window — sliding window, write-time summaries, and two-phase truncation.
Conversation systems face a fundamental tension: a user's conversation history grows without bound, but an LLM's context window is fixed. Passing everything in is impractical: even if the model supports 128K tokens, the cost of each call scales linearly with conversation length, and latency grows with it.
Deeppin solves this within a single thread using two mechanisms: a sliding window to control input size, and a summary prefix to preserve long-term memory.
§ 01 Part 1: Sliding window
Each LLM call passes only the most recent 10 messages (user + assistant combined). The number 10 balances "enough recent context" against token cost — for most conversations, the last 5 exchanges are sufficient for the model to understand the current topic.
# context_builder.py
_THREAD_MSG_LIMIT = 10

msgs_res = await _db(
    lambda: get_supabase().table("messages")
    .select("role, content")
    .eq("thread_id", thread_id)
    .order("created_at", desc=True)  # take most recent N
    .limit(_THREAD_MSG_LIMIT)
    .execute()
)
messages = list(reversed(msgs_res.data or []))  # restore chronological order

The reversal matters: the database returns newest-first, but the LLM needs oldest-first chronological order.
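The same windowing logic can be tried without a database. The following is a minimal in-memory sketch, assuming messages are plain dicts with a sortable `created_at` field; the `recent_window` helper and the sample rows are hypothetical and not part of Deeppin's code.

```python
from typing import Any

THREAD_MSG_LIMIT = 10  # mirrors _THREAD_MSG_LIMIT above


def recent_window(rows: list[dict[str, Any]], limit: int = THREAD_MSG_LIMIT) -> list[dict[str, Any]]:
    """Return the `limit` most recent rows in oldest-first order."""
    # Emulate ORDER BY created_at DESC LIMIT n: newest first, then cut.
    newest_first = sorted(rows, key=lambda r: r["created_at"], reverse=True)[:limit]
    # Reverse so the model reads the window chronologically.
    return list(reversed(newest_first))


# Hypothetical thread with 25 messages, alternating user/assistant.
rows = [
    {"created_at": i, "role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
    for i in range(25)
]
window = recent_window(rows)
print(window[0]["content"], "...", window[-1]["content"])  # msg 15 ... msg 24
```

Skipping the final reversal would hand the model the newest message first, which reads as a conversation running backward.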
§ 02 Part 2: Summary prefix
Truncated history cannot simply be discarded. In a 30-round conversation, rounds 1–20 might contain critical background — the user's role, the specific scenario being discussed, decisions already made. Losing this makes the AI appear amnesiac.
The solution: when total message count exceeds the window, compress the out-of-window history into a summary and prepend it to the context.
total = count_res.count or 0
summary_prefix = []
if total > _THREAD_MSG_LIMIT:
    summary = await _get_or_create_summary(thread_id, budget=800)
    if summary:
        summary_prefix = [{
            "role": "system",
            "content": f"[Conversation summary (before message {_THREAD_MSG_LIMIT + 1})]\n{summary}",
        }]

Final structure sent to the LLM:
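[system]    Conversation summary (before message 11)   ← only when total > 10
[user]      message 10 (oldest in window)
[assistant] ...
...
[user]      message 1 (most recent)

Messages are numbered backward from the most recent one here, so "before message 11" means everything older than the 10-message window.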
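To make the assembly step concrete, here is a minimal sketch that combines the window and the summary prefix into one message list. It assumes the history is already in oldest-first order and that the summary has been fetched elsewhere; `build_context` is a hypothetical helper name, not a function from the article's codebase.

```python
from typing import Optional


def build_context(history: list[dict], summary: Optional[str], limit: int = 10) -> list[dict]:
    """Return [optional summary system message] + the last `limit` messages."""
    window = history[-limit:]  # history is assumed oldest-first
    prefix: list[dict] = []
    # The summary is only needed when some messages fall outside the window.
    if len(history) > limit and summary:
        prefix = [{
            "role": "system",
            "content": f"[Conversation summary (before message {limit + 1})]\n{summary}",
        }]
    return prefix + window


# Hypothetical 12-message thread: two messages fall outside the window.
history = [{"role": "user", "content": f"m{i}"} for i in range(12)]
context = build_context(history, "[Topic: Demo] earlier background")
print(len(context), context[0]["role"])  # 11 system
```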
§ 03 Part 3: Summary generation and caching
Summaries are generated by a lightweight summarizer model, structured by topic:
# Output format
[Topic: Authentication] User is building JWT auth for a FastAPI app. Key facts: using RS256, refresh token stored in httpOnly cookie.
[Topic: Database] Switched from SQLite to Postgres for production.
Summaries are cached in the thread_summaries table (thread_id, summary text, token_budget used). The critical design choice is write-time maintenance: after each round, stream_manager proactively updates the summary rather than leaving it to be computed lazily on the next read.
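The write-time pattern can be illustrated with an in-memory stand-in for the table. This is only a sketch under assumptions: `SummaryStore` and `SummaryRow` are hypothetical names, and the real table lives in the database rather than in a dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SummaryRow:
    thread_id: str
    summary: str
    token_budget: int


class SummaryStore:
    """In-memory stand-in for the thread_summaries table (hypothetical)."""

    def __init__(self) -> None:
        self._rows: dict[str, SummaryRow] = {}

    def save(self, thread_id: str, summary: str, token_budget: int) -> None:
        # Write path: called after every round, overwriting the previous row.
        self._rows[thread_id] = SummaryRow(thread_id, summary, token_budget)

    def get(self, thread_id: str) -> Optional[str]:
        # Read path: a plain lookup, with no summarizer call involved.
        row = self._rows.get(thread_id)
        return row.summary if row else None


store = SummaryStore()
store.save("t1", "[Topic: Database] Switched from SQLite to Postgres.", 800)
print(store.get("t1"))
```

Because the summarizer work happens on the write path, building the next request's context only costs a lookup.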
§ 04 Part 4: Two-phase truncation fallback
Even with a sliding window, individual messages can be enormous — users pasting large code blocks or articles. Two-phase truncation handles this:
Phase 1: replace oversized messages
When a single user/assistant message exceeds 3,000 characters, rather than hard-truncating it, replace it with a placeholder that points the LLM to the relevant passages, which the RAG pipeline injects as system messages:
if len(m["content"]) > _MAX_SINGLE_MSG_CHARS:
    char_len = len(m["content"])  # original length, reported inside the placeholder
    placeholder = (
        f"[User provided long text ({char_len} chars), chunked and vector-indexed. "
        f"Relevant passages injected via system context above. "
        f"Text beginning for reference: {m['content'][:200]}…]"
    )

Phase 2: drop oldest conversation messages
If total characters still exceed 18,000 (~7,200 tokens) after phase 1, drop the oldest user/assistant messages one by one. System messages (summaries, anchors, RAG) are never dropped — they are the skeleton of the LLM's situational understanding.
The two-phase logic: first preserve semantics by redirecting to RAG, only then physically delete, and always preserve the most recent conversation.
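Both phases can be sketched as one function. The thresholds come from the text above; the function name `truncate_context`, the rule of keeping at least one conversation message, and the sample data are assumptions made for illustration, not the article's actual implementation.

```python
MAX_SINGLE_MSG_CHARS = 3_000  # phase 1 threshold, from the text
MAX_TOTAL_CHARS = 18_000      # phase 2 threshold, from the text


def truncate_context(messages: list[dict]) -> list[dict]:
    """Apply both phases to an oldest-first message list and return a new list."""
    # Phase 1: swap oversized user/assistant messages for a short placeholder.
    result = []
    for m in messages:
        content = m["content"]
        if m["role"] != "system" and len(content) > MAX_SINGLE_MSG_CHARS:
            content = (
                f"[User provided long text ({len(content)} chars), chunked and vector-indexed. "
                f"Relevant passages injected via system context above. "
                f"Text beginning for reference: {content[:200]}…]"
            )
        result.append({**m, "content": content})

    # Phase 2: drop the oldest non-system messages until the total fits.
    while sum(len(m["content"]) for m in result) > MAX_TOTAL_CHARS:
        droppable = [i for i, m in enumerate(result) if m["role"] != "system"]
        if len(droppable) <= 1:
            break  # assumption: always keep the most recent conversation message
        del result[droppable[0]]
    return result


# Hypothetical context: one summary, one pasted article, ten long replies.
messages = [
    {"role": "system", "content": "s" * 100},
    {"role": "user", "content": "a" * 5_000},
] + [
    {"role": "assistant" if i % 2 == 0 else "user", "content": "x" * 2_500}
    for i in range(10)
]
trimmed = truncate_context(messages)
print(len(messages), "->", len(trimmed))  # 12 -> 8
```

In this run the pasted article first shrinks to a placeholder, then the placeholder and the three oldest replies are dropped, while the system message survives untouched.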
§ 05 Part 5: Deeppin's topic-based summary mechanism
Summaries are not free-form paragraphs — Deeppin enforces topic-grouped formatting so the LLM can quickly locate relevant information when reading a summary. This design runs through three stages: generation, incremental update, and inline injection.
Topic format
All summaries use a unified [Topic: name] prefix grouping. A typical summary looks like this:
[Topic: Authentication] User is building FastAPI JWT auth, using RS256, refresh token in httpOnly cookie.
[Topic: Database] Switched from SQLite to Postgres for production.
[Topic: Deployment] Oracle ARM free tier, Docker Compose, Nginx reverse proxy.
Benefits: highly structured yet low-overhead — no JSON parsing needed, natively understood by LLMs, and human-readable.
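Although the LLM reads the format directly, it is also regular enough to split with one regular expression when tooling needs per-topic access. The parser below is a hypothetical sketch (the name `parse_topics` and the choice to concatenate repeated labels are assumptions), not something the article describes Deeppin as doing.

```python
import re

# Matches "[Topic: name]" followed by text up to the next "[Topic:" or end of input.
TOPIC_RE = re.compile(r"\[Topic:\s*([^\]]+)\]\s*(.*?)(?=\[Topic:|\Z)", re.DOTALL)


def parse_topics(summary: str) -> dict[str, str]:
    """Split a topic-grouped summary into {topic name: facts}."""
    topics: dict[str, str] = {}
    for name, body in TOPIC_RE.findall(summary):
        name, body = name.strip(), body.strip()
        # Repeated labels are concatenated rather than overwritten.
        topics[name] = f"{topics[name]} {body}" if name in topics else body
    return topics


sample = (
    "[Topic: Authentication] User is building FastAPI JWT auth, using RS256.\n"
    "[Topic: Database] Switched from SQLite to Postgres for production."
)
print(list(parse_topics(sample)))  # ['Authentication', 'Database']
```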
Generation path 1: META inline (zero extra calls)
The most efficient path: have the chat model generate the summary alongside its answer, eliminating a separate summarizer call. This is done by injecting a META directive at the end of the context:
# META directive injected by chat_stream()
full_messages.append({
    "role": "system",
    "content": (
        # Trailing spaces keep the concatenated sentences from running together.
        "Internal summary rules (for the JSON below only, never in the main answer): "
        "Group by topic, each line: [Topic: name] + key facts/conclusions/details; "
        "unlimited topics, strictly reuse existing topic labels; "
        f"language matches user; total ≤ {summary_budget} chars."
        "\n\n"
        "Important: the main answer must use natural language, "
        "never use [Topic:] format in the answer.\n\n"
        "After completing your answer, append this JSON block:\n"
        f"{META_SENTINEL}\n"
        f'{{{json_template}}}'
    ),
})

Output structure: main answer → META_SENTINEL delimiter → JSON (containing summary and optional title). stream_manager intercepts the META portion during streaming, never pushes it to the frontend, parses it, and writes directly to the thread_summaries table.
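Intercepting the META portion during streaming has one subtle case: the sentinel can arrive split across two chunks. The sketch below shows one way to handle that by holding back any tail that could still be the start of the sentinel. The sentinel value, the class name `MetaInterceptor`, and the JSON keys are assumptions; the article does not show how stream_manager implements this.

```python
import json
from typing import Optional

META_SENTINEL = "<<<META>>>"  # hypothetical value; the real sentinel is not shown


class MetaInterceptor:
    """Forward answer text and capture everything after the sentinel."""

    def __init__(self, sentinel: str = META_SENTINEL) -> None:
        self.sentinel = sentinel
        self._pending = ""   # tail that might still turn out to be the sentinel
        self._meta = ""      # raw text seen after the sentinel
        self._in_meta = False

    def feed(self, chunk: str) -> str:
        """Consume one streamed chunk; return the part that is safe to show."""
        if self._in_meta:
            self._meta += chunk
            return ""
        buf = self._pending + chunk
        pos = buf.find(self.sentinel)
        if pos != -1:
            self._in_meta, self._pending = True, ""
            self._meta = buf[pos + len(self.sentinel):]
            return buf[:pos]
        # Hold back the longest suffix that is a prefix of the sentinel.
        keep = 0
        for k in range(min(len(buf), len(self.sentinel) - 1), 0, -1):
            if self.sentinel.startswith(buf[-k:]):
                keep = k
                break
        self._pending = buf[len(buf) - keep:]
        return buf[:len(buf) - keep]

    def finish(self) -> tuple[str, Optional[dict]]:
        """Flush held-back text and parse the captured JSON, if any."""
        if not self._in_meta:
            tail, self._pending = self._pending, ""
            return tail, None
        try:
            meta = json.loads(self._meta.strip())
        except json.JSONDecodeError:
            return "", None  # caller would take the fallback summarizer path
        if not isinstance(meta, dict):
            return "", None
        return "", meta


chunks = ["Use RS256 for signing.", "\n<<<ME", 'TA>>>\n{"summary": "[Topic: Auth] RS256"', ', "title": "JWT"}']
interceptor = MetaInterceptor()
visible = "".join(interceptor.feed(c) for c in chunks)
tail, meta = interceptor.finish()
print(repr(visible + tail), meta)
```

A `None` result from `finish()` corresponds to the "META failed" case, where a separate summarizer call takes over.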
Generation path 2: standalone summarize() (first-time generation)
When META parsing fails (the model didn't output it, or the format was wrong) and the thread has no summary yet, or during historical data migration, fall back to a standalone summarizer call:
async def summarize(text: str, max_tokens: int) -> str:
    return await _summarizer_call(
        messages=[{
            "role": "user",
            "content": (
                f"Compress the following into a summary of no more than "
                f"{max_tokens} tokens, grouped by topic. Format: "
                f"[Topic: name] key facts and specific details. "
                f"Preserve core information, match language of source:\n\n{text}"
            ),
        }],
        max_tokens=max_tokens,
    )

Generation path 3: merge_summary() (incremental update)
The existing summary and the newest conversation round are merged incrementally, which avoids regenerating the summary from the full history each time:
async def merge_summary(existing_summary: str, new_exchange: str, max_tokens: int) -> str:
    return await _summarizer_call(
        messages=[{
            "role": "user",
            "content": (
                f"Below is an existing summary of a conversation, plus a new exchange.\n"
                f"Merge the new content into the summary, grouped by topic. "
                f"Format: [Topic: name] key facts and specific details. "
                f"Strictly reuse existing topic labels, do not rename. "
                f"Keep within {max_tokens} tokens, match source language, "
                f"output only the summary:\n\n"
                f"[Existing summary]\n{existing_summary}\n\n"
                f"[New exchange]\n{new_exchange}"
            ),
        }],
        max_tokens=max_tokens,
    )

Merge is cheaper than full summarize: input is an already-compressed summary + one round, not the entire history. The tradeoff is error accumulation: after many rounds of merging, early topic details gradually blur. But this is acceptable for Deeppin: the user's current focus is always in the most recent rounds; early topics only need key conclusions preserved.
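The control flow of an incremental update can be exercised without an LLM. In the sketch below, `summarize` and `merge_summary` are replaced by trivial stand-ins that only tag and truncate their input (a crude character cut, not real token counting), and `update_summary` is a hypothetical driver name; the point is only that each update sees the previous summary plus one exchange.

```python
import asyncio
from typing import Optional


# Stand-ins for the LLM-backed functions; hypothetical simplifications.
async def summarize(text: str, max_tokens: int) -> str:
    return f"[Topic: General] {text}"[:max_tokens]


async def merge_summary(existing_summary: str, new_exchange: str, max_tokens: int) -> str:
    return f"{existing_summary}\n[Topic: General] {new_exchange}"[:max_tokens]


async def update_summary(existing: Optional[str], new_exchange: str, budget: int) -> str:
    """Merge into the existing summary, or create one if none exists yet."""
    if existing:
        return await merge_summary(existing, new_exchange, budget)
    return await summarize(new_exchange, budget)


async def main() -> str:
    summary: Optional[str] = None
    for exchange in ["user asks about JWT", "user switches to Postgres"]:
        # Each update receives only the previous summary and one new exchange.
        summary = await update_summary(summary, exchange, budget=800)
    return summary or ""


final_summary = asyncio.run(main())
print(final_summary)
```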
Write timing and priority
Summary updates happen in a background task after each conversation round, never blocking SSE stream delivery:
# stream_manager.py: background task priority
# 1. META parsed successfully → write directly (zero extra LLM calls)
if summary:
    asyncio.create_task(_save_summary(thread_id, summary, summary_budget))
else:
    # 2. META failed → fall back to merge_summary
    asyncio.create_task(
        _fallback_update_summary(thread_id, depth, user_content, full_content)
    )

# _fallback_update_summary internal logic:
#   has existing summary → merge_summary(existing, new_exchange, budget)
#   no existing summary  → summarize(new_exchange, budget)  (first-time)

Token budget by nesting depth
Summary token budget decreases with nesting depth — deeper threads get less summary space, ensuring total context stays bounded:
_BUDGETS_BY_DEPTH = [800, 500, 300, 150]

def _budget_for_depth(depth_from_root: int) -> int:
    return _BUDGETS_BY_DEPTH[min(depth_from_root, len(_BUDGETS_BY_DEPTH) - 1)]

Main thread: 800 tokens, first sub-thread: 500, second level: 300, third and deeper: 150. This gradient comes from an observation: deeper sub-threads are more narrowly focused with higher information density, so they need less summary space.