Within-Thread Context: Sliding Window and Summary Prefix

How a single conversation thread maintains long-term memory without blowing the token window — sliding window, write-time summaries, and two-phase truncation.

2026-04-15 · 25 min read · context, memory, architecture

Conversation systems face a fundamental tension: a user's conversation history grows without bound, but an LLM's context window is fixed. Passing everything in is impractical: even if the model supports 128K tokens, the input cost of each call scales linearly with the length of the history, and latency grows with it.

Deeppin solves this within a single thread using two mechanisms: a sliding window to control input size, and a summary prefix to preserve long-term memory.

§ 01 Part 1 — Sliding window

Each LLM call passes only the most recent 10 messages (user + assistant combined). The number 10 balances "enough recent context" against token cost — for most conversations, the last 5 exchanges are sufficient for the model to understand the current topic.

# context_builder.py
_THREAD_MSG_LIMIT = 10

msgs_res = await _db(
    lambda: get_supabase().table("messages")
    .select("role, content")
    .eq("thread_id", thread_id)
    .order("created_at", desc=True)   # take most recent N
    .limit(_THREAD_MSG_LIMIT)
    .execute()
)
messages = list(reversed(msgs_res.data or []))  # restore chronological order

The reversal matters: the database returns newest-first, but the LLM needs oldest-first chronological order.
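
The same windowing step can be tried without a database. The following is a minimal sketch, assuming the rows are already sorted newest-first as the query above returns them; `window_chronological` and the sample rows are illustrative names, not part of Deeppin's code.

```python
def window_chronological(rows_newest_first, limit=10):
    """Keep the `limit` newest rows and return them oldest-first."""
    newest = rows_newest_first[:limit]   # same effect as ORDER BY ... DESC LIMIT n
    return list(reversed(newest))        # restore chronological order for the LLM


# Twelve fake messages, newest first: m12, m11, ..., m1.
rows = [{"role": "user", "content": f"m{i}"} for i in range(12, 0, -1)]
window = window_chronological(rows)
```

With twelve stored messages, the window holds `m3` through `m12`, and the two oldest messages fall outside it.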

§ 02 Part 2 — Summary prefix

Truncated history cannot simply be discarded. In a 30-round conversation, rounds 1–20 might contain critical background — the user's role, the specific scenario being discussed, decisions already made. Losing this makes the AI appear amnesiac.

The solution: when total message count exceeds the window, compress the out-of-window history into a summary and prepend it to the context.

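# count_res is assumed to come from a separate count query on this thread's messages (not shown here)
total = count_res.count or 0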

summary_prefix = []
if total > _THREAD_MSG_LIMIT:
    summary = await _get_or_create_summary(thread_id, budget=800)
    if summary:
        summary_prefix = [{
            "role": "system",
            "content": f"[Conversation summary (before message {_THREAD_MSG_LIMIT + 1})]\n{summary}",
        }]

Final structure sent to the LLM (messages are numbered counting back from the newest, so message 1 is the latest):

[system]  Conversation summary (before message 11)  ← only when total > 10
[user]    message 10 (oldest in window)
[assistant] ...
...
[user]    message 1 (most recent)
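
The assembly of that structure might look like the sketch below. It assumes the window and the cached summary have already been loaded; `build_context` is a hypothetical helper, and only the prefix format is taken from the snippet above.

```python
_THREAD_MSG_LIMIT = 10


def build_context(total, window, summary):
    """Prepend the summary as a system message only when history was cut off."""
    prefix = []
    if total > _THREAD_MSG_LIMIT and summary:
        prefix = [{
            "role": "system",
            "content": (
                f"[Conversation summary (before message {_THREAD_MSG_LIMIT + 1})]\n"
                f"{summary}"
            ),
        }]
    return prefix + window


window = [{"role": "user", "content": "latest question"}]
long_thread = build_context(12, window, "[Topic: Database] Postgres in production.")
short_thread = build_context(8, window, "[Topic: Database] Postgres in production.")
```

A short thread gets no summary message at all, so the prefix costs nothing until the window actually truncates history.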

§ 03 Part 3 — Summary generation and caching

Summaries are generated by a lightweight summarizer model, structured by topic:

# Output format
[Topic: Authentication] User is building JWT auth for a FastAPI app.
  Key facts: using RS256, refresh token stored in httpOnly cookie.
[Topic: Database] Switched from SQLite to Postgres for production.

Summaries are cached in the thread_summaries table (thread_id, summary text, token_budget used). The critical design choice is write-time maintenance: after each round, stream_manager proactively updates the summary instead of computing it lazily on the next read.

Write-time maintenance costs at most one extra summarizer call per round (none when the inline META path described in Part 5 succeeds), but it adds no latency to the read path and avoids race conditions there. Read-time computation risks stale summaries and concurrent write conflicts in multi-thread scenarios.

§ 04 Part 4 — Two-phase truncation fallback

Even with a sliding window, individual messages can be enormous — users pasting large code blocks or articles. Two-phase truncation handles this:

Phase 1: replace oversized messages

When a single user/assistant message exceeds 3,000 characters, rather than hard-truncating it, replace it with a placeholder that directs the LLM to the RAG system messages:

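_MAX_SINGLE_MSG_CHARS = 3000  # per-message threshold described above

char_len = len(m["content"])  # original length, reported inside the placeholder
if char_len > _MAX_SINGLE_MSG_CHARS:
    placeholder = (
        f"[User provided long text ({char_len} chars), chunked and vector-indexed. "
        f"Relevant passages injected via system context above. "
        f"Text beginning for reference: {m['content'][:200]}…]"
    )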

Phase 2: drop oldest conversation messages

If total characters still exceed 18,000 (~7,200 tokens, which implies an estimate of about 2.5 characters per token) after phase 1, drop the oldest user/assistant messages one by one. System messages (summaries, anchors, RAG) are never dropped: they are the skeleton of the LLM's situational understanding.

The two phases run in a strict order: first preserve semantics by redirecting the model to the RAG passages, then physically delete messages only if that is not enough, and in every case keep the most recent conversation.
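
Both phases can be sketched in one function. This is an illustration under stated assumptions rather than Deeppin's implementation: `_MAX_TOTAL_CHARS` and `two_phase_truncate` are hypothetical names, the placeholder text is shortened, and the rule "keep at least the newest conversation message" is one reading of "always preserve the most recent conversation".

```python
_MAX_SINGLE_MSG_CHARS = 3000   # per-message threshold from the text
_MAX_TOTAL_CHARS = 18000       # hypothetical name for the 18,000-character budget


def two_phase_truncate(messages):
    """Phase 1: replace oversized messages. Phase 2: drop oldest non-system ones."""
    out = []
    for m in messages:
        content = m["content"]
        if m["role"] != "system" and len(content) > _MAX_SINGLE_MSG_CHARS:
            placeholder = (
                f"[User provided long text ({len(content)} chars), chunked and "
                f"vector-indexed. Text beginning for reference: {content[:200]}…]"
            )
            m = {**m, "content": placeholder}   # copy, leave the input untouched
        out.append(m)

    def total_chars():
        return sum(len(m["content"]) for m in out)

    while total_chars() > _MAX_TOTAL_CHARS:
        droppable = [i for i, m in enumerate(out) if m["role"] != "system"]
        if len(droppable) <= 1:
            break  # keep the most recent conversation message no matter what
        del out[droppable[0]]  # the oldest user/assistant message goes first
    return out


history = [{"role": "system", "content": "s" * 100}]
history += [
    {"role": "user" if i % 2 == 0 else "assistant", "content": str(i) * 2500}
    for i in range(8)
]
history.append({"role": "user", "content": "x" * 5000})
trimmed = two_phase_truncate(history)
```

In this example the 5,000-character message shrinks to a placeholder in phase 1, and phase 2 then drops only the single oldest exchange message to get under the budget. The system message survives both phases.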

§ 05 Part 5 — Deeppin's topic-based summary mechanism

Summaries are not free-form paragraphs — Deeppin enforces topic-grouped formatting so the LLM can quickly locate relevant information when reading a summary. This design runs through three stages: generation, incremental update, and inline injection.

Topic format

All summaries use a unified [Topic: name] prefix grouping. A typical summary looks like this:

[Topic: Authentication] User is building FastAPI JWT auth, using RS256, refresh token in httpOnly cookie.
[Topic: Database] Switched from SQLite to Postgres for production.
[Topic: Deployment] Oracle ARM free tier, Docker Compose, Nginx reverse proxy.

Benefits: highly structured yet low-overhead — no JSON parsing needed, natively understood by LLMs, and human-readable.
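
Although the format is meant to be read directly by the model, it is also easy to process in code when needed, for example to check which topic labels exist. The sketch below is an assumption about how such a check could look; `parse_topics` is a hypothetical helper and the regular expression is one possible reading of the format.

```python
import re

_TOPIC_LINE = re.compile(r"^\[Topic:\s*(?P<name>[^\]]+)\]\s*(?P<body>.*)$")


def parse_topics(summary):
    """Map each topic label to its text; indented lines continue the last topic."""
    topics = {}
    current = None
    for raw in summary.splitlines():
        line = raw.strip()
        match = _TOPIC_LINE.match(line)
        if match:
            current = match.group("name").strip()
            merged = topics.get(current, "") + " " + match.group("body")
            topics[current] = merged.strip()
        elif current and line:
            topics[current] = (topics[current] + " " + line).strip()
    return topics


summary = (
    "[Topic: Authentication] User is building FastAPI JWT auth, using RS256.\n"
    "  Key facts: refresh token in httpOnly cookie.\n"
    "[Topic: Database] Switched from SQLite to Postgres for production.\n"
)
topics = parse_topics(summary)
updated = summary + "[Topic: JWT Auth] Added token rotation.\n"
new_labels = set(parse_topics(updated)) - set(topics)
```

Comparing label sets before and after an update, as `new_labels` does, is one cheap way to notice when a round introduced a renamed or near-duplicate topic.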

Generation path 1: META inline (zero extra calls)

The most efficient path: have the chat model generate the summary alongside its answer, eliminating a separate summarizer call. This is done by injecting a META directive at the end of the context:

# META directive injected by chat_stream()
full_messages.append({
    "role": "system",
    "content": (
        # Each fragment ends with a space so the concatenated rules do not run together.
        "Internal summary rules (for the JSON below only, never in the main answer): "
        "Group by topic, each line: [Topic: name] + key facts/conclusions/details; "
        "unlimited topics, strictly reuse existing topic labels; "
        f"language matches user; total ≤ {summary_budget} chars."
        "\n\n"
        "Important: the main answer must use natural language, "
        "never use [Topic:] format in the answer.\n\n"
        "After completing your answer, append this JSON block:\n"
        f"{META_SENTINEL}\n"
        f'{{{json_template}}}'
    ),
})

Output structure: main answer → META_SENTINEL delimiter → JSON (containing summary and optional title). stream_manager intercepts the META portion during streaming, never pushes it to the frontend, parses it, and writes directly to the thread_summaries table.
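
Intercepting the META portion during streaming has one subtlety: the sentinel can be split across chunks, so the tail of each chunk has to be held back while it could still be the start of the sentinel. The following is a minimal sketch of that idea, not stream_manager's actual code. The sentinel value, the `MetaFilter` class, and the `summary` JSON key are all assumptions made for illustration.

```python
import json

META_SENTINEL = "<<META>>"  # hypothetical value; the real sentinel is not shown


class MetaFilter:
    """Passes the answer through and withholds everything from the sentinel on."""

    def __init__(self, sentinel=META_SENTINEL):
        self.sentinel = sentinel
        self._held = ""     # tail that might be the start of the sentinel
        self._meta = None   # text after the sentinel, once it has been seen

    def feed(self, chunk):
        """Return the part of the stream that is safe to forward to the client."""
        if self._meta is not None:
            self._meta += chunk
            return ""
        text = self._held + chunk
        idx = text.find(self.sentinel)
        if idx != -1:
            self._held = ""
            self._meta = text[idx + len(self.sentinel):]
            return text[:idx]
        # Hold back the longest suffix that is a proper prefix of the sentinel.
        keep = 0
        for k in range(min(len(text), len(self.sentinel) - 1), 0, -1):
            if text.endswith(self.sentinel[:k]):
                keep = k
                break
        self._held = text[len(text) - keep:]
        return text[:len(text) - keep]

    def finish(self):
        """Return (leftover visible text, parsed META dict or None)."""
        leftover, self._held = self._held, ""
        if self._meta is None:
            return leftover, None
        try:
            return leftover, json.loads(self._meta)
        except json.JSONDecodeError:
            return leftover, None  # caller would fall back to the summarizer


def run_demo():
    flt = MetaFilter()
    chunks = ["Use RS256 for sig", "ning.<<ME",
              'TA>>{"summary": "[Topic: Authentication] RS256"}']
    visible = "".join(flt.feed(c) for c in chunks)
    leftover, meta = flt.finish()
    return visible + leftover, meta
```

A `None` result from `finish()` corresponds to the "META parsing fails" case, where the standalone summarizer paths described next take over.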

Critical detail: "strictly reuse existing topic labels." If the previous summary has [Topic: Authentication] and this round adds auth-related info, the model must reuse [Topic: Authentication] rather than renaming it to [Topic: JWT Auth]. Otherwise, summaries balloon into countless near-duplicate topics over multiple rounds. summary_budget controls total length, defaulting to 100 chars (dynamically adjusted by stream_manager based on depth).

Generation path 2: standalone summarize() (first-time generation)

When META parsing fails (model didn't output it, or format error) or during historical data migration, fall back to a standalone summarizer call:

async def summarize(text: str, max_tokens: int) -> str:
    return await _summarizer_call(
        messages=[{
            "role": "user",
            "content": (
                f"Compress the following into a summary of no more than "
                f"{max_tokens} tokens, grouped by topic. Format: "
                f"[Topic: name] key facts and specific details. "
                f"Preserve core information, match language of source:\n\n{text}"
            ),
        }],
        max_tokens=max_tokens,
    )

Generation path 3: merge_summary() (incremental update)

Existing summary + new conversation round → incremental merge, avoiding regeneration from full history each time:

async def merge_summary(existing_summary: str, new_exchange: str, max_tokens: int) -> str:
    return await _summarizer_call(
        messages=[{
            "role": "user",
            "content": (
                f"Below is an existing summary of a conversation, plus a new exchange.\n"
                f"Merge the new content into the summary, grouped by topic. "
                f"Format: [Topic: name] key facts and specific details. "
                f"Strictly reuse existing topic labels, do not rename. "
                f"Keep within {max_tokens} tokens, match source language, "
                f"output only the summary:\n\n"
                f"[Existing summary]\n{existing_summary}\n\n"
                f"[New exchange]\n{new_exchange}"
            ),
        }],
        max_tokens=max_tokens,
    )

Merge is cheaper than a full summarize: its input is an already-compressed summary plus one round, not the entire history. The tradeoff is error accumulation: after many rounds of merging, early topic details gradually blur. This is acceptable for Deeppin because the user's current focus is usually in the most recent rounds, and early topics only need their key conclusions preserved.
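
A rough calculation makes the difference concrete. The numbers below are purely hypothetical (30 rounds of about 400 tokens each and an 800-token summary budget), and the functions only count summarizer input tokens while ignoring prompt overhead.

```python
def full_resummarize_input(rounds_so_far, tokens_per_round):
    """Input size when the whole history is summarized again from scratch."""
    return rounds_so_far * tokens_per_round


def merge_input(summary_budget, tokens_per_round):
    """Upper bound on input size for one merge: existing summary plus one round."""
    return summary_budget + tokens_per_round


ROUNDS, PER_ROUND, BUDGET = 30, 400, 800   # hypothetical workload

last_full = full_resummarize_input(ROUNDS, PER_ROUND)
last_merge = merge_input(BUDGET, PER_ROUND)

# Cumulative input if every round triggers an update.
total_full = sum(full_resummarize_input(r, PER_ROUND) for r in range(1, ROUNDS + 1))
total_merge = ROUNDS * merge_input(BUDGET, PER_ROUND)
```

Under these assumptions the final update reads 12,000 tokens with full re-summarization versus at most 1,200 with a merge, and the cumulative totals are 186,000 versus at most 36,000. The full approach grows quadratically with the number of rounds, while merging grows linearly.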

Write timing and priority

Summary updates happen in a background task after each conversation round, never blocking SSE stream delivery:

# stream_manager.py — background task priority

# 1. META parsed successfully → write directly (zero extra LLM calls)
if summary:
    asyncio.create_task(_save_summary(thread_id, summary, summary_budget))
else:
    # 2. META failed → fall back to merge_summary
    asyncio.create_task(
        _fallback_update_summary(thread_id, depth, user_content, full_content)
    )

# _fallback_update_summary internal logic:
#   has existing summary → merge_summary(existing, new_exchange, budget)
#   no existing summary → summarize(new_exchange, budget) (first-time)
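
One detail worth handling around `asyncio.create_task`: the event loop keeps only a weak reference to a task, so the Python documentation recommends holding a strong reference until the task finishes. The sketch below shows that pattern; `spawn_background`, `_background_tasks`, and the stand-in `save` coroutine are hypothetical and not part of stream_manager.

```python
import asyncio

_background_tasks = set()   # strong references to in-flight background tasks


def spawn_background(coro):
    """Start a fire-and-forget task that cannot be garbage-collected mid-run."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)  # drop the reference when done
    return task


async def _demo():
    saved = {}

    async def save(thread_id, summary):
        await asyncio.sleep(0)          # stands in for the database write
        saved[thread_id] = summary

    task = spawn_background(save("t1", "[Topic: Demo] ok"))
    in_flight = len(_background_tasks)
    await task                          # the real caller would not wait here
    await asyncio.sleep(0)              # give the done callback a chance to run
    return saved, in_flight, len(_background_tasks)


saved, in_flight, remaining = asyncio.run(_demo())
```

The demo awaits the task only to make the result observable. In the streaming path the task would be left to run on its own, which is what keeps summary updates from blocking delivery.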

Token budget by nesting depth

Summary token budget decreases with nesting depth — deeper threads get less summary space, ensuring total context stays bounded:

_BUDGETS_BY_DEPTH = [800, 500, 300, 150]

def _budget_for_depth(depth_from_root: int) -> int:
    return _BUDGETS_BY_DEPTH[min(depth_from_root, len(_BUDGETS_BY_DEPTH) - 1)]

Main thread (depth 0): 800 tokens; first-level sub-thread: 500; second level: 300; third level and deeper: 150. This gradient comes from an observation: deeper sub-threads are more narrowly focused and have higher information density, so they need less summary space.