Article · context

How Three Context Layers Cooperate — and How They Compare to Mainstream Approaches

How the sliding window, ancestor summary chain, and RAG semantic recall divide labor, compared against full history, fixed window, and RAG-only mainstream approaches.

2026-04-1510 min readcontextarchitecturecomparison

The previous three articles introduced the sliding window, the ancestor summary chain, and dual-track RAG. This article brings them together: how they divide labor, how they complement each other, and how they compare to mainstream approaches.

§ 01Part 1 — Division of labor

┌──────────────────── messages sent to LLM ──────────────────────┐
│ [system] Main thread summary      ← inter-thread: structural   │
│ [system] Pin A summary            ← inter-thread: ancestor chain│
│ [system] Pin B summary            ← (larger budget = closer)   │
│ [system] Anchor text              ← inter-thread: focus point  │
│ [system] Relevant file chunks     ← RAG: document retrieval    │
│ [system] Relevant past exchanges  ← RAG: cross-thread recall   │
│ [user/assistant] Recent 10 msgs   ← intra-thread: recent conv  │
└─────────────────────────────────────────────────────────────────┘

Sliding window: recent conversation, most direct reference, always present
Ancestor summary chain: structural background — tells the LLM where this sub-thread came from
RAG: semantic associations that cross thread boundaries, filling gaps the summary layer can't reach

§ 02Part 2 — Mainstream approaches compared

Approach A: Full context

Pass everything to the LLM, relying on large-window models (Gemini 1M, GPT-4o 128K).

Pro: complete information
Con: cost scales linearly with conversation length; nested multi-thread causes token explosion; most content is irrelevant noise

Approach B: Fixed window

Pass only the most recent N messages, discard the rest. Early ChatGPT's approach.

Pro: simple to implement, token-predictable
Con: loses critical background after N messages; no cross-thread support at all

Approach C: RAG-only

Embed all history, retrieve the K most relevant pieces each time. The core idea behind memory systems like Mem0.

Pro: theoretically unlimited history, token-controlled
Con: structural information lost (chronological order, causality, nesting); pronoun resolution fails ('this', 'it', 'as mentioned above')

Approach D: Summary chain

Incrementally compress history, maintain an always-current summary. Claude's conversation summarization takes this approach.

Pro: structure preserved, token-controlled
Con: summaries inevitably lose detail; cross-thread semantic associations still absent

Why all three layers are indispensable

Sliding window alone: the thread forgets once it's long enough, and cross-thread awareness is zero.

Summary chain alone: knows ancestor background, but sibling threads' discussions are permanently invisible, and file content can't be searched.

RAG alone: can retrieve related content, but has no idea about the current sub-thread's structural origin (which pin opened it) — anchor information is lost.

Deeppin's combination

Summary chain (inter-thread) + fixed window (intra-thread) + RAG (cross-thread semantics) = structural completeness + recent detail + semantic association. Each mechanism's weakness is covered by the others:

Summary chain → structural background (where I came from)     ✓  detail loss  ✓
Fixed window  → recent detail (what just happened)            ✓  long-term    ×
RAG           → semantic association (where was this discussed)✓  structure    ×

Combined:
  Structural background ✓  Recent detail ✓  Semantic association ✓  Token-controlled ✓

§ 03Part 3 — Actual token distribution

In a depth-3 sub-thread, typical token allocation across context layers:

Main thread summary    150 tokens   ~9%
Pin A summary          300 tokens   ~18%
Pin B summary          500 tokens   ~30%
Anchor text             50 tokens   ~3%
RAG file chunks        400 tokens   ~24%
RAG conversation mem   200 tokens   ~12%
Current conversation    70 tokens   ~4%  (10 messages)
──────────────────────────────────────────
Total                 1670 tokens  ← well below the 7200-token cap

The ancestor summary chain is the largest portion (~61%), which makes sense — a sub-thread exists to drill into a specific background, and that background is the most important information.