Article · architecture

Write-Time Summary Maintenance: Why Not Compute on Read

Deeppin updates thread summaries asynchronously at write time rather than computing them on demand during context assembly. This article covers the trade-offs and implementation details behind that design choice.

2026-04-15 · 9 min read · architecture · memory · performance

Deeppin's context assembly relies on thread summaries: each ancestor thread needs a compressed summary to pass down to its children. The most intuitive implementation is to compute these on demand inside build_context: fetch the thread history, call the LLM to summarize it, and insert the result into the context. But this approach has a critical flaw: it puts LLM latency on the critical path of every user message.

§ 01 Part 1: The problem with read-time computation

Suppose a user sends a message in a sub-thread nested three levels deep. build_context needs to generate summaries for the main thread and two ancestor threads: three sequential LLM calls, adding 2–4 seconds before the AI even starts 'thinking'. The user has sent a message and is now waiting on context preparation alone.

Worse: if the user sends messages in multiple threads simultaneously (Deeppin supports concurrent conversations), each thread triggers the same summary generation, computing identical content redundantly.
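Both costs can be made concrete with a small simulation. This is a sketch under stated assumptions: `fake_llm_summarize`, `build_context_read_time`, the thread names, and the 10 ms delay are stand-ins invented for illustration, not Deeppin's code.

```python
import asyncio
import time

LLM_CALLS = 0  # counts simulated summarization calls

async def fake_llm_summarize(thread_id: str) -> str:
    """Hypothetical stand-in for an LLM call; sleeps to mimic latency."""
    global LLM_CALLS
    LLM_CALLS += 1
    await asyncio.sleep(0.01)
    return f"summary of {thread_id}"

async def build_context_read_time(ancestors: list[str]) -> list[str]:
    """Read-time strategy: summarize each ancestor, one call after another."""
    return [await fake_llm_summarize(t) for t in ancestors]

async def demo() -> tuple[int, float]:
    ancestors = ["main", "sub-1", "sub-2"]  # shared by two sibling sub-threads
    start = time.perf_counter()
    # Two messages arrive at once in sibling threads under the same ancestors.
    await asyncio.gather(
        build_context_read_time(ancestors),
        build_context_read_time(ancestors),
    )
    return LLM_CALLS, time.perf_counter() - start

calls, elapsed = asyncio.run(demo())
print(f"{calls} LLM calls in {elapsed:.3f}s")
```

Each request pays for three sequential calls, and the two concurrent requests summarize the same three ancestors twice, so the simulated LLM is called six times for three distinct summaries.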

§ 02 Part 2: Write-time maintenance design

Deeppin's approach: after each AI reply is saved to the database, a background task asynchronously updates the thread's summary cache. This doesn't block the response — the user is already watching the stream.

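import asyncio
import logging

logger = logging.getLogger(__name__)

# Hold strong references so pending tasks are not garbage-collected mid-flight.
_background_tasks: set[asyncio.Task] = set()

async def save_assistant_message(thread_id: str, content: str):
    # 1. Persist the reply first; this await completes before we continue
    await _db(lambda: supabase.table("messages").insert({...}).execute())

    # 2. Schedule the summary update without awaiting it (non-blocking)
    task = asyncio.create_task(_update_summary_async(thread_id))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)

async def _update_summary_async(thread_id: str):
    try:
        thread = await get_thread(thread_id)
        budget = compute_budget_for_depth(thread.depth)
        await summarizer.update_summary(thread_id, budget)
    except Exception as e:
        # Best-effort: a failed update must never affect the user's request
        logger.warning(f"Summary update failed (non-fatal): {e}")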
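To see why scheduling the update this way does not block the save path, here is a minimal runnable sketch. The names `slow_summary_update`, `save_message`, and `events` are hypothetical, and the sleep stands in for the LLM call. It also keeps a reference to the scheduled task in a set, which the asyncio documentation recommends so that a pending task is not garbage-collected before it finishes.

```python
import asyncio

events: list[str] = []                 # records the order things happen in
_background: set[asyncio.Task] = set() # strong references to pending tasks

async def slow_summary_update(thread_id: str) -> None:
    """Hypothetical stand-in for the summary update (LLM call + DB write)."""
    await asyncio.sleep(0.01)
    events.append(f"summary updated: {thread_id}")

async def save_message(thread_id: str) -> None:
    events.append("message saved")
    task = asyncio.create_task(slow_summary_update(thread_id))
    _background.add(task)
    task.add_done_callback(_background.discard)
    events.append("save returned")     # reached before the update has run

async def main() -> None:
    await save_message("t1")
    # In this demo only: wait for background work before the loop closes.
    await asyncio.gather(*_background)

asyncio.run(main())
print(events)
```

The save path records its return before the summary update completes, which is the property the write-time design depends on.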

§ 03 Part 3: token_budget as cache key

A summary is compressed to a specific token budget — the same thread history produces completely different summaries at 200 tokens vs 300 tokens. The database stores the last generated summary along with the budget used:

-- thread_summaries table
CREATE TABLE thread_summaries (
    thread_id    uuid PRIMARY KEY,
    summary      text,         -- compressed content
    token_budget int,          -- budget this summary was generated at
    updated_at   timestamptz
);

-- Read with budget check
SELECT summary FROM thread_summaries
WHERE thread_id = $1 AND token_budget = $2;

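On a cache hit, the stored summary is used directly. On a cache miss (for example, the stored summary was generated at a different budget because the thread's depth changed), build_context falls back to real-time computation and caches the result asynchronously.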
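The budget check can be sketched in memory as follows. This is an illustration under assumptions, not the real data layer: `SummaryRow` and `_rows` are invented here, and these synchronous `cache_summary` and `get_cached_summary` helpers only mimic the one-row-per-thread table with a dictionary.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SummaryRow:
    summary: str
    token_budget: int

# In-memory stand-in for the thread_summaries table, keyed by thread_id.
_rows: dict[str, SummaryRow] = {}

def cache_summary(thread_id: str, budget: int, summary: str) -> None:
    """Upsert: one row per thread, replacing any summary at an older budget."""
    _rows[thread_id] = SummaryRow(summary, budget)

def get_cached_summary(thread_id: str, budget: int) -> Optional[str]:
    """Return the stored summary only if it was generated at this budget."""
    row = _rows.get(thread_id)
    if row is None or row.token_budget != budget:
        return None
    return row.summary

cache_summary("t1", 300, "long-form summary")
hit = get_cached_summary("t1", 300)    # same budget: usable as-is
miss = get_cached_summary("t1", 200)   # different budget: treated as a miss
```

Because the row holds only the most recent summary, a budget change invalidates it rather than adding a second entry, which matches the single `thread_id` primary key in the schema.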

§ 04 Part 4: Fallback path

Write-time updates are best-effort: if the background task fails (e.g., Groq rate limit), the current user request is unaffected. build_context falls back to real-time computation when the cache misses:

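import asyncio

# Strong references so pending cache writes are not garbage-collected.
_cache_write_tasks: set[asyncio.Task] = set()

async def get_or_create_summary(thread_id: str, budget: int) -> str:
    # 1. Try cache
    cached = await get_cached_summary(thread_id, budget)
    if cached:
        return cached

    # 2. Fallback: compute now (adds latency, but doesn't break)
    summary = await summarizer.compute_summary(thread_id, budget)

    # 3. Cache asynchronously for next time
    task = asyncio.create_task(cache_summary(thread_id, budget, summary))
    _cache_write_tasks.add(task)
    task.add_done_callback(_cache_write_tasks.discard)
    return summary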
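The fallback can be exercised end to end with in-memory stubs. In this sketch the cache dictionary, the `compute_calls` counter, and the stub bodies of `get_cached_summary`, `compute_summary`, and `cache_summary` are all assumptions made for illustration; only the control flow of `get_or_create_summary` mirrors the function above.

```python
import asyncio

# Hypothetical in-memory stand-ins for the cache table and the summarizer.
_cache: dict[str, tuple[int, str]] = {}  # thread_id -> (token_budget, summary)
_pending: set[asyncio.Task] = set()      # keeps background writes alive
compute_calls = 0

async def get_cached_summary(thread_id: str, budget: int):
    row = _cache.get(thread_id)
    return row[1] if row and row[0] == budget else None

async def compute_summary(thread_id: str, budget: int) -> str:
    global compute_calls
    compute_calls += 1
    return f"summary of {thread_id} in {budget} tokens"

async def cache_summary(thread_id: str, budget: int, summary: str) -> None:
    _cache[thread_id] = (budget, summary)

async def get_or_create_summary(thread_id: str, budget: int) -> str:
    cached = await get_cached_summary(thread_id, budget)
    if cached:
        return cached
    summary = await compute_summary(thread_id, budget)
    task = asyncio.create_task(cache_summary(thread_id, budget, summary))
    _pending.add(task)
    task.add_done_callback(_pending.discard)
    return summary

async def demo() -> tuple[str, str, int]:
    first = await get_or_create_summary("t1", 300)   # miss: computes
    await asyncio.gather(*_pending)                  # let the cache write land
    second = await get_or_create_summary("t1", 300)  # hit: no recompute
    await get_or_create_summary("t1", 200)           # budget changed: miss
    await asyncio.gather(*_pending)
    return first, second, compute_calls

first, second, calls = asyncio.run(demo())
```

Only the first request and the request at a new budget pay for computation; the repeat request at the same budget is served from the cache.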

§ 05 Part 5: Trade-offs

Write-time: lower p99 latency, but the summary may lag one message behind because the update is asynchronous.
Read-time: the summary is always current, but time-to-first-token increases significantly (the 2–4 seconds described in Part 1).
For Deeppin's use case, a slightly stale summary is perfectly acceptable: summaries are compressed approximations, so a one-message difference is negligible.
Reducing time-to-first-token has a directly perceptible impact on user experience.