Write-Time Summary Maintenance: Why Not Compute on Read
Deeppin updates thread summaries asynchronously at write time rather than computing them on demand during context assembly. This article covers the trade-offs and implementation details behind that design choice.
Deeppin's context assembly relies on thread summaries: each ancestor thread needs a compressed summary to pass down to its children. The most intuitive implementation computes these on demand inside build_context: fetch the thread history, call the LLM to summarize it, and insert the result into the context. But this approach has a critical flaw.
§ 01Part 1 — The problem with read-time computation
Suppose a user sends a message in a sub-thread nested three levels deep. build_context needs summaries for the main thread and two ancestor threads: three sequential LLM calls, adding 2–4 seconds before the AI even starts 'thinking'. The user has sent a message and is now waiting on context preparation alone.
Worse: if the user sends messages in multiple threads simultaneously (Deeppin supports concurrent conversations), each thread triggers the same summary generation, computing identical content redundantly.
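To make the cost concrete, here is a minimal simulation of the read-time approach. Everything in it is hypothetical: `fake_summarize` stands in for the LLM call (it sleeps briefly instead of calling a model), the thread names are invented, and it assumes the two concurrent sub-threads share the same ancestors. It is not Deeppin's actual build_context.

```python
import asyncio

CALLS: list[str] = []  # records every simulated LLM call

async def fake_summarize(thread_id: str) -> str:
    """Hypothetical stand-in for an LLM summary call."""
    CALLS.append(thread_id)
    await asyncio.sleep(0.01)  # placeholder for real model latency
    return f"summary({thread_id})"

async def build_context_read_time(ancestors: list[str]) -> list[str]:
    """Read-time approach: summarize each ancestor, one call after another."""
    return [await fake_summarize(t) for t in ancestors]

async def demo() -> int:
    CALLS.clear()
    # Assumption: two sub-threads that share the same three ancestors
    # receive a message at the same time.
    ancestors = ["main", "sub-1", "sub-2"]
    await asyncio.gather(
        build_context_read_time(ancestors),
        build_context_read_time(ancestors),
    )
    return len(CALLS)
```

Running `asyncio.run(demo())` counts six simulated LLM calls for only three distinct summaries, and within each request the three calls run strictly one after another.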
§ 02Part 2 — Write-time maintenance design
Deeppin's approach: after each AI reply is saved to the database, a background task asynchronously updates the thread's summary cache. This doesn't block the response — the user is already watching the stream.
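import asyncio
import logging

logger = logging.getLogger(__name__)
# Strong references, so pending background tasks are not garbage-collected
_background_tasks: set[asyncio.Task] = set()

async def save_assistant_message(thread_id: str, content: str):
    # 1. Persist the message; awaited, so it completes before we return
    await _db(lambda: supabase.table("messages").insert({...}).execute())
    # 2. Schedule the summary update without awaiting it (non-blocking)
    task = asyncio.create_task(_update_summary_async(thread_id))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)

async def _update_summary_async(thread_id: str):
    try:
        thread = await get_thread(thread_id)
        budget = compute_budget_for_depth(thread.depth)
        await summarizer.update_summary(thread_id, budget)
    except Exception as e:
        # Best-effort: a failed update must never affect the user request
        logger.warning(f"Summary update failed (non-fatal): {e}")
§ 03Part 3 — token_budget as cache key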
A summary is compressed to a specific token budget — the same thread history produces completely different summaries at 200 tokens vs 300 tokens. The database stores the last generated summary along with the budget used:
-- thread_summaries table
CREATE TABLE thread_summaries (
    thread_id    uuid PRIMARY KEY,
    summary      text,         -- compressed content
    token_budget int,          -- budget this summary was generated at
    updated_at   timestamptz
);

-- Read with budget check
SELECT summary
FROM thread_summaries
WHERE thread_id = $1 AND token_budget = $2;
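On a cache hit, the stored summary is used directly. On a cache miss (e.g., the thread depth changed, so the requested budget no longer matches the stored one), build_context falls back to real-time computation and caches the result asynchronously.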
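The budget check can be sketched with an in-memory dictionary standing in for the thread_summaries table. The names `store_summary` and `read_summary` are hypothetical, and the sketch assumes one row per thread, as the primary key on thread_id suggests, so storing a summary at a new budget replaces the old one.

```python
from typing import Optional

# Stand-in for the thread_summaries table: thread_id -> (summary, token_budget)
_rows: dict[str, tuple[str, int]] = {}

def store_summary(thread_id: str, summary: str, budget: int) -> None:
    """Upsert: thread_id is the key, so a new budget overwrites the old row."""
    _rows[thread_id] = (summary, budget)

def read_summary(thread_id: str, budget: int) -> Optional[str]:
    """Return the summary only if it was generated at the requested budget."""
    row = _rows.get(thread_id)
    if row is not None and row[1] == budget:
        return row[0]
    return None  # cache miss: no row, or the row was built for another budget

store_summary("t1", "short summary", 200)
hit = read_summary("t1", 200)         # same budget: hit
miss = read_summary("t1", 300)        # different budget: miss
store_summary("t1", "longer summary", 300)
evicted = read_summary("t1", 200)     # the 200-token row has been replaced
```

Under this one-row assumption, a thread that is read at alternating budgets would miss on every switch, which is one reason to derive the budget deterministically from something stable such as thread depth.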
§ 04Part 4 — Fallback path
Write-time updates are best-effort: if the background task fails (e.g., Groq rate limit), the current user request is unaffected. build_context falls back to real-time computation when the cache misses:
async def get_or_create_summary(thread_id: str, budget: int) -> str:
    # 1. Try cache
    cached = await get_cached_summary(thread_id, budget)
    if cached:
        return cached
    # 2. Fallback: compute now (adds latency, but doesn't break)
    summary = await summarizer.compute_summary(thread_id, budget)
    # 3. Cache asynchronously for next time
    asyncio.create_task(cache_summary(thread_id, budget, summary))
    return summary
§ 05Part 5 — Trade-offs
- Write-time: lower p99 latency, but summary may lag one message behind (async update)
- Read-time: summary always current, but significantly increases time-to-first-token
- For Deeppin's use case, a slightly stale summary is perfectly acceptable — summaries are compressed approximations, one message difference is negligible
- Reducing time-to-first-token has directly perceptible impact on user experience
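The "one message behind" lag can be reproduced in a small simulation. This is a sketch under stated assumptions, not Deeppin's code: the in-memory dictionaries, `save_message`, and `_update_summary` are hypothetical stand-ins, and the sleep replaces the LLM call. The `_background` set holds references to pending tasks, which the asyncio documentation recommends for fire-and-forget tasks.

```python
import asyncio

summaries: dict[str, str] = {}          # thread_id -> cached summary
messages: dict[str, list[str]] = {}     # thread_id -> message history
_background: set[asyncio.Task] = set()  # keeps fire-and-forget tasks alive

async def _update_summary(thread_id: str) -> None:
    await asyncio.sleep(0.01)  # stand-in for the LLM call
    history = messages[thread_id]
    summaries[thread_id] = f"{len(history)} messages summarized"

async def save_message(thread_id: str, content: str) -> None:
    messages.setdefault(thread_id, []).append(content)
    # Schedule the summary update but do not wait for it.
    task = asyncio.create_task(_update_summary(thread_id))
    _background.add(task)
    task.add_done_callback(_background.discard)

async def demo() -> tuple[str, str]:
    await save_message("t1", "first reply")
    await asyncio.gather(*_background)   # let the first update finish
    await save_message("t1", "second reply")
    stale = summaries["t1"]              # read before the update lands
    await asyncio.gather(*_background)   # now the second update is done
    fresh = summaries["t1"]
    return stale, fresh
```

A read issued immediately after the second save still sees the summary built from one message; once the background task completes, the summary covers both. That window is the staleness the write-time design accepts in exchange for keeping the LLM call off the request path.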