Article · RAG

Handling Long Text: From Chunking to RAG Injection

When a user sends very long text or uploads a file, how the system chunks, embeds, stores, and later precisely recalls relevant passages in subsequent conversation.

2026-04-159 min readRAGlong-textchunking

Users pasting long passages or uploading files is a common scenario. Stuffing the entire content into context creates two problems: it can blow the token window, and LLM attention to content in the middle of long documents degrades significantly (the "Lost in the Middle" phenomenon).

Deeppin's approach: move oversized content out of the direct context, build a vector index, and retrieve only relevant passages on demand rather than passing everything every time.

§ 01Part 1 — Trigger condition

LONG_TEXT_THRESHOLD = 800  # characters

if len(user_content) > LONG_TEXT_THRESHOLD:
    await store_long_text_chunks(session_id, user_content, label="user_long_text")

§ 02Part 2 — Chunking strategy

LangChain's RecursiveCharacterTextSplitter, cutting at semantic boundaries:

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunk_size=800: ~320 tokens, enough to express a complete idea
chunk_overlap=100: adjacent chunks overlap to avoid semantic truncation at boundaries
Separator priority: paragraph > sentence > word > character

§ 03Part 3 — Embedding and storage

vecs = await embed_texts(chunks)  # batch — single call

rows = [
    {
        "session_id": session_id,
        "filename": label,
        "chunk_index": i,
        "content": chunk,
        "embedding": format_vector(vec),
    }
    for i, (chunk, vec) in enumerate(zip(chunks, vecs))
]
await _db(lambda: sb.table("attachment_chunks").insert(rows).execute())

Batch embedding processes all chunks in a single call — several times faster than sequential calls. bge-m3 supports batch processing natively.

§ 04Part 4 — Context handling during conversation

Step 1: replace original message with a placeholder

After the long text is indexed, the original content in the user message is replaced with a placeholder, preventing the full text from being passed on every turn:

placeholder = (
    f"[User provided long text ({char_len} chars), chunked and indexed. "
    f"Relevant passages are injected via system context above. "
    f"Text beginning for reference: {m['content'][:200]}…]"
)

Step 2: on-demand RAG retrieval

On each subsequent turn, the current question retrieves the most relevant chunks. "What does paragraph three say?" retrieves that chunk. "What's the core argument?" retrieves the chunk containing it.

§ 05Part 5 — prefer_filename: the first question after file upload

There is a special case right after a file upload: chunks from older files may rank higher than the newly uploaded file, causing the first answer to reference the wrong document.

stream_manager passes a prefer_filename parameter when processing the first message after a file upload:

# stream_manager.py — detect freshly uploaded file
prefer_filename = None
if attachment_filename:
    prefer_filename = attachment_filename

context = await build_context(
    thread_id,
    query_text=user_content,
    prefer_filename=prefer_filename,  # lock to the new file
)

§ 06Part 6 — The Lost in the Middle problem

Research shows LLM attention to content in the middle of long documents is significantly lower than at the start or end (Liu et al. 2023). Chunking + on-demand retrieval sidesteps this entirely: only 3–4 relevant chunks are injected, and they appear near the top of the context in the LLM's high-attention zone.

iDeeppin's current injection order: ancestor summaries → anchor text → RAG file chunks → RAG conversation memory → current conversation. RAG chunks sit near the top, avoiding Lost in the Middle.