
RAG and Semantic Chunking: Why the Split Strategy Determines Retrieval Quality

Fixed-size chunking is the most common source of RAG failure. Deeppin uses embedding-based semantic chunking: split by sentence first, then use vector distance to detect topic boundaries, ensuring each chunk is a complete semantic unit.

2026-04-15 · 36 min read · RAG, embeddings, semantic-chunking

The most overlooked component in a RAG pipeline is chunking. No matter how good the embedding model or how fast the vector database, if chunk boundaries cut through the middle of an argument — splitting a complete thought in two — both halves are incomplete in isolation. Retrieval quality is capped the moment you split the document.

§ 01Part 1 — How fixed-size chunking fails

The most common chunking approach is fixed character count (e.g., 800 chars per chunk, 100-char overlap). Simple to implement, but with a fundamental problem: semantic boundaries have nothing to do with character count.

# Typical fixed-size failure
Original text:
  "...the attention mechanism computes weights via Query, Key, and Value matrices.
   [800-char boundary cuts here]
   This design allows the model to process all sequence positions in parallel..."

Resulting chunks:
  Chunk A: ...attention mechanism computes weights via Query, Key, Value matrices.
  Chunk B: This design allows the model to process all positions in parallel...

# User asks: "how does attention enable parallelism?"
# Chunk A mentions the mechanism but not parallelism
# Chunk B mentions parallelism but not why
# Neither chunk can answer the question alone

Overlap is a band-aid, not a fix — it only helps near boundaries, and can't guarantee a complete reasoning unit stays in a single chunk.

§ 02Part 2 — The semantic chunking approach

The core idea: let the embedding model decide where to cut. A spike in the vector distance between two adjacent sentences signals a topic shift, and that is a natural boundary.

Deeppin's semantic chunking has three steps:

[Fig. 1 · semantic-chunking: ① split sentences (s1 to s8); ② cosine distances between adjacent sentences: 0.12, 0.15, 0.44 (boundary), 0.11, 0.09, 0.42 (boundary), 0.17, against a threshold of 0.3; ③ merge by boundary into Chunk A (s1 + s2 + s3), Chunk B (s4 + s5 + s6), Chunk C (s7 + s8); then embed each chunk and store in pgvector.]

  • Step 1: sentence splitting — break the document into the smallest semantic units (sentences) using punctuation rules
  • Step 2: compute cosine distance between adjacent sentence embeddings, find distance spikes (topic boundaries)
  • Step 3: merge consecutive sentences within the same topic into one chunk, stopping at topic boundaries or token limits

import numpy as np

def semantic_chunk(text: str, model, breakpoint_threshold: float = 0.3) -> list[str]:
    # split_sentences is defined elsewhere; it splits on 。！？ and .!? etc.
    sentences = split_sentences(text)
    if len(sentences) <= 1:
        return sentences
    
    # Batch embed all sentences (normalized, so dot product = cosine similarity)
    vecs = model.encode(sentences, batch_size=16, normalize_embeddings=True)
    
    # Cosine distance between adjacent sentences
    distances = [
        1 - float(np.dot(vecs[i], vecs[i + 1]))
        for i in range(len(vecs) - 1)
    ]
    
    # Distance spikes = topic transitions; a set keeps the membership test O(1)
    breakpoints = {i for i, d in enumerate(distances) if d > breakpoint_threshold}
    
    # Merge sentences into chunks by boundary
    chunks, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        if i in breakpoints or i == len(sentences) - 1:
            chunks.append("".join(current))
            current = []
    
    return chunks
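
To make the merge step concrete, the sketch below replays the eight sentences and seven adjacent distances from Fig. 1 without calling an embedding model. The helper name `group_by_distance` and the placeholder labels `s1` to `s8` are illustrative only, not part of Deeppin's code.

```python
def group_by_distance(sentences, distances, threshold=0.3):
    """Cut after sentence i whenever distances[i] exceeds the threshold."""
    if len(distances) != max(len(sentences) - 1, 0):
        raise ValueError("need exactly one distance per adjacent sentence pair")
    groups, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        is_last = i == len(sentences) - 1
        if is_last or distances[i] > threshold:
            groups.append(current)
            current = []
    return groups


# Values read off Fig. 1: s3->s4 (0.44) and s6->s7 (0.42) exceed 0.3.
sentences = [f"s{n}" for n in range(1, 9)]
distances = [0.12, 0.15, 0.44, 0.11, 0.09, 0.42, 0.17]
groups = group_by_distance(sentences, distances)
print(groups)
# [['s1', 's2', 's3'], ['s4', 's5', 's6'], ['s7', 's8']]
```

The result matches Chunk A, B and C in the figure: every cut falls after the sentence whose distance to its successor crosses the threshold.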

§ 03Part 3 — Choosing breakpoint_threshold

The threshold controls chunk granularity. Distance = 1 − cosine similarity: 0 means identical, 2 means opposite.

  • threshold = 0.2: fine-grained — cuts at every topic shift, many short chunks (~2–3 sentences each)
  • threshold = 0.3: balanced — works for most documents (Deeppin's default)
  • threshold = 0.5: coarse — only cuts at major topic jumps, fewer longer chunks
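
As a rough illustration of the distance scale and of how the threshold changes granularity, the sketch below computes cosine distance in plain Python and counts the chunks each threshold would produce on the Fig. 1 distances. Both helper names are made up for this example, and `cosine_distance` assumes non-zero vectors.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm  # assumes neither vector is all zeros

def chunk_count(distances, threshold):
    """Number of chunks produced: one more than the number of cuts."""
    return 1 + sum(d > threshold for d in distances)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite -> 2.0

fig1 = [0.12, 0.15, 0.44, 0.11, 0.09, 0.42, 0.17]
for threshold in (0.2, 0.3, 0.5):
    print(threshold, chunk_count(fig1, threshold))
```

On this particular input, 0.2 and 0.3 both give three chunks because no distance falls between them, while 0.5 collapses everything into one chunk. The fine-grained setting only starts to differ once a document has adjacent distances in the 0.2 to 0.3 band.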

Deeppin also adds a hard token cap: even without a topic boundary, chunks exceeding 600 tokens are force-split. Very long chunks dilute the embedding's semantic precision even though bge-m3 supports 8192-token inputs.

MAX_CHUNK_TOKENS = 600

def should_force_break(current_tokens: int, next_sentence: str) -> bool:
    return current_tokens + count_tokens(next_sentence) > MAX_CHUNK_TOKENS
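
One way the force-break check could slot into the merge loop is sketched below. This is an assumption about the wiring, not Deeppin's code: `merge_with_cap` is a hypothetical name, and `count_tokens` is a whitespace stand-in that would need to be replaced by a real tokenizer, especially for Chinese text.

```python
MAX_CHUNK_TOKENS = 600

def count_tokens(text):
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())

def merge_with_cap(sentences, breakpoints, max_tokens=MAX_CHUNK_TOKENS):
    """Merge sentences, cutting at topic boundaries or before the cap is exceeded.

    `breakpoints` holds indices i such that a cut falls after sentences[i].
    """
    chunks, current, current_tokens = [], [], 0
    for i, sentence in enumerate(sentences):
        n = count_tokens(sentence)
        # Force a cut before this sentence if adding it would exceed the cap.
        if current and current_tokens + n > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
        if i in breakpoints:  # topic boundary after this sentence
            chunks.append(current)
            current, current_tokens = [], 0
    if current:
        chunks.append(current)
    return chunks

sentences = ["a b c", "d e", "f g h", "i j"]
capped = merge_with_cap(sentences, breakpoints={2}, max_tokens=5)
print(capped)
# [['a b c', 'd e'], ['f g h'], ['i j']]
```

With a cap of 5, the first two sentences fill a chunk exactly, the third is pushed into a new chunk by the cap, and the topic boundary after it closes that chunk immediately. A single sentence longer than the cap still becomes its own oversized chunk in this sketch.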

§ 04Part 4 — Semantic vs. fixed-size: a real comparison

Same technical document (~3,000 characters on the CAP theorem in distributed systems)

Fixed chunking (chunk_size=800, overlap=100):
  → 4 chunks, avg 750 chars
  → "consistency" and "the availability trade-off" split across different chunks
  → Recall rate for "CAP theorem trade-off analysis": ~60%

Semantic chunking (threshold=0.3):
  → 7 chunks, avg 380 chars
  → Each chunk covers one concept (consistency, availability, partition tolerance, trade-offs)
  → Recall rate for "CAP theorem trade-off analysis": ~88%

More, shorter chunks look less 'efficient' than fixed chunking, but each chunk's semantic purity is higher — its embedding vector more faithfully represents its content. That's the fundamental factor that determines retrieval quality.

§ 05Part 5 — Deeppin's dual-track RAG

Semantic chunking applies to Track 1 — file attachments. Deeppin has two independent RAG data sources:

Track 1: attachment_chunks (semantic chunking)

Uploaded files are semantically chunked, each chunk embedded and stored. Retrieval does cosine similarity search in pgvector, returning the 4–5 most relevant chunks.
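
The real search runs inside pgvector, but the "top-k above a similarity threshold" behaviour can be illustrated with a plain-Python stand-in. The function name, the `(text, vector)` row shape and the inclusive `>=` comparison are assumptions made for this sketch.

```python
def top_k_by_similarity(query_vec, rows, top_k=4, threshold=0.45):
    """Return up to top_k (similarity, text) pairs with similarity >= threshold.

    Assumes all vectors are L2-normalized, so dot product = cosine similarity.
    """
    scored = [
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in rows
    ]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]

rows = [
    ("chunk about consistency", [1.0, 0.0]),
    ("chunk about availability", [0.6, 0.8]),
    ("unrelated footer", [0.0, 1.0]),
]
hits = top_k_by_similarity([1.0, 0.0], rows)
print([text for _, text in hits])
# ['chunk about consistency', 'chunk about availability']
```

The footer chunk scores 0.0 against the query and is filtered out by the threshold even though fewer than `top_k` results remain.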

Track 2: conversation_memories (whole-turn embedding)

After each turn, (user message + AI reply) is embedded as a whole and stored — no chunking needed. This lets the current thread recall what other threads in the session discussed. It's the only channel for information to flow between sibling threads.

# Concurrent dual-track retrieval
chunk_res, memory_res = await asyncio.gather(
    search_attachment_chunks(query_vec, session_id, top_k=4, threshold=0.45),
    search_conversation_memories(
        query_vec, session_id, top_k=3, threshold=0.45,
        exclude_thread_id=thread_id,
    ),
)
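
For readers who want to run the pattern, here is a self-contained sketch with stub coroutines standing in for the two database searches. The stub bodies, the sample data and the wrapper name `retrieve` are invented for illustration; only the call shape follows the snippet above.

```python
import asyncio

async def search_attachment_chunks(query_vec, session_id, top_k, threshold):
    await asyncio.sleep(0.01)  # stands in for a database round trip
    return ["chunk-1", "chunk-2"][:top_k]

async def search_conversation_memories(query_vec, session_id, top_k, threshold,
                                       exclude_thread_id):
    await asyncio.sleep(0.01)
    memories = [("thread-a", "memory from a"), ("thread-b", "memory from b")]
    # Only turns from other threads are returned.
    return [text for tid, text in memories if tid != exclude_thread_id][:top_k]

async def retrieve(query_vec, session_id, thread_id):
    # Both searches are awaited concurrently; results keep argument order.
    return await asyncio.gather(
        search_attachment_chunks(query_vec, session_id, top_k=4, threshold=0.45),
        search_conversation_memories(
            query_vec, session_id, top_k=3, threshold=0.45,
            exclude_thread_id=thread_id,
        ),
    )

chunk_res, memory_res = asyncio.run(retrieve([0.0], "session-1", "thread-a"))
print(chunk_res, memory_res)
# ['chunk-1', 'chunk-2'] ['memory from b']
```

Because `asyncio.gather` returns results in the order the awaitables were passed, the two-name unpacking is safe regardless of which search finishes first.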

§ 06Part 6 — Retrieval engineering details

Instruction-type query handling

"Summarize this file" — the query vector represents an action, not content. It sits far from file content vectors, falling below the similarity threshold. Fix: detect file-reference keywords, drop threshold to zero:

import re

FILE_REF_PATTERN = re.compile(
    r"(file|document|attachment|report|this|just uploaded|summarize|what does it say)",
    re.IGNORECASE,
)

is_file_ref = bool(FILE_REF_PATTERN.search(query_text))
threshold = 0.0 if is_file_ref else 0.45
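
A pattern without word boundaries also fires on substrings, for example "file" inside "profile". The variant below is a hedged sketch of one way to tighten it for space-delimited languages. The name `FILE_REF_WORDS`, the optional plural forms and the helper `pick_threshold` are assumptions, not Deeppin's code.

```python
import re

# Same keyword list as above, wrapped in \b word boundaries, with optional plurals.
FILE_REF_WORDS = re.compile(
    r"\b(files?|documents?|attachments?|reports?|this|just uploaded"
    r"|summarize|what does it say)\b",
    re.IGNORECASE,
)

def pick_threshold(query_text, default=0.45):
    """Return 0.0 for queries that refer to the file itself, else the default."""
    return 0.0 if FILE_REF_WORDS.search(query_text) else default

print(pick_threshold("Summarize this file"))                # 0.0
print(pick_threshold("How do I edit my profile picture?"))  # 0.45
print(pick_threshold("Compare the two reports"))            # 0.0
```

The trade-off is recall: word boundaries do not exist between adjacent CJK characters, so keywords in Chinese would still need the unanchored form.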

Two-layer fallback

  • Primary: threshold 0.45 — filters irrelevant results
  • Empty result fallback: zero-threshold, force-return top-k — imperfect result beats no result
  • Fresh upload: prefer_filename ensures the newly uploaded file's chunks rank above older files, which would otherwise win by sheer volume
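
The three rules above can be sketched as a single function over already-scored candidates. How `prefer_filename` is applied in Deeppin is not shown in this article, so the ordering used here (preferred file first, then similarity) is an assumption, as are the function name and the tuple layout.

```python
def search_with_fallback(scored, top_k=4, threshold=0.45, prefer_filename=None):
    """scored: list of (similarity, filename, text) tuples for one query."""
    def rank(item):
        similarity, filename, _ = item
        # Assumed ordering: preferred file first, then highest similarity.
        return (filename != prefer_filename, -similarity)

    ranked = sorted(scored, key=rank)
    primary = [item for item in ranked if item[0] >= threshold]
    # Fallback: if nothing clears the threshold, return the top-k anyway.
    return (primary or ranked)[:top_k]

scored = [(0.30, "old.pdf", "a"), (0.20, "new.pdf", "b"), (0.10, "old.pdf", "c")]
print(search_with_fallback(scored, top_k=2))
# [(0.3, 'old.pdf', 'a'), (0.2, 'new.pdf', 'b')]
print(search_with_fallback(scored, top_k=2, prefer_filename="new.pdf"))
# [(0.2, 'new.pdf', 'b'), (0.3, 'old.pdf', 'a')]
```

In the first call nothing reaches 0.45, so the fallback returns the two best candidates. In the second, the freshly uploaded file is ranked ahead of the older one despite its lower score.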

§ 07Part 7 — Why local embedding makes semantic chunking viable

Semantic chunking embeds every sentence in the document — 5–10x more embedding calls than fixed-size chunking. With a paid API like OpenAI's text-embedding-3-small, this would multiply costs proportionally. With bge-m3 deployed locally on Oracle ARM, the marginal cost per embedding is zero:

  • 1024-dim vectors, optimized for both Chinese and English — suits Deeppin's bilingual context
  • 8192-token max input — even long chunks embed in a single call
  • Zero API cost — the high call volume from semantic chunking doesn't increase spending
  • 570MB model fits comfortably in Oracle ARM's 24GB RAM
The trade-off: upload processing takes longer. Fixed chunking handles 50 chunks in ~5 seconds; semantic chunking must batch-embed all sentences first, taking 10–15 seconds for the same document. This cost is paid at upload time, not query time: users can tolerate a wait when uploading, but not when asking a question.

§ 08Part 8 — Deeppin's actual implementation

The pseudocode above uses "distance > threshold" to explain the concept. The production code uses cosine similarity (not distance) and character counts (not token counts) as the chunk length unit. Below are the real parameters and core logic running in Deeppin's backend.

Entry routing: inline vs. RAG

Not every uploaded file needs the vector store. Short texts are injected directly as message context, skipping the chunking and embedding overhead:

INLINE_THRESHOLD = 3000  # characters

async def process_attachment(session_id, filename, content):
    text = await extract_text(content, filename)

    if len(text) <= INLINE_THRESHOLD:
        # Short text: inline, skip vector store
        return {"chunk_count": 0, "inline_text": text}

    # Long text: semantic chunk → embed → store
    chunks = await chunk_text_semantic(text)
    embeddings = await embed_texts(chunks)
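    await store_chunks(session_id, filename, chunks, embeddings)
    # Assumed return shape, mirroring the inline branch above
    return {"chunk_count": len(chunks), "inline_text": None}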
3,000 characters ≈ 1,200 tokens (assuming ~2.5 chars/token, the same ratio behind the 18,000-character ≈ 7,200-token context cap). This threshold matches the per-message truncation cap in context_builder: if a text fits into message history without truncation, there's no point routing it through RAG.

Sentence splitting

Split on blank lines first (paragraph boundaries), then on sentence-ending punctuation within each paragraph. Supports both Chinese (。！？) and English (.!?) punctuation:

import re

_SENT_SPLIT_RE = re.compile(r'(?<=[。！？.!?])\s*')

def _split_sentences(text: str) -> list[str]:
    sentences = []
    for para in re.split(r'\n\s*\n', text):   # split on blank lines
        para = para.strip()
        if not para:
            continue
        for sent in _SENT_SPLIT_RE.split(para):  # split on sentence-end punct
            sent = sent.strip()
            if sent:
                sentences.append(sent)
    return sentences
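
Because the pattern allows zero whitespace after ASCII punctuation, it also splits inside tokens such as "3.14". A possible variant, offered as a sketch rather than Deeppin's implementation, requires whitespace after ASCII punctuation while still splitting immediately after CJK punctuation. The names `SENT_SPLIT_STRICT` and `split_sentences_strict` are hypothetical, and splitting on empty matches requires Python 3.7 or later.

```python
import re

# CJK punctuation splits immediately; ASCII punctuation only when followed by
# whitespace, so "3.14" and "v2.0" stay intact.
SENT_SPLIT_STRICT = re.compile(r'(?<=[。！？])\s*|(?<=[.!?])\s+')

def split_sentences_strict(text):
    sentences = []
    for para in re.split(r'\n\s*\n', text):  # paragraph boundaries first
        for sent in SENT_SPLIT_STRICT.split(para.strip()):
            sent = sent.strip()
            if sent:  # drop empty pieces left by a trailing match
                sentences.append(sent)
    return sentences

result = split_sentences_strict("Pi is 3.14 today. 你好。世界！\n\nNew paragraph?")
print(result)
# ['Pi is 3.14 today.', '你好。', '世界！', 'New paragraph?']
```

Abbreviations followed by a space ("e.g. this") still trigger a split in this variant. Handling those would need an abbreviation list or a dedicated sentence tokenizer.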

Core semantic chunking logic

Actual parameter table:

SEMANTIC_THRESHOLD = 0.75  # break when adjacent cosine similarity < 0.75
MAX_CHUNK_CHARS    = 600   # max characters per chunk
MIN_CHUNK_CHARS    = 50    # min characters; shorter chunks keep merging

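Note: this uses a similarity threshold (similarity < 0.75 → break), not the distance threshold (distance > 0.3 → break) from the pseudocode above. The two formulations are interchangeable because distance = 1 − similarity, but the values are not identical: 0.75 similarity corresponds to 0.25 distance, so production cuts slightly more readily than the pseudocode's 0.3. Similarity is also the more intuitive reading: "if two sentences aren't similar enough, cut."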

async def chunk_text_semantic(text: str) -> list[str]:
    sentences = _split_sentences(text)
    if len(sentences) <= 1:
        return sentences

    # Batch embed all sentences (bge-m3 L2-normalized → dot product = cosine sim)
    embeddings = await embed_texts(sentences)

    chunks: list[str] = []
    current: list[str] = [sentences[0]]
    current_len: int = len(sentences[0])

    for i in range(1, len(sentences)):
        sent = sentences[i]
        sent_len = len(sent)

        # Dot product = cosine similarity (vectors are normalized)
        sim = sum(a * b for a, b in zip(embeddings[i-1], embeddings[i]))

        # Semantic jump or size overflow → break
        # But current chunk must meet MIN_CHUNK_CHARS first (avoid fragments)
        should_break = (
            sim < SEMANTIC_THRESHOLD or current_len + sent_len > MAX_CHUNK_CHARS
        ) and current_len >= MIN_CHUNK_CHARS

        if should_break:
            chunks.append("".join(current))
            current = [sent]
            current_len = sent_len
        else:
            current.append(sent)
            current_len += sent_len

    # Tail handling: merge too-short tail into previous chunk
    if current:
        tail = "".join(current)
        if chunks and len(tail) < MIN_CHUNK_CHARS:
            chunks[-1] += tail
        else:
            chunks.append(tail)

    return chunks
The MIN_CHUNK_CHARS = 50 tail merge is an important practical detail. Document endings often contain a brief conclusion or disclaimer. Left as a standalone chunk in the vector store, its embedding is semantically vague and becomes retrieval noise. Merging it into the previous chunk turns the conclusion into a natural ending for the preceding argument, which is semantically cleaner and better for retrieval.
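
One detail the listing leaves open: sentences are stripped and then joined with an empty string, which reads naturally for Chinese but runs English sentences together ("First.Second."). A small join helper along the following lines could address that. It is a sketch with a hypothetical name, and adopting it would also mean counting the added space in `current_len`.

```python
def join_sentences(parts):
    """Join stripped sentences, adding a space only at ASCII-to-ASCII seams."""
    out = ""
    for part in parts:
        # part[:1] is the first character (or "" for an empty part)
        if out and out[-1].isascii() and part[:1].isascii():
            out += " "
        out += part
    return out

print(join_sentences(["Hello world.", "Next one."]))  # Hello world. Next one.
print(join_sentences(["你好。", "世界！"]))              # 你好。世界！
print(join_sentences(["See below.", "如下所示。"]))      # See below.如下所示。
```

Seams where either side is a CJK character stay tight, so bilingual chunks keep their original look.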

Fallback: fixed-size chunking

If the embedding service is unavailable (model load failure, OOM, etc.), the system automatically degrades to sliding-window chunking:

CHUNK_SIZE    = 350   # window size
CHUNK_OVERLAP = 50    # overlap characters

def _chunk_fixed(text: str) -> list[str]:
    if len(text) <= CHUNK_SIZE:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + CHUNK_SIZE, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - CHUNK_OVERLAP
    return chunks

The fallback window (350 chars) is smaller than the semantic chunking cap (600 chars) and adds a 50-char overlap. Without semantic boundary guarantees, smaller windows plus overlap partially mitigate mid-sentence splits. This is less precise than semantic chunking, but it keeps the system functional when the embedding service is degraded.
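
The degradation path itself is not shown above. A minimal sketch of how it might be wired follows, assuming a simple try/except around the semantic chunker. `chunk_with_fallback` is a hypothetical name, and `chunk_semantic` is a stub that simulates an embedding outage.

```python
import asyncio

CHUNK_SIZE = 350
CHUNK_OVERLAP = 50

def chunk_fixed(text):
    """Sliding window: each chunk starts CHUNK_OVERLAP chars before the previous end."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + CHUNK_SIZE, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - CHUNK_OVERLAP
    return chunks

async def chunk_semantic(text):
    # Stand-in that simulates the embedding service being down.
    raise RuntimeError("embedding service unavailable")

async def chunk_with_fallback(text):
    try:
        return await chunk_semantic(text)
    except Exception:
        # Broad catch for brevity; real code would likely log and narrow this.
        return chunk_fixed(text)

text = "".join(chr(97 + i % 26) for i in range(1000))
chunks = asyncio.run(chunk_with_fallback(text))
print([len(c) for c in chunks])
# [350, 350, 350, 100]
```

A 1,000-character input yields windows starting at 0, 300, 600 and 900, so each chunk repeats the last 50 characters of its predecessor.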

Embedding service: singleton + thread pool

bge-m3 is loaded via sentence-transformers with a global singleton and double-checked locking to ensure single initialization. All encode calls run in a thread pool, never blocking the asyncio event loop:

import asyncio
import threading

from sentence_transformers import SentenceTransformer

MODEL_NAME = "BAAI/bge-m3"  # 1024-dim, Chinese+English bilingual

_model = None
_model_lock = threading.Lock()

def _get_model():
    global _model
    if _model is None:
        with _model_lock:             # double-checked locking
            if _model is None:
                _model = SentenceTransformer(MODEL_NAME)
    return _model

def _encode_sync(texts: list[str]) -> list[list[float]]:
    model = _get_model()
    # normalize_embeddings=True → dot product = cosine similarity
    vecs = model.encode(texts, normalize_embeddings=True)
    return vecs.tolist()

async def embed_texts(texts: list[str]) -> list[list[float]]:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _encode_sync, texts)

Context injection: how retrieval results enter the AI

Retrieved chunks and conversation memories are injected as system messages, positioned after summaries and anchors but before conversation history. The final structure (sub-thread example):

[
  {"role": "system", "content": "[Main thread summary] ...800 tokens..."},
  {"role": "system", "content": "[Depth-1 sub-thread summary] ...500 tokens..."},
  {"role": "system", "content": 'Anchor: \"the text the user selected\"'},
  {"role": "system", "content": "[RAG] File chunks:\n  [report.pdf chunk 3] ...\n  [report.pdf chunk 7] ..."},
  {"role": "system", "content": "[RAG] Conversation memory:\n  User: ...\n  AI: ..."},
  {"role": "user",   "content": "recent 10 messages..."},
  {"role": "assistant", "content": "..."},
  ...
]

Total context is capped at 18,000 characters (~7,200 tokens). Oversized user/assistant messages are replaced with placeholders pointing to the RAG system messages; if still over limit, the oldest conversation messages are dropped one by one — system messages (summaries, anchors, RAG) are never removed.
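
The trimming order described here can be sketched as follows. This is an illustration under stated assumptions, not the context_builder source: the function name, the placeholder wording, and the reuse of the 3,000-character per-message cap as the "oversized" test are all guesses.

```python
MAX_CONTEXT_CHARS = 18_000
MAX_MESSAGE_CHARS = 3_000  # assumed to be the per-message truncation cap
PLACEHOLDER = "[content omitted; see the RAG file chunks above]"  # hypothetical wording

def fit_context(messages, max_chars=MAX_CONTEXT_CHARS, max_message=MAX_MESSAGE_CHARS):
    def total(msgs):
        return sum(len(m["content"]) for m in msgs)

    # Step 1: swap oversized user/assistant messages for a placeholder.
    fitted = [
        {**m, "content": PLACEHOLDER}
        if m["role"] != "system" and len(m["content"]) > max_message else m
        for m in messages
    ]
    # Step 2: drop the oldest non-system messages until the total fits.
    while total(fitted) > max_chars:
        oldest = next((i for i, m in enumerate(fitted) if m["role"] != "system"), None)
        if oldest is None:
            break  # only system messages remain; they are never removed
        del fitted[oldest]
    return fitted

messages = [
    {"role": "system", "content": "S" * 1000},
    {"role": "user", "content": "u" * 5000},
    {"role": "assistant", "content": "a" * 2000},
    {"role": "user", "content": "q" * 100},
]
trimmed = fit_context(messages, max_chars=2500)
print([m["role"] for m in trimmed])
# ['system', 'user']
```

With a deliberately small cap of 2,500 characters, the oversized user message is first replaced and then dropped along with the assistant reply, leaving the system message and the most recent user turn.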