RAG and Semantic Chunking: Why the Split Strategy Determines Retrieval Quality
Fixed-size chunking is the most common source of RAG failure. Deeppin uses embedding-based semantic chunking: split by sentence first, then use vector distance to detect topic boundaries, ensuring each chunk is a complete semantic unit.
The most overlooked component in a RAG pipeline is chunking. No matter how good the embedding model or how fast the vector database, if chunk boundaries cut through the middle of an argument — splitting a complete thought in two — both halves are incomplete in isolation. Retrieval quality is capped the moment you split the document.
§ 01 Part 1 — How fixed-size chunking fails
The most common chunking approach is fixed character count (e.g., 800 chars per chunk, 100-char overlap). Simple to implement, but with a fundamental problem: semantic boundaries have nothing to do with character count.
# Typical fixed-size failure
Original text:
  "...the attention mechanism computes weights via Query, Key, and Value matrices. [800-char boundary cuts here] This design allows the model to process all sequence positions in parallel..."
Resulting chunks:
  Chunk A: ...attention mechanism computes weights via Query, Key, Value matrices.
  Chunk B: This design allows the model to process all positions in parallel...
# User asks: "how does attention enable parallelism?"
# Chunk A mentions the mechanism but not parallelism
# Chunk B mentions parallelism but not why
# Neither chunk can answer the question alone
Overlap is a band-aid, not a fix — it only helps near boundaries, and can't guarantee a complete reasoning unit stays in a single chunk.
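To make the failure concrete, here is a minimal sketch of a sliding-window chunker applied to the attention example. The function name `chunk_fixed` and the tiny window size (80 characters instead of 800) are illustrative assumptions chosen so the effect shows up on a two-sentence text.

```python
def chunk_fixed(text: str, size: int, overlap: int) -> list[str]:
    """Sliding-window chunking by character count (illustrative sketch)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so neighbouring chunks share some text
    return chunks


text = (
    "The attention mechanism computes weights via Query, Key, and Value matrices. "
    "This design allows the model to process all sequence positions in parallel."
)

# A small window stands in for the 800/100 setting described above.
chunks = chunk_fixed(text, size=80, overlap=10)

# Chunks that mention both the mechanism and the parallelism claim.
answers_alone = [c for c in chunks if "Query" in c and "parallel" in c]
print(len(answers_alone))
```

No chunk contains both halves of the argument, and the 10-character overlap does not change that: the overlap only repeats a few characters around the cut.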
§ 02 Part 2 — The semantic chunking approach
The core idea: let the embedding model decide where to cut. If two adjacent sentences have a sudden increase in vector distance, a topic shift has occurred — that's a natural boundary.
Deeppin's semantic chunking has three steps:
- Step 1: sentence splitting — break the document into the smallest semantic units (sentences) using punctuation rules
- Step 2: compute cosine distance between adjacent sentence embeddings, find distance spikes (topic boundaries)
- Step 3: merge consecutive sentences within the same topic into one chunk, stopping at topic boundaries or token limits
import numpy as np
def semantic_chunk(text: str, model, breakpoint_threshold: float = 0.3) -> list[str]:
    sentences = split_sentences(text)  # split on 。!?.!? etc.
    if len(sentences) <= 1:
        return sentences
    # Batch embed all sentences; normalized vectors make dot product = cosine similarity
    vecs = model.encode(sentences, batch_size=16, normalize_embeddings=True)
    # Cosine distance between adjacent sentences
    distances = [
        1 - float(np.dot(vecs[i], vecs[i + 1]))
        for i in range(len(vecs) - 1)
    ]
    # Distance spikes = topic transitions (a set gives O(1) membership checks)
    breakpoints = {i for i, d in enumerate(distances) if d > breakpoint_threshold}
    # Merge sentences into chunks; breakpoint i means "cut after sentence i"
    chunks, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        if i in breakpoints or i == len(sentences) - 1:
            chunks.append("".join(current))
            current = []
    return chunks
§ 03 Part 3 — Choosing breakpoint_threshold
The threshold controls chunk granularity. Distance = 1 − cosine similarity: 0 means identical, 2 means opposite.
- threshold = 0.2: fine-grained — cuts at every topic shift, many short chunks (~2–3 sentences each)
- threshold = 0.3: balanced — works for most documents (Deeppin's default)
- threshold = 0.5: coarse — only cuts at major topic jumps, fewer longer chunks
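The effect of the threshold on granularity can be seen with a small sketch. The distance values below are made up for illustration and `count_chunks` is a hypothetical helper, not part of Deeppin; it only counts how many cuts a given threshold would produce.

```python
def count_chunks(distances: list[float], threshold: float) -> int:
    """Number of chunks when every adjacent-sentence distance above threshold is a cut."""
    cuts = sum(1 for d in distances if d > threshold)
    return cuts + 1


# Hypothetical adjacent-sentence cosine distances for a 9-sentence document.
distances = [0.12, 0.25, 0.08, 0.41, 0.18, 0.33, 0.10, 0.62]

for threshold in (0.2, 0.3, 0.5):
    print(threshold, count_chunks(distances, threshold))
# prints:
# 0.2 5
# 0.3 4
# 0.5 2
```

Raising the threshold never adds cuts, so chunk count can only stay the same or fall as the threshold grows.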
Deeppin also adds a hard token cap: even without a topic boundary, chunks exceeding 600 tokens are force-split. Very long chunks dilute the embedding's semantic precision even though bge-m3 supports 8192-token inputs.
MAX_CHUNK_TOKENS = 600
def should_force_break(current_tokens: int, next_sentence: str) -> bool:
    return current_tokens + count_tokens(next_sentence) > MAX_CHUNK_TOKENS
§ 04 Part 4 — Semantic vs. fixed-size: a real comparison
Same technical document (~3,000 characters on CAP theorem in distributed systems)
Fixed chunking (chunk_size=800, overlap=100):
  → 4 chunks, avg 750 chars
  → "consistency" and "the availability trade-off" split across different chunks
  → Recall rate for "CAP theorem trade-off analysis": ~60%
Semantic chunking (threshold=0.3):
  → 7 chunks, avg 380 chars
  → Each chunk covers one concept (consistency, availability, partition tolerance, trade-offs)
  → Recall rate for "CAP theorem trade-off analysis": ~88%
More, shorter chunks look less 'efficient' than fixed chunking, but each chunk's semantic purity is higher — its embedding vector more faithfully represents its content. That's the fundamental factor that determines retrieval quality.
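The article does not say how the recall figures were measured. One common way to obtain a number of this kind is recall@k over a set of labelled queries; the sketch below assumes that setup, and the function name, chunk ids, and labels are all hypothetical.

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Average fraction of relevant chunk ids that appear in the top-k results."""
    scores = []
    for ranked_ids, gold in zip(retrieved, relevant):
        if not gold:
            continue  # skip queries that have no labelled answer
        hits = len(gold & set(ranked_ids[:k]))
        scores.append(hits / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


# Two hypothetical queries: ranked retrieval results and their labelled relevant chunks.
retrieved = [["c1", "c4", "c2"], ["c7", "c3", "c9"]]
relevant = [{"c1", "c2"}, {"c5"}]

print(recall_at_k(retrieved, relevant, k=3))  # prints 0.5
```

Because chunk ids differ between chunking strategies, a fair comparison would label the answer passages in the source document and map them to whichever chunks contain them under each strategy.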
§ 05 Part 5 — Deeppin's dual-track RAG
Semantic chunking applies to Track 1 — file attachments. Deeppin has two independent RAG data sources:
Track 1: attachment_chunks (semantic chunking)
Uploaded files are semantically chunked, each chunk embedded and stored. Retrieval does cosine similarity search in pgvector, returning the 4–5 most relevant chunks.
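In pgvector, cosine distance is exposed through the `<=>` operator, so the database does this ranking itself. The in-memory sketch below only illustrates the logic of a thresholded top-k search over unit-length vectors; `top_k_by_cosine` and the toy 2-D vectors are assumptions for illustration, not Deeppin's query code.

```python
def top_k_by_cosine(query_vec, rows, top_k=4, threshold=0.45):
    """rows: iterable of (chunk_text, unit-length vector). Best matches come first."""
    scored = []
    for text, vec in rows:
        # For unit-length vectors the dot product equals cosine similarity.
        sim = sum(q * v for q, v in zip(query_vec, vec))
        if sim >= threshold:
            scored.append((sim, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


rows = [
    ("chunk about consistency", (1.0, 0.0)),
    ("chunk about availability", (0.6, 0.8)),
    ("unrelated chunk", (0.0, 1.0)),
]

print(top_k_by_cosine((1.0, 0.0), rows))
# prints ['chunk about consistency', 'chunk about availability']
```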
Track 2: conversation_memories (whole-turn embedding)
After each turn, (user message + AI reply) is embedded as a whole and stored — no chunking needed. This lets the current thread recall what other threads in the session discussed. It's the only channel for information to flow between sibling threads.
# Concurrent dual-track retrieval
chunk_res, memory_res = await asyncio.gather(
    search_attachment_chunks(query_vec, session_id, top_k=4, threshold=0.45),
    search_conversation_memories(
        query_vec, session_id, top_k=3, threshold=0.45,
        exclude_thread_id=thread_id,
    ),
)
§ 06 Part 6 — Retrieval engineering details
Instruction-type query handling
"Summarize this file" — the query vector represents an action, not content. It sits far from file content vectors, falling below the similarity threshold. Fix: detect file-reference keywords, drop threshold to zero:
import re
FILE_REF_PATTERN = re.compile(
    r"(file|document|attachment|report|this|just uploaded|summarize|what does it say)",
    re.IGNORECASE,
)
is_file_ref = bool(FILE_REF_PATTERN.search(query_text))
threshold = 0.0 if is_file_ref else 0.45
Two-layer fallback
- Primary: threshold 0.45 — filters irrelevant results
- Empty result fallback: zero-threshold, force-return top-k — imperfect result beats no result
- Fresh upload: prefer_filename ensures the newly uploaded file's chunks rank above older files, which would otherwise win by sheer volume
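The fallback rules above can be sketched as a single ranking function. How Deeppin implements `prefer_filename` internally is not shown in the article, so the sort key below is an assumption, as are the function name and the `(similarity, filename, text)` tuple layout.

```python
def retrieve_with_fallback(scored_chunks, top_k=4, threshold=0.45, prefer_filename=None):
    """scored_chunks: list of (similarity, filename, text) tuples."""

    def rank(item):
        sim, filename, _ = item
        # Chunks from the preferred file sort first, then by similarity.
        return (filename == prefer_filename, sim)

    ranked = sorted(scored_chunks, key=rank, reverse=True)
    # Layer 1: keep only chunks that clear the similarity threshold.
    primary = [c for c in ranked if c[0] >= threshold]
    # Layer 2: nothing cleared the threshold, so return the best available.
    return (primary or ranked)[:top_k]


weak = [(0.30, "old.pdf", "x"), (0.20, "new.pdf", "y")]
print(retrieve_with_fallback(weak, prefer_filename="new.pdf"))
# prints [(0.2, 'new.pdf', 'y'), (0.3, 'old.pdf', 'x')]
```

With at least one chunk above the threshold, the function returns only the chunks that passed; the zero-threshold layer is used only when the primary result would be empty.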
§ 07 Part 7 — Why local embedding makes semantic chunking viable
Semantic chunking embeds every sentence in the document — 5–10x more embedding calls than fixed-size chunking. With a paid API like OpenAI's text-embedding-3-small, this would multiply costs proportionally. With bge-m3 deployed locally on Oracle ARM, the marginal cost per embedding is zero:
- 1024-dim vectors, optimized for both Chinese and English — suits Deeppin's bilingual context
- 8192-token max input — even long chunks embed in a single call
- Zero API cost — the high call volume from semantic chunking doesn't increase spending
- About 570M parameters (roughly 2.3 GB of fp32 weights), which fits comfortably in Oracle ARM's 24GB RAM
§ 08 Part 8 — Deeppin's actual implementation
The pseudocode above uses "distance > threshold" to explain the concept. The production code uses cosine similarity (not distance) and character counts (not token counts) as the chunk length unit. Below are the real parameters and core logic running in Deeppin's backend.
Entry routing: inline vs. RAG
Not every uploaded file needs the vector store. Short texts are injected directly as message context, skipping the chunking and embedding overhead:
INLINE_THRESHOLD = 3000  # characters
async def process_attachment(session_id, filename, content):
    text = await extract_text(content, filename)
    if len(text) <= INLINE_THRESHOLD:
        # Short text: inline, skip vector store
        return {"chunk_count": 0, "inline_text": text}
    # Long text: semantic chunk → embed → store
    chunks = await chunk_text_semantic(text)
    embeddings = await embed_texts(chunks)
    await store_chunks(session_id, filename, chunks, embeddings)
Sentence splitting
Split on blank lines first (paragraph boundaries), then on sentence-ending punctuation within each paragraph. Supports both Chinese (。!?) and English (.!?) punctuation:
import re
_SENT_SPLIT_RE = re.compile(r'(?<=[。!?.!?])\s*')
def _split_sentences(text: str) -> list[str]:
    sentences = []
    for para in re.split(r'\n\s*\n', text):  # split on blank lines
        para = para.strip()
        if not para:
            continue
        for sent in _SENT_SPLIT_RE.split(para):  # split on sentence-end punct
            sent = sent.strip()
            if sent:
                sentences.append(sent)
    return sentences
Core semantic chunking logic
Actual parameter table:
SEMANTIC_THRESHOLD = 0.75  # break when adjacent cosine similarity < 0.75
MAX_CHUNK_CHARS = 600      # max characters per chunk
MIN_CHUNK_CHARS = 50       # min characters; shorter chunks keep merging
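Note: this uses a similarity threshold (similarity < 0.75 → break) rather than the distance threshold (distance > 0.3 → break) from the pseudocode above. The two forms are interchangeable because distance = 1 − similarity, but the values are not identical: a similarity of 0.75 corresponds to a distance of 0.25, so production cuts slightly more readily than the 0.3 pseudocode default. Similarity is also more intuitive: if two sentences aren't similar enough, cut.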
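A quick check of the conversion, as a minimal sketch. The helper names are hypothetical; the point is only that "similarity < 0.75" and "distance > 0.25" make the same decision, and that for unit-length vectors the dot product is already the cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity for vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def breaks_by_similarity(sim, threshold=0.75):
    return sim < threshold


def breaks_by_distance(sim, threshold=0.25):
    return (1 - sim) > threshold


# Both rules agree on every sample value, including the boundary at 0.75.
samples = [0.9, 0.8, 0.75, 0.7, 0.5, 0.0]
agree = all(breaks_by_similarity(s) == breaks_by_distance(s) for s in samples)
print(agree)  # prints True

# For unit-length vectors, the plain dot product matches the full formula.
a, b = (0.6, 0.8), (1.0, 0.0)
dot = sum(x * y for x, y in zip(a, b))
print(math.isclose(dot, cosine_similarity(a, b)))  # prints True
```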
async def chunk_text_semantic(text: str) -> list[str]:
    sentences = _split_sentences(text)
    if len(sentences) <= 1:
        return sentences
    # Batch embed all sentences (bge-m3 L2-normalized → dot product = cosine sim)
    embeddings = await embed_texts(sentences)
    chunks: list[str] = []
    current: list[str] = [sentences[0]]
    current_len: int = len(sentences[0])
    for i in range(1, len(sentences)):
        sent = sentences[i]
        sent_len = len(sent)
        # Dot product = cosine similarity (vectors are normalized)
        sim = sum(a * b for a, b in zip(embeddings[i-1], embeddings[i]))
        # Semantic jump or size overflow → break
        # But current chunk must meet MIN_CHUNK_CHARS first (avoid fragments)
        should_break = (
            sim < SEMANTIC_THRESHOLD or current_len + sent_len > MAX_CHUNK_CHARS
        ) and current_len >= MIN_CHUNK_CHARS
        if should_break:
            chunks.append("".join(current))
            current = [sent]
            current_len = sent_len
        else:
            current.append(sent)
            current_len += sent_len
    # Tail handling: merge too-short tail into previous chunk
    if current:
        tail = "".join(current)
        if chunks and len(tail) < MIN_CHUNK_CHARS:
            chunks[-1] += tail
        else:
            chunks.append(tail)
    return chunks
Fallback: fixed-size chunking
If the embedding service is unavailable (model load failure, OOM, etc.), the system automatically degrades to sliding-window chunking:
CHUNK_SIZE = 350     # window size in characters
CHUNK_OVERLAP = 50   # overlap characters
def _chunk_fixed(text: str) -> list[str]:
    if len(text) <= CHUNK_SIZE:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + CHUNK_SIZE, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - CHUNK_OVERLAP  # step back so adjacent windows overlap
    return chunks
The fallback window (350 chars) is smaller than the semantic chunking cap (600 chars) and uses a 50-char overlap. Without semantic boundary guarantees, a smaller window plus overlap partially mitigates the mid-sentence splitting problem. It is less precise than semantic chunking, but it keeps the system functional in degraded scenarios.
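The degrade path itself is not shown in the article. A minimal sketch of how such a wrapper might look follows; `chunk_with_fallback` and the stand-in chunkers are hypothetical, and catching a broad `Exception` is a simplification of whatever error handling the real service uses.

```python
import asyncio


async def chunk_with_fallback(text, semantic_chunker, fixed_chunker):
    """Try the semantic chunker; fall back to fixed-size chunking on any failure."""
    try:
        return await semantic_chunker(text), "semantic"
    except Exception:
        # For example: the model failed to load, or the embedding call ran out of memory.
        return fixed_chunker(text), "fixed"


async def broken_semantic_chunker(text):
    # Stand-in for an embedding service that is down.
    raise RuntimeError("embedding service unavailable")


def simple_fixed_chunker(text):
    # Stand-in for a sliding-window chunker.
    return [text[i:i + 5] for i in range(0, len(text), 5)]


chunks, mode = asyncio.run(
    chunk_with_fallback("abcdefghij", broken_semantic_chunker, simple_fixed_chunker)
)
print(mode, chunks)  # prints: fixed ['abcde', 'fghij']
```

Returning the mode alongside the chunks lets the caller log how often the degraded path is taken.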
Embedding service: singleton + thread pool
bge-m3 is loaded via sentence-transformers with a global singleton and double-checked locking to ensure single initialization. All encode calls run in a thread pool, never blocking the asyncio event loop:
import asyncio
import threading
from sentence_transformers import SentenceTransformer
MODEL_NAME = "BAAI/bge-m3"  # 1024-dim, Chinese+English bilingual
_model = None
_model_lock = threading.Lock()
def _get_model():
    global _model
    if _model is None:
        with _model_lock:  # double-checked locking
            if _model is None:
                _model = SentenceTransformer(MODEL_NAME)
    return _model
def _encode_sync(texts: list[str]) -> list[list[float]]:
    model = _get_model()
    # normalize_embeddings=True → dot product = cosine similarity
    vecs = model.encode(texts, normalize_embeddings=True)
    return vecs.tolist()
async def embed_texts(texts: list[str]) -> list[list[float]]:
    loop = asyncio.get_running_loop()
    # None selects the loop's default thread pool executor
    return await loop.run_in_executor(None, _encode_sync, texts)
Context injection: how retrieval results enter the AI
Retrieved chunks and conversation memories are injected as system messages, positioned after summaries and anchors but before conversation history. The final structure (sub-thread example):
[
    {"role": "system", "content": "[Main thread summary] ...800 tokens..."},
    {"role": "system", "content": "[Depth-1 sub-thread summary] ...500 tokens..."},
    {"role": "system", "content": 'Anchor: \"the text the user selected\"'},
    {"role": "system", "content": "[RAG] File chunks:\n [report.pdf chunk 3] ...\n [report.pdf chunk 7] ..."},
    {"role": "system", "content": "[RAG] Conversation memory:\n User: ...\n AI: ..."},
    {"role": "user", "content": "recent 10 messages..."},
    {"role": "assistant", "content": "..."},
    ...
]
Total context is capped at 18,000 characters (~7,200 tokens). Oversized user/assistant messages are replaced with placeholders pointing to the RAG system messages; if the context is still over the limit, the oldest conversation messages are dropped one by one. System messages (summaries, anchors, RAG) are never removed.
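The final trimming step can be sketched as follows. This is a simplified illustration under stated assumptions: `trim_context` is a hypothetical name, length is measured as the sum of `content` characters, and the placeholder substitution for oversized messages is left out.

```python
MAX_CONTEXT_CHARS = 18_000


def trim_context(messages, limit=MAX_CONTEXT_CHARS):
    """Drop the oldest non-system messages until total content length fits the limit."""
    kept = list(messages)
    total = sum(len(m["content"]) for m in kept)
    while total > limit:
        # Find the oldest message that is not a system message.
        idx = next((i for i, m in enumerate(kept) if m["role"] != "system"), None)
        if idx is None:
            break  # only system messages remain; they are never removed
        total -= len(kept[idx]["content"])
        del kept[idx]
    return kept


messages = [
    {"role": "system", "content": "s" * 10},
    {"role": "user", "content": "a" * 8},
    {"role": "assistant", "content": "b" * 8},
    {"role": "user", "content": "c" * 8},
]

trimmed = trim_context(messages, limit=20)
print([m["role"] for m in trimmed])  # prints ['system', 'user']
```

In this example the two oldest conversation messages are dropped and the system message survives, which matches the rule that summaries, anchors, and RAG messages are never removed.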