Giving Search Capability to Models That Can't Search
How Deeppin uses a SearXNG + LLM pipeline to give any conversation model real-time web search capability — query detection, result filtering, and streaming output.
Open-source models on Groq have no internet access. But users regularly ask questions requiring real-time information. A two-stage pipeline solves this: search first, then synthesize.
§ 01 Part 1 — Query detection: when to trigger search
- User explicitly enables web search mode (frontend toggle)
- Auto-detection: analyze the question to determine if real-time information is needed
Auto-detection uses two layers: a rule pre-filter, then LLM classification.
Layer 1: rule pre-filter
A regex scan that adds negligible latency. Any match goes straight to the search pipeline with no LLM call:
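import re

# Explicit recency signals. Matching is by substring, and \d{4} fires on any
# four-digit number, so this layer errs toward searching.
RECENCY_PATTERNS = re.compile(
    r"today|latest|current|right now|just released"
    r"|\d{4}|news|stock price|earnings",
    re.IGNORECASE,
)

def quick_check(query: str) -> bool:
    """Return True if the query contains an explicit recency keyword."""
    return bool(RECENCY_PATTERNS.search(query))

Layer 2: LLM classification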
When no rule matches, a lightweight summarizer-tier model makes a semantic judgment. Rules catch explicit signals but miss implicit recency such as 'how did Tesla do last quarter?', which contains none of the keywords above. The LLM's semantic understanding fills that gap:
CLASSIFIER_PROMPT = """Does this question require real-time web search?
Answer only yes or no.
Needs search: real-time data, recent events, latest versions, current prices, today's news.
No search needed: concept explanations, code debugging, historical facts, pure reasoning.
Question: {query}
Answer:"""

# `router` is the application's LLM router, configured elsewhere.
async def llm_check(query: str) -> bool:
    resp = await router.acompletion(
        model="summarizer",  # lightweight, low latency, doesn't consume chat quota
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(query=query)}],
        max_tokens=3,  # only need yes/no
        temperature=0,
    )
    # content can be None on an empty completion; treat that as "no"
    answer = resp.choices[0].message.content or ""
    return answer.strip().lower().startswith("y")

async def should_search(query: str) -> bool:
    if quick_check(query):  # Layer 1: regex only, no model call
        return True
    if len(query) > 10:  # skip LLM call for very short queries
        return await llm_check(query)
    return False

§ 02 Part 2 — SearXNG: self-hosted meta-search
SearXNG aggregates Google, Bing, DuckDuckGo, and others behind one API. It runs on the same Oracle machine as the backend, so the backend-to-SearXNG hop adds negligible latency and there is no per-query fee. Google's Search API costs $5 per 1,000 queries, which is not viable at scale.
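To make the cost argument concrete, here is a rough back-of-the-envelope sketch. The $5 per 1,000 queries figure comes from the text above; the daily query volumes and the 30-day month are hypothetical, and the function ignores any free tier or daily quota a metered API might have.

```python
def monthly_api_cost(queries_per_day: int, usd_per_1000: float = 5.0, days: int = 30) -> float:
    """Return the monthly bill in USD for a search API billed per 1,000 queries."""
    return queries_per_day * days * usd_per_1000 / 1000


if __name__ == "__main__":
    # Hypothetical traffic levels, chosen only to show how the bill scales.
    for qpd in (100, 1_000, 10_000):
        print(f"{qpd:>6} queries/day -> ${monthly_api_cost(qpd):,.2f}/month")
```

The bill grows linearly with traffic, while a self-hosted SearXNG instance costs the same regardless of query volume.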
§ 03 Part 3 — The search pipeline
Step 1: query SearXNG
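# `client` is a shared async HTTP client (for example httpx.AsyncClient).
# SearXNG only serves JSON if "json" is listed under search.formats in settings.yml.
resp = await client.get(f"{SEARXNG_URL}/search", params={
    "q": query, "format": "json",
    "engines": "google,bing,duckduckgo",
    "time_range": "month",  # only results from the past month
})

Step 2: filter and clean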
# Keep the top 8 hits, drop any without a snippet, and cap field lengths
# to bound the prompt size.
results = resp.json().get("results", [])[:8]
filtered = [
    {"title": r.get("title", "")[:100],
     "url": r.get("url", ""),
     "content": r.get("content", "")[:400]}
    for r in results if r.get("content")
]

Step 3: LLM synthesis
Format the results into a system message and let the LLM synthesize a coherent answer that cites its sources as [1], [2], and so on:
# Number each result so the model can cite it as [n].
search_context = "\n\n".join([
    f"[{i+1}] {r['title']}\n{r['url']}\n{r['content']}"
    for i, r in enumerate(filtered)
])

§ 04 Part 4 — Streaming output
Search and LLM synthesis both run behind the SSE endpoint. Users first see a 'Searching...' indicator, then the LLM answer streams token by token:
async def search_stream(query):
    yield sse_event("status", "Searching...")
    results = await search_searxng(query)
    async for chunk in llm_stream(build_search_prompt(query, results)):
        yield sse_event("token", chunk)

§ 05 Part 5 — Graceful degradation
When SearXNG is unavailable (server down, network issues), a search failure does not fail the conversation. It degrades to a pure LLM response with a disclaimer:
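system_note = ""  # stays empty when search succeeds
try:
    results = await search_searxng(query)
except Exception:
    # Any search failure falls back to an empty result set plus a disclaimer.
    results = []
    system_note = "(Note: web search temporarily unavailable, answering from model knowledge.)"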
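The snippets in Parts 3 to 5 call `sse_event` and `build_search_prompt` without showing them. The following is a minimal sketch of how the pieces might fit together, not Deeppin's actual implementation. The bodies of `sse_event` and `build_search_prompt` are assumptions, as are the JSON-encoded `data:` payload, the wording of the system instruction, and the choice to pass the search and LLM functions in as parameters so the example runs without a live SearXNG or model.

```python
import asyncio
import json

FALLBACK_NOTE = "(Note: web search temporarily unavailable, answering from model knowledge.)"


def sse_event(event: str, data: str) -> str:
    """Frame one Server-Sent Event. JSON-encoding keeps embedded newlines
    on a single data line (assumed wire format, not confirmed by the source)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def build_search_prompt(query: str, results: list[dict]) -> list[dict]:
    """Build chat messages: numbered sources when available, else the fallback note."""
    if results:
        context = "\n\n".join(
            f"[{i + 1}] {r['title']}\n{r['url']}\n{r['content']}"
            for i, r in enumerate(results)
        )
        # Hypothetical instruction wording; tune it for your model.
        system = (
            "Answer using the search results below and cite sources as [1], [2], etc.\n\n"
            + context
        )
    else:
        system = FALLBACK_NOTE
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]


async def search_stream(query, search, llm_stream):
    """Yield a status event, then one token event per streamed LLM chunk.

    `search` is an async callable returning a list of result dicts;
    `llm_stream` is an async generator function taking chat messages.
    """
    yield sse_event("status", "Searching...")
    try:
        results = await search(query)
    except Exception:
        results = []  # degrade to a model-only answer
    async for chunk in llm_stream(build_search_prompt(query, results)):
        yield sse_event("token", chunk)


if __name__ == "__main__":
    async def broken_search(query):  # stand-in for an unreachable SearXNG
        raise ConnectionError("SearXNG unreachable")

    async def fake_llm(messages):  # stand-in for the real model stream
        for word in ("Answering", " from", " memory."):
            yield word

    async def main():
        async for event in search_stream("what changed?", broken_search, fake_llm):
            print(event, end="")

    asyncio.run(main())
```

Passing `search` and `llm_stream` as arguments is a testing convenience. In the snippets above they are the module-level `search_searxng` and `llm_stream`.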