Giving Search Capability to Models That Can't Search

How Deeppin uses a SearXNG + LLM pipeline to give any conversation model real-time web search capability — query detection, result filtering, and streaming output.

2026-04-15 · 11 min read · search · SearXNG · SSE

Open-source models on Groq have no internet access. But users regularly ask questions requiring real-time information. A two-stage pipeline solves this: search first, then synthesize.

§ 01 Part 1 - Query detection: when to trigger search

Search runs in one of two cases:

1. The user explicitly enables web search mode (frontend toggle).
2. Auto-detection: the question is analyzed to determine whether real-time information is needed.

Auto-detection uses two layers: a rule pre-filter, then LLM classification.

Layer 1: rule pre-filter

A regex scan that adds no measurable latency. Any match goes straight to the search pipeline with no LLM call:

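import re

# Explicit recency signals. Note that \d{4} matches any four-digit number
# (years, but also ports or IDs), so this filter can over-trigger.
RECENCY_PATTERNS = re.compile(
    r"today|latest|current|right now|just released"
    r"|\d{4}|news|stock price|earnings",
    re.IGNORECASE,
)

def quick_check(query: str) -> bool:
    """Return True if the query contains an explicit recency keyword."""
    return bool(RECENCY_PATTERNS.search(query))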
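Because the pattern has no word boundaries, it also fires on substrings and on any four-digit number. The sketch below runs the article's pattern against a few sample queries and compares it with a stricter variant; `STRICT_PATTERNS` and the sample queries are hypothetical, added only to illustrate the trade-off.

```python
import re

# Same pattern as in the article.
RECENCY_PATTERNS = re.compile(
    r"today|latest|current|right now|just released"
    r"|\d{4}|news|stock price|earnings",
    re.IGNORECASE,
)

# Hypothetical stricter variant: whole words only, years limited to 2000-2099.
STRICT_PATTERNS = re.compile(
    r"\b(?:today|latest|current|right now|just released"
    r"|20\d{2}|news|stock price|earnings)\b",
    re.IGNORECASE,
)

samples = [
    "What's the latest Python release?",       # explicit keyword
    "Why does port 8080 refuse connections?",  # four digits, not a year
    "Explain concurrent programming",          # "current" inside "concurrent"
    "What is a closure?",                      # no recency signal
]

loose = [bool(RECENCY_PATTERNS.search(q)) for q in samples]
strict = [bool(STRICT_PATTERNS.search(q)) for q in samples]

for q, a, b in zip(samples, loose, strict):
    print(f"{q!r}: article pattern={a}, strict variant={b}")
```

The loose pattern sends the second and third queries to search unnecessarily. Since an extra search only costs latency, that may be acceptable; the stricter variant is worth considering only if false positives become noticeable.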

Layer 2: LLM classification

When no rule matches, a lightweight summarizer-tier model makes a semantic judgment. Rules catch explicit keywords but miss implicit recency, such as 'who won the Champions League final?', which contains no trigger word. LLM semantic understanding fills that gap:

CLASSIFIER_PROMPT = """Does this question require real-time web search?
Answer only yes or no.

Needs search: real-time data, recent events, latest versions, current prices, today's news.
No search needed: concept explanations, code debugging, historical facts, pure reasoning.

Question: {query}
Answer:"""

async def llm_check(query: str) -> bool:
    resp = await router.acompletion(
        model="summarizer",   # lightweight, low latency, doesn't consume chat quota
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(query=query)}],
        max_tokens=3,         # only need yes/no
        temperature=0,
    )
    # content can be None on an empty completion; treat that as "no"
    answer = resp.choices[0].message.content or ""
    return answer.strip().lower().startswith("y")

async def should_search(query: str) -> bool:
    if quick_check(query):
        return True
    if len(query) > 10:       # skip LLM call for very short queries
        return await llm_check(query)
    return False

Note: LLM classification uses the summarizer tier, not the chat tier. It adds roughly 200–400 ms of latency and doesn't count against the main conversation quota. max_tokens=3 keeps the cost negligible, under 1% of a normal conversation turn. Misclassification costs are asymmetric: a missed search yields outdated information, while an unnecessary search only adds latency. Err toward searching.
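One way to act on that asymmetry is to fail open: if the classifier call raises or runs past a deadline, search anyway. This is a minimal sketch, not the article's implementation. `check_fail_open` is a hypothetical wrapper, `classify` stands in for `llm_check`, and the 1-second timeout is an illustrative value.

```python
import asyncio
from typing import Awaitable, Callable

async def check_fail_open(
    classify: Callable[[str], Awaitable[bool]],
    query: str,
    timeout_s: float = 1.0,  # assumed deadline; tune to your latency budget
) -> bool:
    """Run the classifier, but default to searching if it fails or stalls."""
    try:
        return await asyncio.wait_for(classify(query), timeout=timeout_s)
    except Exception:
        # Covers timeouts and provider errors alike: a wasted search only
        # costs latency, while a missed one returns stale information.
        return True
```

In `should_search`, the call `await llm_check(query)` could then become `await check_fail_open(llm_check, query)`.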

§ 02 Part 2 - SearXNG: self-hosted meta-search

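SearXNG aggregates Google, Bing, DuckDuckGo, and other engines behind one API. It runs on the same Oracle machine as the backend, so calls from the backend add negligible network latency and there is no per-query fee. By comparison, the Google Search API costs $5 per 1,000 queries, which is not viable at scale.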

§ 03 Part 3 - The search pipeline

Step 1: query SearXNG

resp = await client.get(f"{SEARXNG_URL}/search", params={
    "q": query, "format": "json",
    "engines": "google,bing,duckduckgo",
    "time_range": "month",
})
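The snippet above assumes an existing async HTTP `client` and a configured `SEARXNG_URL`. For a dependency-free picture of the same request, here is a sketch using only the standard library; the base URL, the 5-second timeout, and the helper names `build_search_url` and `_fetch_json` are assumptions. SearXNG instances typically need `json` listed under `search.formats` in their settings before `format=json` is accepted, so check that on your instance.

```python
import asyncio
import json
import urllib.parse
import urllib.request

SEARXNG_URL = "http://localhost:8080"  # assumed address of the local instance

def build_search_url(query: str, base_url: str = SEARXNG_URL) -> str:
    """Encode the same parameters as the snippet above into a request URL."""
    params = {
        "q": query,
        "format": "json",
        "engines": "google,bing,duckduckgo",
        "time_range": "month",
    }
    return f"{base_url}/search?{urllib.parse.urlencode(params)}"

def _fetch_json(url: str, timeout_s: float) -> dict:
    # urlopen raises HTTPError on 4xx/5xx and URLError on network failures.
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)

async def search_searxng(query: str, timeout_s: float = 5.0) -> list[dict]:
    """Fetch raw results; errors propagate so the caller can degrade."""
    # urllib is blocking, so run it in a worker thread to keep the loop free.
    data = await asyncio.to_thread(_fetch_json, build_search_url(query), timeout_s)
    return data.get("results", [])
```

Letting errors propagate is deliberate here: the caller decides whether to fall back to a plain LLM answer.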

Step 2: filter and clean

results = resp.json().get("results", [])[:8]

filtered = [
    {"title": r.get("title", "")[:100],
     "url": r.get("url", ""),
     "content": r.get("content", "")[:400]}
    for r in results if r.get("content")
]
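Packaged as a function, the same filter is easy to unit-test. The sketch below mirrors the limits above (8 results, 100-character titles, 400-character snippets). The name `clean_results`, the whitespace collapsing, and the `None` handling are additions for illustration.

```python
def clean_results(raw: list[dict], limit: int = 8) -> list[dict]:
    """Keep the first `limit` results that have a snippet, trimmed for the prompt."""
    cleaned = []
    for r in raw[:limit]:
        # Collapse runs of whitespace and newlines so snippets stay compact.
        content = " ".join((r.get("content") or "").split())
        if not content:
            continue  # a result without a snippet gives the LLM nothing to cite
        cleaned.append({
            "title": (r.get("title") or "")[:100],
            "url": r.get("url") or "",
            "content": content[:400],
        })
    return cleaned
```

Truncation bounds the prompt size: at most 8 entries of about 500 characters each, regardless of what the engines return.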

Step 3: LLM synthesis

Format results into a system message, let the LLM synthesize a coherent answer with source citations [1], [2], etc.:

search_context = "\n\n".join([
    f"[{i+1}] {r['title']}\n{r['url']}\n{r['content']}"
    for i, r in enumerate(filtered)
])
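The article later calls `build_search_prompt` without showing it. A hypothetical version might wrap the numbered context in a system message like this; the instruction wording and the message layout are assumptions, not Deeppin's actual prompt.

```python
def build_search_prompt(query: str, results: list[dict]) -> list[dict]:
    """Build chat messages: numbered search context as system, question as user."""
    search_context = "\n\n".join(
        f"[{i + 1}] {r['title']}\n{r['url']}\n{r['content']}"
        for i, r in enumerate(results)
    )
    system = (
        "Answer using the web search results below. "
        "Cite sources inline as [1], [2], etc.\n\n"
        f"{search_context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]
```

Numbering the entries in the context is what lets the model's [1], [2] citations map back to URLs on the frontend.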

§ 04 Part 4 - Streaming output

Search and LLM synthesis both run behind the SSE endpoint. Users see a 'Searching...' indicator first, then the LLM answer streams token by token:

async def search_stream(query):
    yield sse_event("status", "Searching...")
    results = await search_searxng(query)
    async for chunk in llm_stream(build_search_prompt(query, results)):
        yield sse_event("token", chunk)
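The `sse_event` helper is not shown in the article. In the Server-Sent Events wire format, a frame is an optional `event:` line, one or more `data:` lines, and a blank line as terminator. A minimal sketch follows; JSON-encoding the payload is an assumption made here so that tokens containing newlines cannot break the frame, and the client would need to `JSON.parse` each message.

```python
import json

def sse_event(event: str, data: str) -> str:
    """Format one Server-Sent Events frame with a named event type."""
    # json.dumps escapes newlines, so the payload always fits on one data: line.
    payload = json.dumps(data)
    return f"event: {event}\ndata: {payload}\n\n"

frame = sse_event("token", "Hello\nworld")
print(repr(frame))
```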

§ 05 Part 5 - Graceful degradation

When SearXNG is unavailable (server down, network issues), the conversation never fails. It degrades to a pure LLM response with a disclaimer:

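system_note = ""  # stays empty when search succeeds
try:
    results = await search_searxng(query)
except Exception:
    # Search failed: answer from model knowledge and tell the user.
    results = []
    system_note = "(Note: web search temporarily unavailable, answering from model knowledge.)"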
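The same fallback can be wrapped in a small function that always returns both values, which makes the degraded path testable without a live SearXNG instance. This is a sketch under assumptions: `search_or_degrade` is a hypothetical name, and the search coroutine is passed in so that a stub can simulate an outage.

```python
import asyncio
from typing import Awaitable, Callable

UNAVAILABLE_NOTE = (
    "(Note: web search temporarily unavailable, answering from model knowledge.)"
)

async def search_or_degrade(
    search: Callable[[str], Awaitable[list]],
    query: str,
) -> tuple[list, str]:
    """Return (results, system_note); the note is empty when search worked."""
    try:
        return await search(query), ""
    except Exception:
        return [], UNAVAILABLE_NOTE

async def _down(query: str) -> list:
    # Stub that simulates an unreachable SearXNG server.
    raise ConnectionError("searxng unreachable")

results, note = asyncio.run(search_or_degrade(_down, "latest rust release"))
print(results, note)
```

Whatever builds the prompt can then prepend `note` to the system message when it is non-empty, so the model's answer carries the disclaimer.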