
Technical Blog

Compound AI Architecture: Semantic Routing for 90% Inference Cost Reduction

Tags: llm, slm, inference, routing, architecture, neo4j, optimization, enterprise

How to build a three-tier inference stack that routes user queries between local SLMs and frontier APIs using semantic classification, with Python implementation, Neo4j-backed domain routing, and a realistic cost model for B2B workloads.

The death of the monolithic LLM call

Every POST /v1/chat/completions that hits a frontier API carries the same implicit assumption: this query deserves 200B+ parameters of dense attention, cross-datacenter latency, and $15–60 per million output tokens. For the median production workload—FAQ deflection, entity extraction, document classification, slot-filling in a multi-turn agent—that assumption is empirically false and financially ruinous at scale.

A 10,000-seat B2B SaaS product with an AI copilot generating ~50 LLM calls per user per day burns through 500K calls daily. At an average of 800 output tokens per call on a frontier model priced at $30/M output tokens, that is $12,000/day—$4.38M annualized—before you account for input tokens and retries. Most of those calls are asking a 400B-parameter model to do work a 3B checkpoint could handle in 40ms on a $300 mini-PC.
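The arithmetic behind those figures is worth a quick sanity check (output-token cost only, as stated above):

```python
# Sanity check of the baseline cost figures quoted above.
calls_per_day = 10_000 * 50           # 10,000 seats x ~50 LLM calls/user/day
output_tokens_per_call = 800          # average output tokens per call
price_per_m_output = 30.0             # $/M output tokens, frontier model

daily_cost = calls_per_day * output_tokens_per_call / 1_000_000 * price_per_m_output
annual_cost = daily_cost * 365

print(calls_per_day, daily_cost, annual_cost)  # → 500000 12000.0 4380000.0
```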

The monolithic call pattern is not a reasonable default. It is an anti-pattern: it couples latency SLOs to a vendor's batching queue, cost structure to a pricing page you do not control, and data residency to a jurisdiction you cannot audit. The compound alternative is straightforward: classify, then route.

Enter semantic routing

Semantic routing is the practice of mapping an incoming query to an inference tier before execution, using a lightweight classifier that operates on the query's embedding representation. The router's job is not to understand the query—it is to estimate how hard the query is along axes that map to your tiered backend:

  • Lexical/factual retrieval: the answer exists verbatim in your knowledge base. Cache or SLM territory.
  • Structured extraction: the task is constrained by a schema (JSON, tool call, slot fill). SLM with guided decoding.
  • Multi-hop reasoning: the query requires chaining facts, comparing entities, or synthesizing across documents. Frontier territory.
  • Out-of-domain / ambiguous: the query falls outside your system's scope. Reject or escalate with explicit UX.

The key insight is that the embedding space of a small model already encodes enough signal to discriminate these classes. You do not need a frontier model to decide whether to call a frontier model. A 384-dimensional embedding from a 30M-parameter encoder, plus a trained linear head or a nearest-centroid classifier, makes the routing decision in under 2ms on CPU.

Why embeddings, not keyword rules

Rule-based routers ("if the query contains 'compare' or 'analyze', route to GPT-5") are brittle, high-maintenance, and blind to paraphrase. Embedding-based routing captures semantic similarity to labeled exemplars: you define clusters of queries per tier, embed them, and at inference time, classify the new query by proximity. This generalizes across phrasing, survives typos, and lets you retrain the boundary by adding a few dozen labeled examples—no regex archaeology required.

The architecture: three tiers

User Query → Embedding Model + Router Head (30M params, CPU, under 2ms) → route to tier:

|               | Tier 1: Reject / Cache | Tier 2: Local SLM | Tier 3: Frontier API |
|---------------|------------------------|-------------------|----------------------|
| Latency       | under 1ms              | 20–80ms           | 500–3000ms           |
| Cost per call | ~$0                    | ~$0.001           | ~$0.02–0.06          |
| Traffic share | ~20%                   | ~72%              | ~8%                  |
| Runs on       | In-memory cache        | On-prem CPU/GPU   | Vendor API           |

Tier 1: reject or cache hit

Before any model runs, check two things:

  1. Scope guard. If the router's confidence across all in-scope clusters falls below a threshold, the query is out-of-domain. Return a canned response or escalate to a human. This prevents both wasted compute and hallucinated answers on topics your system was never designed to handle.

  2. Semantic cache. Hash the query embedding (locality-sensitive hashing or quantized vector lookup) against a cache of recent query–response pairs. If the cosine similarity to a cached embedding exceeds a threshold (typically 0.95–0.98 depending on your tolerance), return the cached response. For high-volume B2B workloads with repetitive internal queries, cache hit rates of 15–30% are common and eliminate inference entirely.

Cost: effectively zero. Latency: sub-millisecond after embedding.

Tier 2: local SLM execution

Queries classified as structurally simple—FAQ answers, entity extraction, classification, reformulation, tool-call argument filling—route to a locally hosted small language model.

The choice of checkpoint matters. As I covered in a previous analysis of the SLM landscape, models in the 1B–8B range (Qwen2.5, Phi-4-mini, Llama-edge variants) achieve production-grade quality on constrained tasks when distilled from frontier teachers and served with schema-guided decoding. For latency-critical paths where even INT4 GPU inference is overkill, ternary 1.58-bit models served via BitNet.cpp push decode to pure CPU SIMD with sub-50ms first-token latency on commodity hardware.

Hardware budget for Tier 2 is modest: a single node with 32GB RAM and a mid-range CPU (or an RTX 4060-class GPU for INT4 serving) handles hundreds of concurrent SLM requests. The critical constraint is memory bandwidth, not FLOPS—the same roofline story that makes SLMs viable in the first place.

Tier 3: frontier API escalation

Only queries that the router classifies as requiring multi-hop reasoning, long-range synthesis, or creative generation reach the frontier API. The design goal is that this tier handles ≤10% of total traffic. Every escalation should be logged with the router's confidence score and the downstream task outcome, feeding a flywheel that tightens the routing boundary over time.

Tier 3 calls should be async by default in user-facing flows. If the user is waiting, show a progress indicator and stream partial results. If the call is part of an agent loop, decouple it from the synchronous turn so the rest of the pipeline is not blocked on a 2-second API round trip.
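The decoupling pattern is plain asyncio: fire the frontier call as a task, keep the rest of the turn moving, and join only when the result is needed. A sketch with a stand-in `call_frontier_api` (not a real SDK call):

```python
import asyncio


async def call_frontier_api(query: str) -> str:
    """Stand-in for a frontier API call; assumed name, not a real SDK."""
    await asyncio.sleep(0.01)  # simulate a network round trip
    return f"frontier answer for: {query}"


async def agent_turn(query: str) -> dict:
    # Fire the frontier call without blocking the rest of the turn.
    frontier_task = asyncio.create_task(call_frontier_api(query))

    # ...local SLM work, tool calls, etc. proceed here in parallel...
    local_summary = f"slm draft for: {query}"

    # Join only when the frontier result is actually needed.
    frontier_answer = await frontier_task
    return {"draft": local_summary, "final": frontier_answer}
```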

Code implementation

The router

The following implementation uses a lightweight sentence-transformer for embeddings and a nearest-centroid classifier trained on labeled exemplars. This is intentionally minimal—production systems should add calibration, confidence thresholds per tier, and A/B logging.

import numpy as np
from dataclasses import dataclass, field
from enum import Enum
from sentence_transformers import SentenceTransformer
 
 
class Tier(Enum):
    REJECT = "reject"
    CACHE = "cache"
    SLM = "slm"
    FRONTIER = "frontier"
 
 
@dataclass
class RouteDecision:
    tier: Tier
    confidence: float
    latency_ms: float
    metadata: dict = field(default_factory=dict)
 
 
class SemanticRouter:
    """Routes queries to inference tiers using embedding similarity
    to labeled exemplar centroids. No fine-tuning required—just
    representative examples per tier."""
 
    def __init__(
        self,
        model_name: str = "all-MiniLM-L6-v2",
        confidence_floor: float = 0.35,
        cache_similarity_threshold: float = 0.96,
    ):
        self.encoder = SentenceTransformer(model_name)
        self.confidence_floor = confidence_floor
        self.cache_threshold = cache_similarity_threshold
        self.centroids: dict[Tier, np.ndarray] = {}
        self.cache_embeddings: np.ndarray | None = None
        self.cache_responses: list[str] = []
 
    def fit(self, exemplars: dict[Tier, list[str]]) -> None:
        """Compute centroid embedding per tier from labeled examples."""
        for tier, texts in exemplars.items():
            embeddings = self.encoder.encode(texts, normalize_embeddings=True)
            self.centroids[tier] = np.mean(embeddings, axis=0)
            self.centroids[tier] /= np.linalg.norm(self.centroids[tier])
 
    def route(self, query: str) -> RouteDecision:
        import time
        t0 = time.perf_counter()
 
        q_emb = self.encoder.encode([query], normalize_embeddings=True)[0]
 
        if self.cache_embeddings is not None and len(self.cache_embeddings) > 0:
            sims = self.cache_embeddings @ q_emb
            best_idx = int(np.argmax(sims))
            if sims[best_idx] >= self.cache_threshold:
                return RouteDecision(
                    tier=Tier.CACHE,
                    confidence=float(sims[best_idx]),
                    latency_ms=(time.perf_counter() - t0) * 1000,
                    metadata={"cached_response": self.cache_responses[best_idx]},
                )
 
        scores = {
            tier: float(centroid @ q_emb)
            for tier, centroid in self.centroids.items()
        }
        best_tier = max(scores, key=scores.get)
        best_score = scores[best_tier]
 
        if best_score < self.confidence_floor:
            best_tier = Tier.REJECT
 
        return RouteDecision(
            tier=best_tier,
            confidence=best_score,
            latency_ms=(time.perf_counter() - t0) * 1000,
            metadata={"all_scores": {t.value: s for t, s in scores.items()}},
        )
 
    def update_cache(self, query: str, response: str) -> None:
        emb = self.encoder.encode([query], normalize_embeddings=True)
        if self.cache_embeddings is None:
            self.cache_embeddings = emb
        else:
            self.cache_embeddings = np.vstack([self.cache_embeddings, emb])
        self.cache_responses.append(response)

Usage: define exemplars per tier from your actual query logs, fit the router, and call route() on every incoming request. The entire routing decision—embedding + centroid comparison + cache check—runs in 1–3ms on a single CPU core with the 22M-parameter all-MiniLM-L6-v2 encoder.

router = SemanticRouter()
 
router.fit({
    Tier.SLM: [
        "What is the return policy?",
        "Reset my password",
        "Show me my last invoice",
        "What are your business hours?",
        "Extract the company name from this email",
    ],
    Tier.FRONTIER: [
        "Compare our Q3 churn rate across segments and suggest retention strategies",
        "Analyze the root cause of the latency spike across these three services",
        "Draft a technical proposal for migrating our auth system to OIDC",
        "Given these five vendor quotes, recommend the best option with trade-offs",
    ],
})
 
decision = router.route("What time do you close on Fridays?")
# RouteDecision(tier=Tier.SLM, confidence=0.71, latency_ms=1.8, ...)

Neo4j-backed domain routing

For B2B systems where queries span multiple product domains, a flat embedding classifier is insufficient. The router needs domain awareness: which product, which customer segment, which compliance context applies to this query. This is where a knowledge graph earns its place in the routing layer—not in the generation layer (that comes later), but in the decision layer.

If you followed the GraphRAG pipeline walkthrough, the pattern is familiar: query the graph for structural metadata, then use it to condition the routing decision.

from neo4j import GraphDatabase
 
 
class DomainRouter:
    """Augments semantic routing with domain metadata from a
    Neo4j knowledge graph. The graph stores product domains,
    their complexity tiers, and known query patterns."""
 
    def __init__(self, neo4j_uri: str, neo4j_auth: tuple[str, str]):
        self._driver = GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
 
    def close(self) -> None:
        self._driver.close()
 
    def get_domain_context(self, query_embedding: list[float], top_k: int = 3) -> dict:
        """Find the closest domain nodes using Neo4j's vector index,
        then traverse to get complexity metadata and routing hints."""
        cypher = """
        CALL db.index.vector.queryNodes('domain_embeddings', $top_k, $embedding)
        YIELD node, score
        MATCH (node)-[:HAS_COMPLEXITY]->(c:ComplexityTier)
        OPTIONAL MATCH (node)-[:REQUIRES]->(cap:Capability)
        RETURN node.name AS domain,
               node.description AS description,
               c.tier AS complexity_tier,
               c.recommended_model AS recommended_model,
               collect(DISTINCT cap.name) AS required_capabilities,
               score
        ORDER BY score DESC
        """
        with self._driver.session() as session:
            records = session.run(
                cypher, embedding=query_embedding, top_k=top_k
            )
            results = []
            for r in records:
                results.append({
                    "domain": r["domain"],
                    "complexity_tier": r["complexity_tier"],
                    "recommended_model": r["recommended_model"],
                    "required_capabilities": r["required_capabilities"],
                    "similarity": r["score"],
                })
            return results[0] if results else {}
 
    def override_tier(self, semantic_decision: "RouteDecision", domain_ctx: dict) -> "RouteDecision":
        """Let graph metadata override the semantic router when domain
        knowledge provides a stronger signal. E.g., a query about a
        compliance-heavy domain escalates to frontier regardless of
        surface-level simplicity."""
        if not domain_ctx:
            return semantic_decision
 
        graph_tier = domain_ctx.get("complexity_tier", "slm")
        capabilities = domain_ctx.get("required_capabilities", [])
 
        if "multi_hop_reasoning" in capabilities or "regulatory_compliance" in capabilities:
            semantic_decision.tier = Tier.FRONTIER
            semantic_decision.metadata["override_reason"] = "domain_requires_frontier"
            semantic_decision.metadata["domain"] = domain_ctx["domain"]
 
        elif graph_tier == "slm" and semantic_decision.tier == Tier.FRONTIER:
            if semantic_decision.confidence < 0.55:
                semantic_decision.tier = Tier.SLM
                semantic_decision.metadata["override_reason"] = "graph_demoted_to_slm"
 
        return semantic_decision

The graph schema here is intentionally opinionated: Domain nodes carry vector embeddings (indexed via Neo4j's native vector index), linked to ComplexityTier nodes that encode the recommended inference tier and Capability nodes that flag when a domain requires specific reasoning patterns. This structure lets you encode institutional knowledge about your product into the routing layer—knowledge that a pure embedding classifier cannot learn from query text alone.
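For completeness, a sketch of the schema setup that the Cypher query above assumes. The index name, labels, and relationship types mirror the query; the 384 dimensions match the all-MiniLM-L6-v2 encoder, and the sample property values are illustrative:

```cypher
// Vector index over Domain embeddings (Neo4j 5+ syntax).
CREATE VECTOR INDEX domain_embeddings IF NOT EXISTS
FOR (d:Domain) ON (d.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 384,
  `vector.similarity_function`: 'cosine'
}};

// One illustrative domain wired to its complexity tier.
CREATE (d:Domain {name: 'billing', description: 'Invoices, payments, refunds'})
CREATE (c:ComplexityTier {tier: 'slm', recommended_model: 'qwen2.5-3b-q4'})
CREATE (d)-[:HAS_COMPLEXITY]->(c);
```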

Composing the pipeline

def handle_query(query: str, router: SemanticRouter, domain_router: DomainRouter) -> dict:
    """End-to-end pipeline: semantic route, graph override, execute.
    call_local_slm and call_frontier_api are backend-specific stubs."""
    decision = router.route(query)
 
    if decision.tier == Tier.CACHE:
        return {"response": decision.metadata["cached_response"], "tier": "cache"}
 
    if decision.tier == Tier.REJECT:
        return {"response": "This question is outside my scope.", "tier": "reject"}
 
    q_emb = router.encoder.encode([query], normalize_embeddings=True)[0]
    domain_ctx = domain_router.get_domain_context(q_emb.tolist())
    decision = domain_router.override_tier(decision, domain_ctx)
 
    if decision.tier == Tier.SLM:
        response = call_local_slm(query, model="qwen2.5-3b-q4")
        router.update_cache(query, response)
        return {"response": response, "tier": "slm", "domain": domain_ctx}
 
    response = call_frontier_api(query, model="gpt-5")
    router.update_cache(query, response)
    return {"response": response, "tier": "frontier", "domain": domain_ctx}

The economics

Let us model a concrete B2B scenario: a customer support copilot handling 500,000 queries per day, average 200 input tokens and 800 output tokens per query.

Baseline: monolithic frontier API

| Component             | Value          |
|-----------------------|----------------|
| Daily queries         | 500,000        |
| Avg input tokens      | 200            |
| Avg output tokens     | 800            |
| Frontier input price  | $10 / M tokens |
| Frontier output price | $30 / M tokens |
| Daily input cost      | $1,000         |
| Daily output cost     | $12,000        |
| Daily total           | $13,000        |
| Annual total          | $4,745,000     |

Compound architecture with semantic routing

Assume the router achieves the following traffic distribution (conservative, based on observed B2B query patterns):

| Tier                 | Traffic share | Per-call cost | Daily calls | Daily cost |
|----------------------|---------------|---------------|-------------|------------|
| Tier 1: cache/reject | 20%           | ~$0           | 100,000     | $0         |
| Tier 2: local SLM    | 72%           | ~$0.0008*     | 360,000     | $288       |
| Tier 3: frontier API | 8%            | ~$0.026       | 40,000      | $1,040     |
| Router overhead      | 100%          | ~$0.00002     | 500,000     | $10        |
| Daily total          |               |               |             | $1,338     |
| Annual total         |               |               |             | $488,370   |

*Tier 2 cost estimated as amortized hardware (a 2-node CPU cluster at ~$800/mo serving Qwen2.5-3B-Q4 via vLLM or llama.cpp) divided by daily throughput. No per-token API fee.

ROI summary

| Metric                    | Value                                    |
|---------------------------|------------------------------------------|
| Annual savings            | $4,256,630                               |
| Cost reduction            | 89.7%                                    |
| Infrastructure investment | ~$20,000 (2× CPU nodes + Neo4j instance) |
| Engineering effort        | 2–4 weeks for a senior ML engineer       |
| Payback period            | Under 2 days of operation                |

The numbers shift with your specific query distribution, but the structural argument holds: if more than 60% of your queries are handleable by a 3B-parameter model (and in most B2B support/copilot workloads, the number is closer to 75%), the compound architecture dominates on pure economics before you account for the latency and privacy benefits of local execution.
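The sensitivity to your own traffic split is easy to model. A minimal cost function using the per-call figures from the tables above (the default constants are this post's estimates, not universal prices):

```python
def compound_daily_cost(
    daily_queries: int,
    cache_share: float,
    slm_share: float,
    frontier_share: float,
    slm_cost_per_call: float = 0.0008,      # amortized Tier 2 hardware, per the table
    frontier_cost_per_call: float = 0.026,  # 200 in @ $10/M + 800 out @ $30/M
    router_cost_per_call: float = 0.00002,
) -> float:
    """Daily cost of the three-tier stack for a given traffic split.
    Cache/reject traffic is assumed to cost ~$0 beyond the router."""
    assert abs(cache_share + slm_share + frontier_share - 1.0) < 1e-9
    return daily_queries * (
        slm_share * slm_cost_per_call
        + frontier_share * frontier_cost_per_call
        + router_cost_per_call  # the router runs on every query
    )


baseline = 500_000 * (200 * 10 + 800 * 30) / 1_000_000  # monolithic: $13,000/day
compound = compound_daily_cost(500_000, 0.20, 0.72, 0.08)
print(round(compound, 2), round(1 - compound / baseline, 3))  # → 1338.0 0.897
```

Shifting the split is then a one-liner: at 15% frontier traffic instead of 8%, the reduction drops but the architecture still wins by a wide margin.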

Hardware constraints worth noting

  • Memory bandwidth is the bottleneck for Tier 2, not FLOPS. A Qwen2.5-3B in INT4 fits in ~2GB of RAM, but decode throughput is gated by how fast you can stream weights through the memory hierarchy: at ~90 GB/s theoretical (DDR5-5600 dual-channel) and ~2GB of weights, batch-1 decode tops out near 45 tokens/sec, with ~30–35 tokens/sec realistic at typical bandwidth efficiency. For higher throughput, batch requests or add nodes rather than buying a bigger GPU.

  • The router's embedding model is small enough to serve from CPU. The all-MiniLM-L6-v2 checkpoint is ~90MB; on a modern server CPU it streams comfortably from DRAM, keeping inference at 1–2ms per query. If you scale to a larger encoder (e.g., the 110M-parameter BGE base model), expect 5–8ms routing latency and plan accordingly.

  • Neo4j vector index performance scales with the number of domain nodes. For most B2B deployments (hundreds to low thousands of domain nodes), query latency is sub-10ms. If your graph grows to millions of nodes, consider a dedicated vector store for the routing layer and keep Neo4j for the structural traversals.
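The bandwidth bound in the first bullet is a one-line roofline estimate: every generated token streams the full weight set through memory, so throughput is bounded by bandwidth divided by weight bytes. The weight size and efficiency figures below are rough assumptions:

```python
def decode_tokens_per_sec(
    weight_bytes: float, mem_bandwidth_gbs: float, efficiency: float = 0.7
) -> float:
    """Roofline upper bound on batch-1 decode throughput: each token
    requires streaming the full weight set from memory. `efficiency`
    is an assumed fraction of theoretical bandwidth actually achieved."""
    return mem_bandwidth_gbs * 1e9 * efficiency / weight_bytes


# Qwen2.5-3B at INT4 is roughly 2 GB of weights; dual-channel DDR5-5600
# is ~90 GB/s theoretical.
print(round(decode_tokens_per_sec(2e9, 90), 1))  # → 31.5
```

Batching recovers throughput because the weight stream is amortized across requests, which is why "add concurrent requests" beats "add FLOPS" for this tier.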

Closing

The compound AI architecture is not a research proposal—it is a deployment pattern that any team with access to a knowledge graph, a sentence-transformer, and a locally served SLM can ship in weeks. The engineering is unglamorous: labeled exemplars, centroid classifiers, a Cypher query for domain metadata, and a routing function that fits in 50 lines. The impact is not: you reclaim 90% of your inference budget and gain latency, privacy, and failure isolation properties that no single-vendor API call can provide.

If you are a CTO or VP of Engineering evaluating this pattern for your stack, the decision framework is simple. Audit your query logs. Classify a random sample of 1,000 queries by the tier that could handle them. If more than half land in Tier 2, the compound architecture pays for itself before the quarter ends. The harder question is not whether to build it, but how aggressively to shift traffic off the frontier tier—and that is a calibration problem, not an architecture problem.