Compound AI Architecture: Semantic Routing for 90% Inference Cost Reduction
How to build a three-tier inference stack that routes user queries between local SLMs and frontier APIs using semantic classification, with Python implementation, Neo4j-backed domain routing, and a realistic cost model for B2B workloads.
The death of the monolithic LLM call
Every POST /v1/chat/completions that hits a frontier API carries the same implicit assumption: this query deserves 200B+ parameters of dense attention, cross-datacenter latency, and $15–60 per million output tokens. For the median production workload—FAQ deflection, entity extraction, document classification, slot-filling in a multi-turn agent—that assumption is empirically false and financially ruinous at scale.
A 10,000-seat B2B SaaS product with an AI copilot generating ~50 LLM calls per user per day burns through 500K calls daily. At an average of 800 output tokens per call on a frontier model priced at $30/M output tokens, that is $12,000/day—$4.38M annualized—before you account for input tokens and retries. Most of those calls are asking a 400B-parameter model to do work a 3B checkpoint could handle in 40ms on a $300 mini-PC.
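The back-of-envelope arithmetic is worth writing down; a quick sketch using the illustrative figures above (not vendor quotes):

```python
# Daily cost of the monolithic pattern: 10,000 seats x ~50 calls/day,
# output tokens only (input tokens and retries add to this).
DAILY_CALLS = 500_000
OUTPUT_TOKENS_PER_CALL = 800
OUTPUT_PRICE_PER_M = 30.0  # $ per 1M output tokens

daily_output_tokens = DAILY_CALLS * OUTPUT_TOKENS_PER_CALL
daily_cost = daily_output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
annual_cost = daily_cost * 365

print(f"${daily_cost:,.0f}/day, ${annual_cost:,.0f}/year")
# -> $12,000/day, $4,380,000/year
```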
The monolithic call pattern is not a reasonable default. It is an anti-pattern: it couples latency SLOs to a vendor's batching queue, cost structure to a pricing page you do not control, and data residency to a jurisdiction you cannot audit. The compound alternative is straightforward: classify, then route.
Enter semantic routing
Semantic routing is the practice of mapping an incoming query to an inference tier before execution, using a lightweight classifier that operates on the query's embedding representation. The router's job is not to understand the query—it is to estimate how hard the query is along axes that map to your tiered backend:
- Lexical/factual retrieval: the answer exists verbatim in your knowledge base. Cache or SLM territory.
- Structured extraction: the task is constrained by a schema (JSON, tool call, slot fill). SLM with guided decoding.
- Multi-hop reasoning: the query requires chaining facts, comparing entities, or synthesizing across documents. Frontier territory.
- Out-of-domain / ambiguous: the query falls outside your system's scope. Reject or escalate with explicit UX.
The key insight is that the embedding space of a small model already encodes enough signal to discriminate these classes. You do not need a frontier model to decide whether to call a frontier model. A 384-dimensional embedding from a 30M-parameter encoder, plus a trained linear head or a nearest-centroid classifier, makes the routing decision in under 2ms on CPU.
Why embeddings, not keyword rules
Rule-based routers ("if the query contains 'compare' or 'analyze', route to GPT-5") are brittle, high-maintenance, and blind to paraphrase. Embedding-based routing captures semantic similarity to labeled exemplars: you define clusters of queries per tier, embed them, and at inference time, classify the new query by proximity. This generalizes across phrasing, survives typos, and lets you retrain the boundary by adding a few dozen labeled examples—no regex archaeology required.
The architecture: three tiers
User Query → Embedding Model + Router Head (30M params, CPU, under 2ms) → route to tier:
| | Tier 1: Reject / Cache | Tier 2: Local SLM | Tier 3: Frontier API |
|---|---|---|---|
| Latency | under 1ms | 20–80ms | 500–3000ms |
| Cost per call | ~$0 | ~$0.001 | ~$0.02–0.06 |
| Traffic share | ~20% | ~72% | ~8% |
| Runs on | In-memory cache | On-prem CPU/GPU | Vendor API |
Tier 1: reject or cache hit
Before any model runs, check two things:
- Scope guard. If the router's confidence across all in-scope clusters falls below a threshold, the query is out-of-domain. Return a canned response or escalate to a human. This prevents both wasted compute and hallucinated answers on topics your system was never designed to handle.
- Semantic cache. Hash the query embedding (locality-sensitive hashing or quantized vector lookup) against a cache of recent query–response pairs. If the cosine similarity to a cached embedding exceeds a threshold (typically 0.95–0.98 depending on your tolerance), return the cached response. For high-volume B2B workloads with repetitive internal queries, cache hit rates of 15–30% are common and eliminate inference entirely.
Cost: effectively zero. Latency: sub-millisecond after embedding.
Tier 2: local SLM execution
Queries classified as structurally simple—FAQ answers, entity extraction, classification, reformulation, tool-call argument filling—route to a locally hosted small language model.
The choice of checkpoint matters. As I covered in a previous analysis of the SLM landscape, models in the 1B–8B range (Qwen2.5, Phi-4-mini, Llama-edge variants) achieve production-grade quality on constrained tasks when distilled from frontier teachers and served with schema-guided decoding. For latency-critical paths where even INT4 GPU inference is overkill, ternary 1.58-bit models served via BitNet.cpp push decode to pure CPU SIMD with sub-50ms first-token latency on commodity hardware.
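Schema-guided decoding in practice means attaching a JSON Schema to the request so the decoder can only emit conforming tokens. A sketch of the request payload, assuming a vLLM-style OpenAI-compatible server; the `guided_json` field and the model name are server-specific assumptions, not a universal API:

```python
import json

# Hypothetical schema for an invoice-extraction task routed to Tier 2.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "company": {"type": "string"},
        "amount": {"type": "number"},
        "due_date": {"type": "string"},
    },
    "required": ["company", "amount"],
}

def build_extraction_request(text: str, schema: dict) -> dict:
    """Build the chat-completions payload. The schema constrains decoding
    so the 3B model cannot emit malformed JSON."""
    return {
        "model": "qwen2.5-3b-q4",  # hypothetical local checkpoint name
        "messages": [{"role": "user",
                      "content": f"Extract the invoice fields:\n{text}"}],
        "guided_json": schema,      # guided-decoding knob; name varies by server
        "temperature": 0.0,
    }

payload = build_extraction_request("ACME Corp owes $1,200 by 2025-07-01",
                                   INVOICE_SCHEMA)
print(json.dumps(payload["guided_json"]["required"]))
```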
Hardware budget for Tier 2 is modest: a single node with 32GB RAM and a mid-range CPU (or an RTX 4060-class GPU for INT4 serving) handles hundreds of concurrent SLM requests. The critical constraint is memory bandwidth, not FLOPS—the same roofline story that makes SLMs viable in the first place.
Tier 3: frontier API escalation
Only queries that the router classifies as requiring multi-hop reasoning, long-range synthesis, or creative generation reach the frontier API. The design goal is that this tier handles ≤10% of total traffic. Every escalation should be logged with the router's confidence score and the downstream task outcome, feeding a flywheel that tightens the routing boundary over time.
Tier 3 calls should be async by default in user-facing flows. If the user is waiting, show a progress indicator and stream partial results. If the call is part of an agent loop, decouple it from the synchronous turn so the rest of the pipeline is not blocked on a 2-second API round trip.
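A minimal asyncio sketch of that decoupling, where `call_frontier_async` is a stand-in for your real async API client:

```python
import asyncio

async def call_frontier_async(query: str) -> str:
    # Stands in for a ~2-second frontier API round trip.
    await asyncio.sleep(0.01)
    return f"frontier answer to: {query}"

async def agent_turn(query: str) -> dict:
    # Kick off the slow escalation without blocking the rest of the turn.
    frontier_task = asyncio.create_task(call_frontier_async(query))

    # ...fast local work (SLM calls, retrieval, formatting) proceeds here...
    partial = {"status": "thinking", "query": query}

    # Join only at the point the frontier answer is actually needed.
    partial["answer"] = await frontier_task
    partial["status"] = "done"
    return partial

result = asyncio.run(agent_turn("compare Q3 churn across segments"))
```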
Code implementation
The router
The following implementation uses a lightweight sentence-transformer for embeddings and a nearest-centroid classifier trained on labeled exemplars. This is intentionally minimal—production systems should add calibration, confidence thresholds per tier, and A/B logging.
```python
import time

import numpy as np
from dataclasses import dataclass, field
from enum import Enum
from sentence_transformers import SentenceTransformer


class Tier(Enum):
    REJECT = "reject"
    CACHE = "cache"
    SLM = "slm"
    FRONTIER = "frontier"


@dataclass
class RouteDecision:
    tier: Tier
    confidence: float
    latency_ms: float
    metadata: dict = field(default_factory=dict)


class SemanticRouter:
    """Routes queries to inference tiers using embedding similarity
    to labeled exemplar centroids. No fine-tuning required—just
    representative examples per tier."""

    def __init__(
        self,
        model_name: str = "all-MiniLM-L6-v2",
        confidence_floor: float = 0.35,
        cache_similarity_threshold: float = 0.96,
    ):
        self.encoder = SentenceTransformer(model_name)
        self.confidence_floor = confidence_floor
        self.cache_threshold = cache_similarity_threshold
        self.centroids: dict[Tier, np.ndarray] = {}
        self.cache_embeddings: np.ndarray | None = None
        self.cache_responses: list[str] = []

    def fit(self, exemplars: dict[Tier, list[str]]) -> None:
        """Compute centroid embedding per tier from labeled examples."""
        for tier, texts in exemplars.items():
            embeddings = self.encoder.encode(texts, normalize_embeddings=True)
            self.centroids[tier] = np.mean(embeddings, axis=0)
            self.centroids[tier] /= np.linalg.norm(self.centroids[tier])

    def route(self, query: str) -> RouteDecision:
        t0 = time.perf_counter()
        q_emb = self.encoder.encode([query], normalize_embeddings=True)[0]

        # Semantic cache: return a stored response on a near-duplicate query.
        if self.cache_embeddings is not None and len(self.cache_embeddings) > 0:
            sims = self.cache_embeddings @ q_emb
            best_idx = int(np.argmax(sims))
            if sims[best_idx] >= self.cache_threshold:
                return RouteDecision(
                    tier=Tier.CACHE,
                    confidence=float(sims[best_idx]),
                    latency_ms=(time.perf_counter() - t0) * 1000,
                    metadata={"cached_response": self.cache_responses[best_idx]},
                )

        # Nearest-centroid classification across tiers.
        scores = {
            tier: float(centroid @ q_emb)
            for tier, centroid in self.centroids.items()
        }
        best_tier = max(scores, key=scores.get)
        best_score = scores[best_tier]
        if best_score < self.confidence_floor:
            best_tier = Tier.REJECT
        return RouteDecision(
            tier=best_tier,
            confidence=best_score,
            latency_ms=(time.perf_counter() - t0) * 1000,
            metadata={"all_scores": {t.value: s for t, s in scores.items()}},
        )

    def update_cache(self, query: str, response: str) -> None:
        emb = self.encoder.encode([query], normalize_embeddings=True)
        if self.cache_embeddings is None:
            self.cache_embeddings = emb
        else:
            self.cache_embeddings = np.vstack([self.cache_embeddings, emb])
        self.cache_responses.append(response)
```

Usage: define exemplars per tier from your actual query logs, fit the router, and call route() on every incoming request. The entire routing decision—embedding + centroid comparison + cache check—runs in 1–3ms on a single CPU core with the 22M-parameter all-MiniLM-L6-v2 encoder.
```python
router = SemanticRouter()
router.fit({
    Tier.SLM: [
        "What is the return policy?",
        "Reset my password",
        "Show me my last invoice",
        "What are your business hours?",
        "Extract the company name from this email",
    ],
    Tier.FRONTIER: [
        "Compare our Q3 churn rate across segments and suggest retention strategies",
        "Analyze the root cause of the latency spike across these three services",
        "Draft a technical proposal for migrating our auth system to OIDC",
        "Given these five vendor quotes, recommend the best option with trade-offs",
    ],
})

decision = router.route("What time do you close on Fridays?")
# RouteDecision(tier=Tier.SLM, confidence=0.71, latency_ms=1.8, ...)
```

Neo4j-backed domain routing
For B2B systems where queries span multiple product domains, a flat embedding classifier is insufficient. The router needs domain awareness: which product, which customer segment, which compliance context applies to this query. This is where a knowledge graph earns its place in the routing layer—not in the generation layer (that comes later), but in the decision layer.
If you followed the GraphRAG pipeline walkthrough, the pattern is familiar: query the graph for structural metadata, then use it to condition the routing decision.
```python
from neo4j import GraphDatabase


class DomainRouter:
    """Augments semantic routing with domain metadata from a
    Neo4j knowledge graph. The graph stores product domains,
    their complexity tiers, and known query patterns."""

    def __init__(self, neo4j_uri: str, neo4j_auth: tuple[str, str]):
        self._driver = GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)

    def close(self) -> None:
        self._driver.close()

    def get_domain_context(self, query_embedding: list[float], top_k: int = 3) -> dict:
        """Find the closest domain nodes using Neo4j's vector index,
        then traverse to get complexity metadata and routing hints."""
        cypher = """
        CALL db.index.vector.queryNodes('domain_embeddings', $top_k, $embedding)
        YIELD node, score
        MATCH (node)-[:HAS_COMPLEXITY]->(c:ComplexityTier)
        OPTIONAL MATCH (node)-[:REQUIRES]->(cap:Capability)
        RETURN node.name AS domain,
               node.description AS description,
               c.tier AS complexity_tier,
               c.recommended_model AS recommended_model,
               collect(DISTINCT cap.name) AS required_capabilities,
               score
        ORDER BY score DESC
        """
        with self._driver.session() as session:
            records = session.run(cypher, embedding=query_embedding, top_k=top_k)
            results = []
            for r in records:
                results.append({
                    "domain": r["domain"],
                    "complexity_tier": r["complexity_tier"],
                    "recommended_model": r["recommended_model"],
                    "required_capabilities": r["required_capabilities"],
                    "similarity": r["score"],
                })
        return results[0] if results else {}

    def override_tier(self, semantic_decision: "RouteDecision", domain_ctx: dict) -> "RouteDecision":
        """Let graph metadata override the semantic router when domain
        knowledge provides a stronger signal. E.g., a query about a
        compliance-heavy domain escalates to frontier regardless of
        surface-level simplicity."""
        if not domain_ctx:
            return semantic_decision
        graph_tier = domain_ctx.get("complexity_tier", "slm")
        capabilities = domain_ctx.get("required_capabilities", [])
        if "multi_hop_reasoning" in capabilities or "regulatory_compliance" in capabilities:
            semantic_decision.tier = Tier.FRONTIER
            semantic_decision.metadata["override_reason"] = "domain_requires_frontier"
            semantic_decision.metadata["domain"] = domain_ctx["domain"]
        elif graph_tier == "slm" and semantic_decision.tier == Tier.FRONTIER:
            # Demote only low-confidence frontier classifications.
            if semantic_decision.confidence < 0.55:
                semantic_decision.tier = Tier.SLM
                semantic_decision.metadata["override_reason"] = "graph_demoted_to_slm"
        return semantic_decision
```

The graph schema here is intentionally opinionated: Domain nodes carry vector embeddings (indexed via Neo4j's native vector index), linked to ComplexityTier nodes that encode the recommended inference tier and Capability nodes that flag when a domain requires specific reasoning patterns. This structure lets you encode institutional knowledge about your product into the routing layer—knowledge that a pure embedding classifier cannot learn from query text alone.
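One way to bootstrap that schema is a handful of Cypher statements run through the Python driver. The index name and labels match the routing query above; the node contents and the CREATE VECTOR INDEX options (Neo4j 5.x syntax) are illustrative assumptions:

```python
# Illustrative bootstrap for the routing graph. The index name and labels
# match the Cypher in DomainRouter; node contents are placeholder examples.
SETUP_STATEMENTS = [
    # 384-d cosine vector index over Domain embeddings (Neo4j 5.x syntax).
    """CREATE VECTOR INDEX domain_embeddings IF NOT EXISTS
       FOR (d:Domain) ON (d.embedding)
       OPTIONS {indexConfig: {
         `vector.dimensions`: 384,
         `vector.similarity_function`: 'cosine'
       }}""",
    # A compliance-heavy domain that should always escalate to Tier 3.
    """MERGE (d:Domain {name: 'billing_disputes'})
       MERGE (c:ComplexityTier {tier: 'frontier', recommended_model: 'gpt-5'})
       MERGE (cap:Capability {name: 'regulatory_compliance'})
       MERGE (d)-[:HAS_COMPLEXITY]->(c)
       MERGE (d)-[:REQUIRES]->(cap)""",
]

# Run each statement once at deploy time, e.g. via the driver's
# execute_query() or through cypher-shell.
```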
Composing the pipeline
```python
def handle_query(query: str, router: SemanticRouter, domain_router: DomainRouter) -> dict:
    decision = router.route(query)

    if decision.tier == Tier.CACHE:
        return {"response": decision.metadata["cached_response"], "tier": "cache"}
    if decision.tier == Tier.REJECT:
        return {"response": "This question is outside my scope.", "tier": "reject"}

    # Let graph metadata refine the semantic decision before execution.
    q_emb = router.encoder.encode([query], normalize_embeddings=True)[0]
    domain_ctx = domain_router.get_domain_context(q_emb.tolist())
    decision = domain_router.override_tier(decision, domain_ctx)

    # call_local_slm / call_frontier_api are deployment-specific stubs.
    if decision.tier == Tier.SLM:
        response = call_local_slm(query, model="qwen2.5-3b-q4")
        router.update_cache(query, response)
        return {"response": response, "tier": "slm", "domain": domain_ctx}

    response = call_frontier_api(query, model="gpt-5")
    router.update_cache(query, response)
    return {"response": response, "tier": "frontier", "domain": domain_ctx}
```

The economics
Let us model a concrete B2B scenario: a customer support copilot handling 500,000 queries per day, average 200 input tokens and 800 output tokens per query.
Baseline: monolithic frontier API
| Component | Value |
|---|---|
| Daily queries | 500,000 |
| Avg input tokens | 200 |
| Avg output tokens | 800 |
| Frontier input price | $10 / M tokens |
| Frontier output price | $30 / M tokens |
| Daily input cost | $1,000 |
| Daily output cost | $12,000 |
| Daily total | $13,000 |
| Annual total | $4,745,000 |
Compound architecture with semantic routing
Assume the router achieves the following traffic distribution (conservative, based on observed B2B query patterns):
| Tier | Traffic share | Per-call cost | Daily calls | Daily cost |
|---|---|---|---|---|
| Tier 1: cache/reject | 20% | ~$0 | 100,000 | $0 |
| Tier 2: local SLM | 72% | ~$0.0008* | 360,000 | $288 |
| Tier 3: frontier API | 8% | ~$0.026 | 40,000 | $1,040 |
| Router overhead | 100% | ~$0.00002 | 500,000 | $10 |
| Daily total | $1,338 | |||
| Annual total | $488,370 |
*Tier 2 cost estimated as amortized hardware (a 2-node CPU cluster at ~$800/mo serving Qwen2.5-3B-Q4 via vLLM or llama.cpp) divided by daily throughput. No per-token API fee.
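The compound table reduces to a few lines of arithmetic; a sketch that reproduces the totals above:

```python
# Per-tier traffic shares and per-call costs from the table above.
TIERS = {
    "cache_reject": {"share": 0.20, "per_call": 0.0},
    "local_slm":    {"share": 0.72, "per_call": 0.0008},
    "frontier":     {"share": 0.08, "per_call": 0.026},
}
DAILY_QUERIES = 500_000
ROUTER_PER_CALL = 0.00002  # router overhead applies to 100% of traffic

daily = sum(DAILY_QUERIES * t["share"] * t["per_call"] for t in TIERS.values())
daily += DAILY_QUERIES * ROUTER_PER_CALL
annual = daily * 365

baseline_daily = 13_000  # monolithic frontier baseline from the first table
print(f"compound: ${daily:,.0f}/day, ${annual:,.0f}/yr "
      f"({1 - daily / baseline_daily:.1%} below baseline)")
# -> compound: $1,338/day, $488,370/yr (89.7% below baseline)
```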
ROI summary
| Metric | Value |
|---|---|
| Annual savings | $4,256,630 |
| Cost reduction | 89.7% |
| Infrastructure investment | ~$20,000 (2× CPU nodes + Neo4j instance) |
| Engineering effort | 2–4 weeks for a senior ML engineer |
| Payback period | Under 2 days of operation |
The numbers shift with your specific query distribution, but the structural argument holds: if more than 60% of your queries are handleable by a 3B-parameter model (and in most B2B support/copilot workloads, the number is closer to 75%), the compound architecture dominates on pure economics before you account for the latency and privacy benefits of local execution.
Hardware constraints worth noting
- Memory bandwidth is the bottleneck for Tier 2, not FLOPS. A Qwen2.5-3B in INT4 fits in ~2GB of RAM, but decode throughput is gated by how fast you can stream weights through the memory hierarchy. On DDR5-5600 (~90 GB/s theoretical), expect ~60–80 tokens/sec for batch-1 decode. For higher throughput, batch requests or add nodes—do not buy a bigger GPU.
- The router's embedding model must stay hot in L3 cache. The all-MiniLM-L6-v2 checkpoint is ~90MB. On a modern server CPU with 32MB+ L3, the working set fits comfortably, keeping inference at 1–2ms. If you scale to a larger encoder (e.g., 110M-parameter BGE), expect 5–8ms routing latency and plan accordingly.
- Neo4j vector index performance scales with the number of domain nodes. For most B2B deployments (hundreds to low thousands of domain nodes), query latency is sub-10ms. If your graph grows to millions of nodes, consider a dedicated vector store for the routing layer and keep Neo4j for the structural traversals.
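The first point is plain roofline arithmetic; a sketch using the figures above (theoretical bandwidth, INT4 weight bytes only, ignoring KV-cache traffic):

```python
# Batch-1 decode streams the full weight set per generated token,
# so tokens/sec is bounded by bandwidth / bytes-per-token.
MEM_BW_GBS = 90.0   # DDR5-5600, theoretical
WEIGHTS_GB = 1.5    # 3B params at INT4 (0.5 bytes/param)

peak_tokens_per_sec = MEM_BW_GBS / WEIGHTS_GB
print(f"~{peak_tokens_per_sec:.0f} tokens/sec ceiling at theoretical bandwidth")
# -> ~60 tokens/sec ceiling at theoretical bandwidth
```

This back-of-envelope lands at the low end of the quoted 60–80 range; real numbers shift with achievable bandwidth, quantization overhead per token, and KV-cache reads, but the shape of the constraint does not.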
Closing
The compound AI architecture is not a research proposal—it is a deployment pattern that any team with access to a knowledge graph, a sentence-transformer, and a locally served SLM can ship in weeks. The engineering is unglamorous: labeled exemplars, centroid classifiers, a Cypher query for domain metadata, and a routing function that fits in 50 lines. The impact is not: you reclaim 90% of your inference budget and gain latency, privacy, and failure isolation properties that no single-vendor API call can provide.
If you are a CTO or VP of Engineering evaluating this pattern for your stack, the decision framework is simple. Audit your query logs. Classify a random sample of 1,000 queries by the tier that could handle them. If more than half land in Tier 2, the compound architecture pays for itself before the quarter ends. The harder question is not whether to build it, but how aggressively to shift traffic off the frontier tier—and that is a calibration problem, not an architecture problem.
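That audit is a short script once a sample is hand-labeled; a sketch with illustrative labels, where each query is tagged with the cheapest tier that could handle it:

```python
from collections import Counter

# (query, cheapest_capable_tier) pairs from a manual audit of production
# logs. These five labels are illustrative; use a random sample of ~1,000.
labeled_sample = [
    ("What is the return policy?", "slm"),
    ("Reset my password", "slm"),
    ("Show me my last invoice", "slm"),
    ("Compare churn across segments and suggest retention strategies", "frontier"),
    ("What are your business hours?", "cache"),
]

shares = Counter(tier for _, tier in labeled_sample)
n = len(labeled_sample)
slm_or_cheaper = (shares["slm"] + shares["cache"]) / n
print(f"{slm_or_cheaper:.0%} of sampled queries need Tier 2 or less")
```

If that fraction clears 50–60%, the cost model above does the rest of the argument for you.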