
Technical Blog

Context Engineering Is the New ETL: Why Prompt Engineering Died in 2026

Tags: context-engineering, llm, agents, rag, architecture, enterprise, strategy

Prompt engineering was a 2023 parlor trick. Context engineering is the discipline winning enterprise AI teams are building in 2026 — a pipeline problem that looks more like ETL than copywriting. Here is why it matters, how it works, and where the ROI lives.

The prompt engineering era is over

In 2023 a LinkedIn headline that said "Prompt Engineer" could command a six-figure salary. In 2026 it is a punchline. Not because prompts do not matter — they still do — but because every serious production problem that used to be framed as "write a better prompt" has been reframed, correctly, as "feed the model a better context".

The shift is not cosmetic. It changes who owns the problem (data engineers, not copywriters), what tooling it requires (retrieval, ranking, compression, schema validation — not a Notion doc of clever phrasings), and where the failure modes live (pipelines, indexes, and freshness SLAs — not the last sentence of a system prompt).

The industry has converged on a name for this discipline: context engineering. It is the art and engineering of assembling — at runtime, under latency and token budgets — the minimal, correct, trusted set of tokens that a model needs to do one specific job. It is closer in spirit to ETL than to creative writing. And for B2B enterprises trying to move agents from demo to production, it is now the single highest-leverage competency a data or platform team can build.

This post explains why prompt engineering collapsed under production load, what context engineering actually is, and what a defensible context stack looks like in 2026.

Why prompt engineering collapsed

Prompt engineering made sense when the dominant deployment shape was a single human typing into a chat box with zero grounding. In that world, the prompt was the system. You tweaked wording, added few-shot examples, appended "let's think step by step," and watched accuracy move a few points.

Three things killed that paradigm in production.

1. The unit of work became the agent turn, not the chat turn. In a modern agent, a single user intent can expand to dozens of LLM calls: planner, retriever, tool router, tool caller, reflector, critic, summarizer, writer. Each call consumes a context. Hand-authoring each of those contexts is not engineering — it is craft. Craft does not scale across 50 tools and 12 tenants.

2. Context windows stopped being a constraint and became a liability. When you have 200K, 1M, or 2M tokens available, the temptation is to stuff everything in and pray. This works in demos and fails in production. Long contexts are measurably worse at needle-in-a-haystack recall past ~40–60% utilization, are dramatically more expensive, introduce unpredictable latency due to prefill time, and in agent loops accumulate "context rot" — stale tool outputs, dead tangents, contradicted instructions — that actively degrade future decisions.

3. Reliability became a compliance problem, not a UX problem. A banking copilot that hallucinates once in a demo is a funny anecdote. The same copilot hallucinating once in 10,000 customer interactions is a regulatory incident. You cannot audit a prompt. You can audit a pipeline. Compliance officers want lineage: which document, at which version, retrieved by which query, ranked how, injected at which position. That is an ETL artifact, not a clever paragraph.

Net: the high-order term in production quality stopped being "which words are in the prompt" and became "which bytes of ground truth are in the context window right now, and how did they get there".

Prompt engineering vs. context engineering

| Axis | Prompt Engineering | Context Engineering |
| --- | --- | --- |
| Primary artifact | A string | A pipeline |
| Primary owner | Product / UX | Data / Platform engineering |
| Optimization target | Wording, few-shot selection | Retrieval, ranking, compression, schema adherence |
| Evaluation | Vibe checks, eyeball tests | Offline retrieval metrics, online task success, unit tests on tool JSON |
| Failure mode | Ambiguous wording | Stale / irrelevant / poisoned context |
| Cost structure | One-off author time | Ongoing infra: vector DB, rerankers, caches, observability |
| Compliance posture | Opaque | Auditable lineage per token |
| Scales with | Author skill | Data quality and pipeline rigor |

The category error of 2023–2024 was treating context engineering as "prompt engineering but longer". It is not. It is a systems discipline with its own performance model, its own anti-patterns, and its own ROI curve.

A working definition

Context engineering is the practice of designing and operating the system that, for a given task at a given moment, returns a minimal token sequence $C$ that maximizes the probability of a correct model output while respecting three budgets:

$$\max_{C} \ P(\text{correct} \mid \text{task}, C) \quad \text{s.t.} \quad |C| \le B_{\text{tokens}}, \quad \ell(C) \le B_{\text{latency}}, \quad \$(C) \le B_{\text{cost}}.$$

Three observations fall out of that formulation.

  • More context is not better context. The objective is a conditional probability. Adding tokens that do not raise it is strictly negative once you account for latency and cost. In practice it is also negative before those costs, because irrelevant tokens reduce attention mass on relevant ones.
  • The budget constraints are hard, not soft. Enterprise SLAs are measured in milliseconds and dollars per 1K calls. A 200K-token megaprompt that "works" at $0.30 per call is not a solution — it is an unpriced subsidy from your AI budget line item to your product manager's ego.
  • The objective is per-task, not global. The optimal context for "classify this ticket" has nothing in common with the optimal context for "draft a reply to this ticket". Treating them as the same problem is how you end up paying frontier prices for classification.
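As a rough illustration of the second and third observations, the sketch below treats the three budgets as hard checks keyed by task intent. The `Budget` type, the `BUDGETS` table, and every number in it are hypothetical placeholders, not figures from a real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_tokens: int
    max_latency_ms: float
    max_cost_usd: float


# Hypothetical per-task envelopes: classification gets a far tighter one than drafting.
BUDGETS = {
    "classify": Budget(max_tokens=1_000, max_latency_ms=300, max_cost_usd=0.001),
    "draft_reply": Budget(max_tokens=6_000, max_latency_ms=3_000, max_cost_usd=0.03),
}


def violations(intent: str, tokens: int, latency_ms: float, cost_usd: float) -> list[str]:
    """Return the name of every budget a candidate context breaks (empty list = feasible)."""
    b = BUDGETS[intent]
    broken = []
    if tokens > b.max_tokens:
        broken.append("tokens")
    if latency_ms > b.max_latency_ms:
        broken.append("latency")
    if cost_usd > b.max_cost_usd:
        broken.append("cost")
    return broken


# The same 4,000-token context is fine for drafting and infeasible for classification.
print(violations("draft_reply", 4_000, 200, 0.012))
print(violations("classify", 4_000, 200, 0.012))
```

The point of returning the list of broken budgets rather than a boolean is that the pipeline can log which constraint forced a context to be rejected.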

The four pillars of a production context stack

A context engineering system — the thing that sits between your raw data and your model call — resolves four concerns. Any production stack that skips one of them will leak quality, cost, or both.

1. Select: retrieve the right candidates

The first job is recall: make sure the tokens that would answer the question are in the candidate set, before any ranking or compression happens. In 2026 this is almost never "one vector index" anymore. It is a hybrid:

  • Lexical (BM25 / SPLADE) for exact identifiers, codes, product SKUs, legal citations.
  • Dense embeddings for semantic paraphrase.
  • Graph traversal (GraphRAG) for multi-hop entity questions where the answer is a path, not a document.
  • Structured SQL against your warehouse for anything numeric, aggregated, or time-bounded — which, for most B2B enterprises, is where the actual answers live.

The modality is dictated by the question, not by the tool you bought. The router that decides which retrieval modality to use is itself a semantic routing problem.
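Once several modalities return candidates, their ranked lists have to be merged before reranking. The post does not prescribe a fusion method; one common choice is reciprocal rank fusion (RRF), sketched here under the assumption that each retriever returns an ordered list of document ids. The ids are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is the conventional default, not a tuned value.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["sku-4417", "policy-12", "faq-3"]   # e.g. BM25 hits for an exact SKU
dense = ["policy-12", "faq-9", "sku-4417"]     # e.g. embedding hits for the paraphrase
fused = reciprocal_rank_fusion([lexical, dense])
print(fused)
```

RRF only needs ranks, not comparable scores, which is why it is a convenient default when lexical and dense scores live on different scales.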

2. Rank: keep only what earns its tokens

Recall-oriented retrieval is cheap and fat. A cross-encoder reranker on the top-k candidates is the single highest-ROI component of most enterprise RAG stacks, and it is still routinely omitted. The math is straightforward: if your downstream model costs $0.02 per call and your reranker costs $0.0002 per call, cutting your injected context by 60% is a near-pure win — lower cost, lower latency, higher accuracy because the model's attention is no longer diluted.

Ranking is where "context rot" gets cleaned up in an agent loop. After each tool call, rerank everything currently in the working context against the current subgoal. Tokens that were relevant three steps ago but are not relevant now should be evicted, not carried.
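A minimal sketch of that eviction step follows. The word-overlap scorer is a deliberately crude stand-in for a real cross-encoder, and the `keep_above` threshold is an arbitrary illustrative value; only the shape of the loop is the point.

```python
def overlap_score(subgoal: str, text: str) -> float:
    """Toy relevance: fraction of subgoal words that appear in the text.

    Stand-in for a cross-encoder reranker; swap in a real model in practice.
    """
    goal = set(subgoal.lower().split())
    words = set(text.lower().split())
    return len(goal & words) / len(goal) if goal else 0.0


def evict_stale(working: list[str], subgoal: str, keep_above: float = 0.15) -> list[str]:
    """Re-score every span against the current subgoal and drop what no longer earns its place."""
    scored = [(overlap_score(subgoal, t), t) for t in working]
    kept = [(s, t) for s, t in scored if s >= keep_above]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # most relevant first
    return [t for _, t in kept]


working = [
    "refund policy allows returns within 30 days",
    "weather tool output: sunny in Lyon",
    "order 8812 refund status is pending",
]
print(evict_stale(working, "check refund status for order 8812"))
```

Here the stale weather output is evicted and the order status moves ahead of the general policy text.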

3. Compress: pay for information, not tokens

Compression is where the budget constraints get enforced. The two moves that matter:

  • Structural compression. Collapse retrieved documents into the schema the downstream step actually consumes. If the planner only needs a list of (entity, relation, value) triples, do not hand it three PDFs. Extract once, reuse forever.
  • Semantic compression. A small, cheap model summarizes or distills long-form sources into task-conditioned briefs before the expensive model sees them. This is a textbook SLM application: a 3B–8B model at $0.0005 per call, saving a frontier call at $0.05.

The mental model: every token in the final context should be earning its place. If you cannot articulate why a specific span is there, it is not context; it is pollution.
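Structural compression can be sketched in a few lines. This assumes the retrieved record is already a flat dict and that the downstream step wants pipe-delimited triples; the record, its field names, and the output format are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    entity: str
    relation: str
    value: str


def to_triples(record: dict, entity_key: str, keep: set[str]) -> list[Triple]:
    """Collapse one record into (entity, relation, value) triples,
    keeping only the fields the downstream step actually consumes."""
    entity = str(record[entity_key])
    return [Triple(entity, k, str(v)) for k, v in record.items() if k in keep]


def render(triples: list[Triple]) -> str:
    """One line per triple: compact for the model, easy to diff in an audit log."""
    return "\n".join(f"{t.entity} | {t.relation} | {t.value}" for t in triples)


record = {
    "customer_id": "C-1029",
    "plan_tier": "enterprise",
    "renewal_date": "2026-09-30",
    "notes": "long free-text history the planner never reads ...",
    "raw_html": "<div>...</div>",
}
brief = render(to_triples(record, "customer_id", {"plan_tier", "renewal_date"}))
print(brief)
```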

4. Persist: give the agent a memory the database already has

The third-order failure mode of long-running agents is re-deriving state on every turn. The agent asks "what is this customer's plan tier" five times in one session, hitting the LLM each time, because nobody wrote that fact down.

A production context stack persists three distinct layers:

  • Session memory: facts established in this conversation, stored as structured key-value, not as chat transcript.
  • User/tenant memory: long-lived facts about the entity the agent is acting on behalf of.
  • Organization memory: canonical business facts — policies, SLAs, product catalog — versioned and authoritative, not "summarized from a Slack thread six months ago".

The database you already own is almost always a better memory than any vector store for anything you can express in a row. The "agentic memory" vendors selling you a second, fuzzier source of truth are selling you a consistency problem.
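To make "structured key-value, not chat transcript" concrete, here is a small in-memory sketch of the session layer. The class names, the key naming scheme, and the `sql://` source strings are illustrative assumptions; a real deployment would presumably back this with the database the post recommends.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Fact:
    value: str
    source: str   # lineage: where the fact came from
    turn: int     # conversation turn at which it was established


@dataclass
class SessionMemory:
    facts: dict[str, Fact] = field(default_factory=dict)

    def remember(self, key: str, value: str, source: str, turn: int) -> None:
        # The newest write wins, so a contradicted fact is replaced rather than carried.
        self.facts[key] = Fact(value, source, turn)

    def recall(self, key: str) -> Optional[Fact]:
        return self.facts.get(key)

    def as_context(self) -> str:
        """Render the facts as compact lines, sorted by key for stable output."""
        return "\n".join(
            f"{k} = {f.value} (src: {f.source})" for k, f in sorted(self.facts.items())
        )


mem = SessionMemory()
mem.remember("customer.plan_tier", "pro", "sql://crm.accounts", turn=1)
mem.remember("customer.plan_tier", "enterprise", "sql://crm.accounts", turn=4)
print(mem.as_context())
```

Looking the plan tier up a second time is now a dictionary read, not another LLM call.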

The context budget: a quick numerical model

To make the cost argument concrete, consider a realistic B2B agent: 10,000 users, 8 agent turns per user per day, average 3 LLM calls per turn.

That is 240,000 LLM calls per day, or ~87.6M per year.

Two stacks:

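| Stack | Avg input tokens | Avg output tokens | $ per 1M input tokens | $ per 1M output tokens | $ per call | Annual cost |
| --- | --- | --- | --- | --- | --- | --- |
| Megaprompt RAG (frontier, 32K stuffed context) | 28,000 | 600 | 3.00 | 15.00 | 0.093 | $8.15M |
| Engineered context (tiered, reranked, compressed) | 3,500 | 500 | 3.00 | 15.00 | 0.018 | $1.58M |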

Same task, same model family, same user experience — a 5.2× cost delta from context discipline alone. That delta compounds with model tiering (cheap model for easy turns, frontier only when the router is uncertain), at which point you are looking at the 90%+ reductions that the compound AI post describes.
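The arithmetic behind those figures is short enough to check by hand or in a few lines of Python. The inputs are the ones stated above (users, turns, calls, token counts, and per-million prices); the function name is just a label for this sketch.

```python
def cost_per_call(in_tokens: int, out_tokens: int,
                  usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Dollar cost of one call given token counts and per-million-token prices."""
    return in_tokens * usd_per_m_in / 1e6 + out_tokens * usd_per_m_out / 1e6


calls_per_day = 10_000 * 8 * 3          # users x turns per user x LLM calls per turn
calls_per_year = calls_per_day * 365

mega = cost_per_call(28_000, 600, 3.00, 15.00)   # megaprompt RAG stack
lean = cost_per_call(3_500, 500, 3.00, 15.00)    # engineered context stack

print(f"calls/year: {calls_per_year:,}")
print(f"megaprompt: ${mega:.3f}/call, ${mega * calls_per_year / 1e6:.2f}M/year")
print(f"engineered: ${lean:.3f}/call, ${lean * calls_per_year / 1e6:.2f}M/year")
print(f"ratio: {mega / lean:.1f}x")
```

Input tokens dominate both bills: in the megaprompt stack they account for $0.084 of the $0.093 per call, which is why trimming context moves the total so much.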

For a CFO, this is the relevant chart. Nobody's board cares that your prompt is eloquent. They care that your AI line item is not a leak.

A minimal context engine in Python

To make the architecture concrete, here is a skeletal context engine. It is not a library; it is the shape of the thing you should have before you wire a single tool call to a frontier API.

from dataclasses import dataclass, field
from typing import Callable, Protocol
 
@dataclass
class Candidate:
    source: str           # e.g. "kb://policies/v4#sec-3.2"
    text: str
    score: float = 0.0
    tokens: int = 0
 
@dataclass
class Task:
    intent: str           # "classify", "extract", "answer", "plan", ...
    query: str
    budget_tokens: int
    budget_latency_ms: int
 
class Retriever(Protocol):
    def fetch(self, task: Task) -> list[Candidate]: ...
 
class Ranker(Protocol):
    def rerank(self, task: Task, cands: list[Candidate]) -> list[Candidate]: ...
 
class Compressor(Protocol):
    def compress(self, task: Task, cands: list[Candidate]) -> list[Candidate]: ...
 
@dataclass
class ContextEngine:
    retrievers: list[Retriever]
    ranker: Ranker
    compressor: Compressor
    memory: Callable[[Task], list[Candidate]] = field(default=lambda _: [])
 
    def build(self, task: Task) -> str:
        # 1. Select: fan out across retrievers + pull structured memory
        pool: list[Candidate] = list(self.memory(task))
        for r in self.retrievers:
            pool.extend(r.fetch(task))
 
        # 2. Rank: keep only what earns its tokens
        ranked = self.ranker.rerank(task, pool)
 
        # 3. Compress: task-conditioned distillation to fit the budget
        packed = self.compressor.compress(task, ranked)
 
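        # 4. Enforce the hard token budget deterministically.
        #    Candidates arrive best-first, so stop at the first one that does not fit.
        out, used = [], 0
        for c in packed:
            # Guard: a candidate whose token count was never set (tokens == 0) would
            # bypass the budget, so fall back to a rough ~4 characters/token estimate.
            cost = c.tokens or max(1, len(c.text) // 4)
            if used + cost > task.budget_tokens:
                break
            out.append(c)
            used += cost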
 
        return self._render(task, out)
 
    def _render(self, task: Task, cands: list[Candidate]) -> str:
        # Explicit, auditable block structure — one block per source, with lineage.
        blocks = [f"<ctx src=\"{c.source}\">\n{c.text}\n</ctx>" for c in cands]
        return f"# TASK: {task.intent}\n\n" + "\n\n".join(blocks) + f"\n\n# QUERY\n{task.query}\n"

Three things about this skeleton are non-negotiable in any real deployment.

  • The budget is enforced in code, not in the prompt. A model will not reliably "only use the top 5 sources if you are short on tokens". Your pipeline will.
  • Every injected block carries its source. This is the lineage that turns the agent from a liability into an auditable artifact. When a compliance officer asks why the model said X, you can point at the rendered context, not at vibes.
  • Memory is a first-class retriever, not an afterthought. The memory hook is where your warehouse, your session state, and your tenant facts enter the context. Skipping it is how you pay a frontier model to re-derive facts that a SELECT would have answered in 4 milliseconds.

Anti-patterns that are currently costing enterprises real money

After a few dozen engagements across banking, retail, and industrial clients, the same failure modes keep showing up. If your stack does any of these, you have a context engineering problem wearing a prompt engineering mask.

The megaprompt. A 12,000-token system prompt that has grown by accretion for 18 months and that nobody dares touch. Every new edge case adds a paragraph. Fixes to one section break another. This is not a prompt — it is an unversioned, untested, untyped configuration file written in natural language. Replace it with a typed task descriptor, a retrievable policy corpus, and a thin system prompt that only encodes role and output schema.
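One possible shape for that replacement is sketched below. `TaskDescriptor`, its fields, and the prompt wording are hypothetical; the idea being illustrated is only that policies are referenced by id and fetched at runtime, while the system prompt carries role and output schema and nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDescriptor:
    role: str                       # who the model is acting as
    intent: str                     # "classify", "answer", ...
    output_fields: tuple[str, ...]  # keys the caller expects back
    policy_refs: tuple[str, ...]    # policy ids to retrieve at runtime, never pasted inline


def thin_system_prompt(td: TaskDescriptor) -> str:
    """Render only role and output schema; policy text arrives via retrieval."""
    fields = ", ".join(td.output_fields)
    return (
        f"You are {td.role}. Task: {td.intent}.\n"
        f"Reply with a JSON object containing exactly these keys: {fields}."
    )


refund_triage = TaskDescriptor(
    role="a support triage assistant",
    intent="classify",
    output_fields=("category", "priority"),
    policy_refs=("kb://policies/v4#sec-3.2",),
)
prompt = thin_system_prompt(refund_triage)
print(prompt)
```

Because the descriptor is a frozen dataclass, it can be versioned, diffed, and unit-tested like any other configuration object.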

RAG stuffing. "We just retrieve the top-20 chunks and paste them into the prompt." No reranker, no deduplication, no query-specific compression. This is a classic cost disaster: you are paying frontier prices to have the model do the ranking it could have gotten from a $0.0002 cross-encoder call.

The shared memory blob. A single growing JSON that every agent step reads and writes, stored as chat transcript rather than structured state. Within 5–10 turns it becomes a graveyard of contradicted plans, dead tool outputs, and stale user preferences. Every call pays to ignore most of it.

Tool-output hoarding. The agent calls a tool, gets back 40KB of JSON, and pastes all of it into the next LLM call "in case it is useful". It is not useful. Extract the two fields the next step needs. Discard the rest. Log the full response for observability, not for the context.
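A sketch of the "extract the two fields" step, assuming the tool response is nested JSON and the next step's needs can be written as dotted paths. The payload and the path syntax are invented for illustration.

```python
import json


def pluck(payload: dict, path: str):
    """Follow a dotted path such as 'order.status' through nested dicts."""
    node = payload
    for part in path.split("."):
        node = node[part]
    return node


def project(payload: dict, paths: list[str]) -> dict:
    """Keep only the named paths; everything else stays out of the context."""
    return {p: pluck(payload, p) for p in paths}


tool_response = {
    "order": {
        "id": 8812,
        "status": "pending",
        "items": [{"sku": f"S{i}", "qty": 1} for i in range(200)],
    },
    "debug": {"trace": "x" * 5000},
}

slim = project(tool_response, ["order.id", "order.status"])
full_bytes = len(json.dumps(tool_response))   # what hoarding would inject
slim_bytes = len(json.dumps(slim))            # what the next step actually needs
print(slim, f"{slim_bytes} of {full_bytes} bytes kept")
```

The full `tool_response` still belongs in the observability log; only `slim` goes into the next call.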

No offline evaluation of the retrieval layer. Teams A/B test prompt wordings for weeks but have never measured recall@k of their retriever against a labeled set. The retriever is where 70% of your quality ceiling lives. Measure it.
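Measuring it does not require much machinery. A minimal recall@k sketch follows, assuming a labeled set of (retrieved ids, relevant ids) pairs per query; the document ids are placeholders.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("each labeled query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over a labeled set of (retrieved, relevant) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


labeled = [
    (["d1", "d7", "d3"], {"d3", "d9"}),   # query 1: one of two relevant docs, at rank 3
    (["d2", "d4", "d5"], {"d2"}),         # query 2: the relevant doc is at rank 1
]
print(mean_recall_at_k(labeled, k=2))
print(mean_recall_at_k(labeled, k=3))
```

Tracking this number per retrieval modality shows whether a quality problem sits in the retriever or further downstream.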

Where the B2B money is

Context engineering is not a research topic. It is a procurement line item that enterprises are actively funding in 2026. The engagement shapes that consistently pay back are:

  • Agent context audits. Take an existing production agent that is "working but expensive / slow / flaky" and instrument it for context utilization. Measure, per call, which injected tokens were actually attended to and which were waste. Typical finding: 60–85% of injected tokens are dead weight. Typical outcome: 3–8× cost reduction with improved accuracy, in 4–8 weeks.

  • Retrieval pipeline rebuilds. Replace a single-modality "vector DB only" stack with a hybrid retrieval + graph + warehouse architecture, add a reranker, and wire in structured memory. This is the single highest-leverage intervention I see in enterprise AI programs, and it is blocked in 80% of cases not by budget but by the political question of who owns the retriever — data engineering or the AI team.

  • Tiered inference with engineered contexts. Small models with tight, compressed contexts on the hot path; frontier calls only when the router is uncertain; all grounded on tabular data where the business actually lives. This is where the 90%+ cost reductions live.

  • Context observability. Per-call lineage dashboards: which sources were retrieved, which were ranked in, which were compressed out, which were finally injected. Almost no enterprise has this. All of them need it. It is the "Datadog for agents" layer, and it is currently table stakes for anything touching regulated data.
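The per-call lineage record behind such a dashboard can be very plain. The sketch below assumes one JSON log line per model call; `ContextTrace` and its field names are hypothetical, chosen to mirror the four stages named in the last bullet.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ContextTrace:
    call_id: str
    retrieved: list[str] = field(default_factory=list)       # every source fetched
    ranked_in: list[str] = field(default_factory=list)       # survived reranking
    compressed_out: list[str] = field(default_factory=list)  # dropped by compression
    injected: list[str] = field(default_factory=list)        # actually sent to the model

    def dropped(self) -> list[str]:
        """Sources that were retrieved but never reached the model."""
        kept = set(self.injected)
        return [s for s in self.retrieved if s not in kept]

    def to_log_line(self) -> str:
        """Serialize to one JSON line, with sorted keys so logs diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


trace = ContextTrace(
    call_id="call-001",
    retrieved=["kb://a", "kb://b", "sql://c"],
    ranked_in=["kb://a", "sql://c"],
    compressed_out=[],
    injected=["kb://a", "sql://c"],
)
print(trace.to_log_line())
print("dropped:", trace.dropped())
```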

The next twelve months

Two predictions for the rest of 2026, both of which will look obvious in hindsight.

First, "context engineer" will become a standard job title, in the way "data engineer" became one between 2015 and 2019. The skill is distinct: retrieval architectures, reranker ops, compression strategies, token-budget arithmetic, memory schema design, and evaluation pipelines for agent loops. Most of the people currently doing this work have job titles like "senior ML engineer" or "applied AI lead" and will rebrand accordingly.

Second, the context layer will be where the next wave of AI infra spend lands. Not more GPUs. Not more frontier tokens. Rerankers, structured memory stores, context observability, and offline evaluation infrastructure for retrieval and ranking. The CFO-approvable pitch for all of it is the same: we are already spending $X on LLM calls; tightening the context cuts it by 3–10× at equal or better quality. That is a business case, not a research agenda.

Prompt engineering is not coming back. It was a transitional skill for a transitional era. What replaces it is a real engineering discipline — unglamorous, pipeline-shaped, auditable, and measurable — sitting exactly where enterprises make or lose money on AI.