
Why Agent Routing Matters More Than Prompt Engineering in Production AI

Prompt engineering has a hard ceiling. The real fix is routing precision tasks out of the model and into deterministic code.

The model said it understood, then hallucinated for 40 minutes.

It was a scheduled job. Simple task: read a log file, extract error counts, post a summary to Slack. The prompt was three paragraphs of explicit instructions, edge cases, and output format requirements. The model returned a beautifully formatted report. Every number was wrong.

Not slightly wrong. Invented. The log file had 47 errors. The report said 12. The “top error category” did not exist in the source file. The model had fabricated a plausible-sounding summary because the prompt asked for one, and it was easier to generate confidence than admit uncertainty.

This is not a rare failure mode. It is the default failure mode of mid-tier models operating on unfamiliar technical work. And it is why prompt engineering — the act of writing better instructions — has a hard ceiling in production AI systems.

The Failure Mode: Confident Confabulation

Large language models are trained to produce coherent, helpful-sounding text. When faced with a task they cannot perform accurately, they do not pause and ask for clarification. They generate. The result is a confabulation: a response that is syntactically correct, semantically plausible, and factually wrong.

In a chat interface, this is annoying. In an automated pipeline, it is dangerous. A bad Slack summary is recoverable. A bad trading signal, a bad infrastructure check, or a bad deployment decision is not.

The problem is not that the model is “stupid.” The problem is that the architecture assumes the model can be trusted to handle the task if the instructions are good enough. That assumption breaks down the moment the task involves data the model has not seen, formats it has not been trained on, or logic that requires precise computation rather than pattern matching.

Why Better Prompts Did Not Fix It

We tried. We added few-shot examples. We added chain-of-thought reasoning steps. We added explicit constraints: “If you are uncertain, say so.” None of it worked reliably.

The reason is structural. Prompt engineering operates on the input layer of a black box. You are trying to influence behavior by shaping what goes in, without any control over what happens inside. When the model’s internal weights produce a high-confidence hallucination, no amount of prompt refinement can override that. The model does not “know” it is wrong. It does not have access to ground truth. It only has access to its training distribution, and your task is outside it.

This is especially true for mid-tier models — the ones most teams actually run in production because they are fast and cheap. A frontier model might catch its own error. A mid-tier model will not. It will produce the wrong answer with the same tone of authority as the right one.

What This Reveals About Model Calibration

Calibration is the alignment between a model’s confidence and its actual accuracy. A well-calibrated model says “I am 90% sure” and is right 90% of the time. Most mid-tier models are poorly calibrated on technical tasks. They are overconfident on unfamiliar inputs and underconfident on familiar ones.
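Calibration is also measurable on your own traffic: bucket the model's stated confidence and compare it against observed accuracy. A minimal sketch; the (confidence, correct) pairs are assumed to come from a labeled evaluation set of your own:

    from collections import defaultdict

    def calibration_report(results: list[tuple[float, bool]], bins: int = 10) -> None:
        """Bucket (stated confidence, was_correct) pairs and compare
        stated confidence against observed accuracy per bucket."""
        buckets = defaultdict(list)
        for confidence, correct in results:
            buckets[min(int(confidence * bins), bins - 1)].append(correct)
        for b in sorted(buckets):
            hits = buckets[b]
            print(f"stated ~{(b + 0.5) / bins:.0%}  "
                  f"observed {sum(hits) / len(hits):.0%}  (n={len(hits)})")

    # Hypothetical eval results: (model's stated confidence, whether it was right).
    calibration_report([(0.95, False), (0.9, True), (0.9, False), (0.6, True)])

A well-calibrated model produces buckets where the two columns roughly match. On technical tasks, mid-tier models rarely do.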

In an agent stack, this means you cannot use the model’s output tone as a signal of reliability. A hesitant response might be correct. A confident response might be fabricated. The only way to know is to verify against ground truth — which means you need an architecture that routes tasks to verifiable execution paths, not just a better prompt.

The Architectural Fix: Route Hard Tasks Before the Model Improvises

The solution is not to write better prompts. It is to stop asking the model to do things it cannot do reliably.

At Tacavar, we migrated our scheduled workloads to an explicit routing layer. The pattern is simple, and sketched in code after the steps:

  1. Classify the task before sending it to any model. Is this a pattern-matching task (summarize text, classify sentiment) or a precision task (count errors, parse structured data, compute a signal)?
  2. Route pattern-matching tasks to the model. These are what LLMs are good at. The prompt can be minimal because the task is inside the model’s training distribution.
  3. Route precision tasks to deterministic code. Parse the log with regex. Count with Python. Compute the signal with pandas. The model never sees the raw data. It only sees the verified output.
  4. Use the model for what remains. Synthesize the verified results into human-readable summaries. Generate the Slack message from structured data the model did not produce.
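In code, the pattern is a thin classifier in front of two execution paths. A minimal sketch in Python, using the log-summary job from the opening as the running example; call_model, the task names, and the log format are illustrative assumptions, not our production code:

    import re

    def call_model(prompt: str) -> str:
        """Hypothetical stand-in for your LLM provider's API call."""
        raise NotImplementedError

    def count_errors(log_text: str) -> dict:
        """Precision path: deterministic parsing. The model never sees raw logs."""
        categories = re.findall(r"^ERROR\s+(\S+)", log_text, flags=re.MULTILINE)
        counts: dict[str, int] = {}
        for category in categories:
            counts[category] = counts.get(category, 0) + 1
        return {"total": len(categories), "by_category": counts}

    def dispatch(task: str, payload: str) -> str:
        # Step 1: classify before any model call.
        if task == "count_log_errors":
            # Step 3: precision work happens in deterministic code...
            verified = count_errors(payload)
            # Step 4: ...and the model only narrates verified, structured output.
            return call_model(f"Write a one-paragraph Slack summary of: {verified}")
        if task == "summarize_text":
            # Step 2: pattern-matching work goes straight to the model.
            return call_model(f"Summarize:\n{payload}")
        raise ValueError(f"Unrouted task type: {task}")

The point is not the regex. The point is that the error count in the Slack message can only come from re.findall, never from the model's imagination.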

This is the opposite of the “agent does everything” architecture that is popular in demo videos. It is less elegant. It requires more code. It also does not hallucinate for 40 minutes.

How Tacavar Migrated Scheduled Workloads

Our stack runs across two droplets with a distributed job queue. Previously, scheduled tasks were sent to a single agent endpoint with a detailed prompt. The agent had access to logs, APIs, and databases. It was expected to figure out what to do.

We replaced that with a dispatcher that knows the task type before it sends anything to a model:

  • Health checks → deterministic Python scripts that parse system metrics and return JSON.
  • Log analysis → structured queries against Loki/Tempo, results fed to the model as pre-digested tables.
  • Trading signal generation → raw FRED API data processed through a validated pandas pipeline, model only writes the narrative summary.
  • Deployment decisions → never touch a model. Pure infrastructure code with explicit gates.
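The dispatcher itself reduces to an explicit table from task type to handler, failing loud on anything unrouted. A sketch of the shape; the handler names are stand-ins for the real implementations:

    from typing import Callable

    def run_health_check(payload: dict) -> dict: ...        # pure Python, returns JSON
    def analyze_logs(payload: dict) -> dict: ...            # Loki/Tempo query, model narrates
    def generate_trading_signal(payload: dict) -> dict: ... # pandas pipeline, model writes summary
    def gate_deployment(payload: dict) -> dict: ...         # never touches a model

    # Explicit routing table: every scheduled task type maps to a known handler.
    ROUTES: dict[str, Callable[[dict], dict]] = {
        "health_check": run_health_check,
        "log_analysis": analyze_logs,
        "trading_signal": generate_trading_signal,
        "deployment": gate_deployment,
    }

    def run_job(job: dict) -> dict:
        handler = ROUTES.get(job["type"])
        if handler is None:
            # An unrouted task is a bug to surface, not a prompt to improvise on.
            raise ValueError(f"No route for task type: {job['type']}")
        return handler(job["payload"])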

The model’s role shrank. Its accuracy improved. The system became debuggable because every failure path was explicit: either the data was wrong, the code was wrong, or the prompt was wrong. You no longer had to guess which layer produced the hallucination.

What Founders Should Audit in Their Own Stack

If you are running AI automation in production, ask these questions:

1. Does any task mix pattern-matching and precision work in the same prompt?

If you are asking a model to both interpret unstructured text and compute exact values, you have a routing problem. Split them.
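Concretely, the split replaces one combined prompt with a compute step followed by a narrate step. A hedged sketch; the metrics and prompt wording are illustrative:

    import statistics

    # Before: one prompt asked the model to read raw data AND compute exact values.
    # After: code computes, the model only interprets.
    def compute_latency_metrics(latencies_ms: list[float]) -> dict:
        # Precision work: exact values come from code, never from the model.
        return {
            "p50_ms": statistics.median(latencies_ms),
            "max_ms": max(latencies_ms),
            "samples": len(latencies_ms),
        }

    metrics = compute_latency_metrics([12.0, 15.5, 9.8, 210.4])
    # Pattern-matching work: the model explains numbers it did not produce.
    prompt = f"Explain these latency metrics to an on-call engineer: {metrics}"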

2. Can you trace every output back to a deterministic step?

If a summary contains a number, you should be able to point to the exact line of code or database query that produced it. “The model said so” is not a valid provenance chain.
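One way to enforce this is to carry provenance alongside every number, so the final summary can only cite values produced by code. A minimal sketch of the shape; the field names are assumptions, not a standard:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ProvenancedValue:
        value: float
        source: str       # the exact query or function that produced the value
        computed_by: str  # "code" or "model"; summaries should only cite "code"

    error_count = ProvenancedValue(
        value=47,
        source="count_errors('app.log') via re.findall(r'^ERROR')",
        computed_by="code",
    )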

3. Do you have a classification layer before the execution layer?

If every task goes to the same endpoint with the same model, you are using the model as a Swiss Army knife instead of building a router. It will cut you.

4. Are your prompts getting longer over time?

Longer prompts are a smell. They indicate you are trying to compensate for architectural gaps with instruction verbosity. The fix is routing, not more text.

5. Do you test for hallucinations, or just for correctness?

Correctness tests pass when the answer is right. Hallucination tests check that when the system cannot be right, it fails in a specific, detectable way instead of producing something plausible. You need both. Feed your system inputs designed to trigger confabulation and verify that it routes them to code instead of generating fiction.
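A hallucination test in this style feeds an input built to tempt confabulation and asserts that the deterministic path, not the model, produced the numbers. A sketch using pytest, assuming the count_errors and dispatch helpers from the routing sketch above:

    import pytest

    def test_error_counts_come_from_code():
        # A log crafted so a model would plausibly invent counts and categories.
        log = "ERROR auth token expired\n" * 47 + "INFO startup complete\n"
        result = count_errors(log)
        # Ground truth is knowable, so the assertion can be exact.
        assert result["total"] == 47
        assert result["by_category"] == {"auth": 47}

    def test_unrouted_tasks_fail_loud_instead_of_improvising():
        with pytest.raises(ValueError):
            dispatch("invent_a_report", "no data to be found here")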

The Hard Truth

Prompt engineering is a local optimization. It makes a bad architecture slightly less bad. It does not fix the fundamental issue: models are not reliable execution engines for arbitrary technical tasks.

Agent routing is a global optimization. It changes what you ask the model to do, not how you ask it. The result is a system where the model’s strengths — pattern recognition, synthesis, natural language generation — are used intentionally, and its weaknesses — precise computation, unfamiliar data formats, long-horizon reasoning — are bypassed entirely.

This is not theoretical. At Tacavar, the migration reduced our scheduled job error rate by an order of magnitude. The failures that remain are code bugs, not model hallucinations. Code bugs are debuggable. Hallucinations are not.

If you want production-grade AI workflows instead of brittle prompt stacks, you need an architecture that knows what the model can do and refuses to let it try the rest.


You built it. We optimize it.

Tacavar designs autonomous systems with explicit routing layers, deterministic execution paths, and model boundaries that match actual capability.