Shipping agentic AI to real users is not the same as building a demo

2026-03-23 · 4 min read

Why shipping agentic features in production is mostly software engineering, not AI.

A demo of an agentic feature is a controlled environment. The example query was chosen because it works. The model behaved well three times in a row. The user is on the team building the thing. The audience is impressed, and they should be, because the demo is genuinely interesting.

A demo of an agentic feature has almost nothing in common with a shipped agentic feature. I have spent the last year shipping agentic AI inside two products. The most useful thing I can say about that work is that the part of it that looks like AI is the part that takes the least effort. The part of it that looks like ordinary product engineering is where the actual difficulty lives.

I will give one example. We had a feature where an agent decides whether a user’s input justifies escalating to a human reviewer. The interesting question is what the agent should do. The hard question is what happens when the agent is uncertain. In a demo, you handle the uncertain case by picking an example that is not uncertain. In production you handle it by deciding, in advance, what the system does when the model returns a low-confidence result, what it does when the model times out, what it does when the model returns valid JSON that does not match the schema you expected, what it does when the model returns text instead of JSON, what it does when the model is rate-limited, what it does when the model is unavailable, and what it does when the model returns the right answer ninety-nine times in a row and then the hundredth time returns something that would embarrass the company.
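A minimal sketch of what deciding in advance can look like, not the production code. The names `decide_escalation`, `Decision`, and `CONFIDENCE_FLOOR`, the JSON fields `escalate` and `confidence`, and the choice to send every failure to a human are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical threshold; a real value would come from evals, not a guess.
CONFIDENCE_FLOOR = 0.8


@dataclass(frozen=True)
class Decision:
    escalate: bool
    reason: str  # recorded so every path is visible in logs


def decide_escalation(call_model: Callable[[str], str], user_input: str) -> Decision:
    """Map every model outcome to a decision chosen ahead of time.

    Assumption: when anything goes wrong, a human reviews the input.
    """
    try:
        raw = call_model(user_input)
    except TimeoutError:
        return Decision(True, "model_timeout")
    except Exception:
        # Rate limits, outages, and anything else the client raises.
        return Decision(True, "model_unavailable")

    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        # The model returned prose (or nothing) instead of JSON.
        return Decision(True, "not_json")

    if not isinstance(payload, dict):
        return Decision(True, "schema_mismatch")

    escalate = payload.get("escalate")
    confidence = payload.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (
        not isinstance(escalate, bool)
        or isinstance(confidence, bool)
        or not isinstance(confidence, (int, float))
        or not 0.0 <= confidence <= 1.0
    ):
        return Decision(True, "schema_mismatch")

    if confidence < CONFIDENCE_FLOOR:
        return Decision(True, "low_confidence")

    return Decision(escalate, "model_decision")
```

The case this cannot catch is the hundredth answer that is well-formed, confident, and wrong. That one needs review and evals rather than another branch.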

Those are not AI questions. They are software engineering questions. Most of them have been solved before, in other contexts: timeouts, retries, fallbacks, structured outputs, schema validation, observability. These are the same techniques you would apply to any unreliable external service, applied to one that happens to be probabilistic and is being asked to make decisions a human used to make.

What changes is that the failure modes are richer. A traditional API call either succeeds, fails, or times out. An LLM call can succeed with a wrong answer, succeed with a partially right answer, succeed in a way that looks right but is missing a critical piece, or succeed in a way that is overconfidently wrong. The set of bad outcomes is larger and more confusing. The cost of treating an LLM call as an ordinary API call is that you ship code that handles success, failure, and timeout, and silently fails on everything else.

The architectural choice that has mattered most to me, across the work I have shipped, is the discipline of treating every external interaction as a capability that should be bounded. Whether the capability is an LLM, a tool the model uses, a knowledge source the model retrieves, or an integration with an external service, the system should route through a layer that knows what the capability is supposed to do, what its failure modes are, what to do when it fails, and how to log what happened. Calls to the model do not get scattered across the codebase. The codebase asks the boundary layer for the capability it needs.
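One way such a boundary could be shaped, as a sketch under assumptions rather than a description of any particular codebase. The `Capability` class, its field names, and the retry count are hypothetical.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")
log = logging.getLogger("capabilities")


@dataclass
class Capability(Generic[T]):
    """A bounded capability: it knows what it should return, how to
    check the result, what to do on failure, and how to log it."""

    name: str
    run: Callable[[str], T]        # the LLM, tool, retrieval, or integration call
    validate: Callable[[T], bool]  # what a usable result looks like
    fallback: Callable[[str], T]   # what to do when the capability fails
    max_attempts: int = 2          # hypothetical default

    def __call__(self, request: str) -> T:
        for attempt in range(1, self.max_attempts + 1):
            started = time.monotonic()
            try:
                result = self.run(request)
            except Exception as exc:
                log.warning("%s attempt %d raised %r", self.name, attempt, exc)
                continue
            elapsed = time.monotonic() - started
            if self.validate(result):
                log.info("%s ok in %.3fs", self.name, elapsed)
                return result
            log.warning("%s attempt %d returned an invalid result", self.name, attempt)
        log.error("%s exhausted %d attempts, using fallback", self.name, self.max_attempts)
        return self.fallback(request)
```

Callers then write something like `summarize(text)` and never touch the model client directly, so retry policy, validation, and logging change in one place.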

That sounds like generic software hygiene. It is. The point is that the LLM does not change what good software looks like. It makes good software more important.

The other thing I would say after a year of this is that the model is the smallest part of the feature. The model is what the feature can do at peak. The wrapping is what the feature actually does most of the time. Most of the work that ends up sitting next to an agentic feature is structured output handling, schema design, deterministic fallback paths that work without the model at all, observability that tells you when the model is degrading, evals that catch regressions, and user interface decisions about how to present uncertainty. None of that is novel engineering. All of it has to be done well or the feature is a demo.
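Of the items in that list, evals are probably the easiest to start small. A minimal sketch, assuming a hand-labeled set of cases and a decision function that returns a boolean. The cases, the `baseline` number, and the `tolerance` are made up for illustration.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical labeled cases: (user input, should a human review it?)
CASES = [
    ("I want a refund now", True),
    ("thanks, all good", False),
    ("refund my order please", True),
    ("how do I reset my password", False),
]


def pass_rate(decide: Callable[[str], bool], cases: Iterable[Tuple[str, bool]]) -> float:
    """Fraction of labeled cases where the decision matches the label."""
    cases = list(cases)
    if not cases:
        raise ValueError("an eval needs at least one case")
    passed = sum(1 for text, expected in cases if decide(text) == expected)
    return passed / len(cases)


def holds_baseline(
    decide: Callable[[str], bool],
    cases: Iterable[Tuple[str, bool]],
    baseline: float,
    tolerance: float = 0.02,
) -> bool:
    """True if the pass rate has not dropped below baseline minus tolerance."""
    return pass_rate(decide, cases) >= baseline - tolerance
```

Run against every prompt or model change, a check like this turns "the model feels worse" into a failing build.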

If I were starting from scratch on an agentic project, I would build the deterministic version first. A version of the feature that works without any LLM call, using whatever fallback logic is available. Then I would wrap the LLM around it as a quality improvement, not as the core. The LLM makes the experience better. The deterministic path is what makes the feature shippable.
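As a sketch of that ordering, using the escalation example from earlier. The keyword rules and the `model_opinion` callable are hypothetical. Letting a usable model answer replace the baseline is one possible policy, not the only one.

```python
from typing import Callable, Optional

# Hypothetical rules; the real deterministic path would be domain-specific.
ESCALATION_KEYWORDS = ("refund", "legal", "chargeback")


def rule_based_escalation(text: str) -> bool:
    """Deterministic baseline: works with no model call at all."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in ESCALATION_KEYWORDS)


def should_escalate(
    text: str,
    model_opinion: Optional[Callable[[str], Optional[bool]]] = None,
) -> bool:
    """Use the model as an improvement layered on the deterministic path.

    model_opinion returns True/False, or None when the model abstains
    or its output failed validation.
    """
    baseline = rule_based_escalation(text)
    if model_opinion is None:
        return baseline  # feature still ships with the model switched off
    try:
        opinion = model_opinion(text)
    except Exception:
        return baseline  # model failed; behave exactly as before
    if opinion is None:
        return baseline
    return opinion
```

With the model switched off, `should_escalate("I want a refund")` still returns a sensible answer, which is the property that makes the feature shippable.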

That is the gap between a demo and a product. The demo is the LLM alone. The product is the LLM plus all the engineering that catches the LLM when it fails.