The demo always works. Someone wires up a few AI agents, asks them to pull a report or reconcile a ledger, and it runs end to end on stage. Everyone nods. Then it goes to production, runs a few thousand times against messy real inputs, and the cracks show. The same request starts producing different answers. A step that worked on Tuesday breaks on Thursday because the model provider quietly shipped an update. Six months later the project is shelved.
This is close to the median outcome rather than a rare one, and the reasons are structural, not a matter of better prompting. Below I will lay out why that happens, which jobs runtime AI is actually good for, and the one architectural change that makes most of the problem go away.
The pattern in the numbers
MIT's NANDA initiative studied enterprise generative AI in 2025 and found that roughly 95% of pilots delivered no measurable impact on profit and loss. Only about 5% reached real value (MIT NANDA, 2025). S&P Global Market Intelligence found that 42% of businesses scrapped most of their AI initiatives in early 2025, up from 17% a year earlier, and that the average organization abandoned 46% of its AI proof-of-concepts before they reached production (S&P Global, 2025). Gartner expects more than 40% of agentic AI projects to be canceled by the end of 2027, citing cost, unclear value, and inadequate controls (Gartner, 2025).
One caveat worth stating plainly: the MIT figure measures pilots that showed no P&L impact, not systems that crashed, and the Gartner number is a forecast. Read carefully, though, they point in the same direction. Most of these projects do not die because the model is not smart enough. They die because the thing built around the model cannot be made reliable enough to trust with real work.
Same prompt, different answer
Start with the property that surprises people most: large language models, as they are usually served, are not deterministic, and you cannot fully configure that away. Setting the temperature to zero helps, but it does not make the output repeatable. Researchers at Thinking Machines Lab traced the real cause to a lack of batch invariance in inference servers. The result of a calculation depends on the batch it happens to be processed with, and batch size shifts with server load from one request to the next. In their test, a standard setup produced 80 unique outputs across 1,000 identical prompts. Only after rebuilding the kernels to be batch-invariant did all 1,000 come back identical (Thinking Machines Lab, 2025). Floating-point arithmetic is part of the story too, since adding numbers in a different order can give a slightly different sum, and that is enough to change which word the model picks next.
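To make the floating-point point concrete, here is a minimal Python sketch. The first part is ordinary IEEE 754 behavior. The token names and the near-tie score in the second part are invented for illustration and do not come from any real model or inference server.

```python
# Floating-point addition is not associative: grouping changes the result.
sum_a = (0.1 + 0.2) + 0.3   # one summation order
sum_b = 0.1 + (0.2 + 0.3)   # same numbers, different order
print(sum_a == sum_b)       # False


def pick_token(score_yes: float, score_no: float) -> str:
    """Greedy choice between two hypothetical candidate tokens."""
    return "yes" if score_yes > score_no else "no"


# Hypothetical near-tie: the competing token's score sits at exactly 0.6.
SCORE_NO = 0.6
print(pick_token(sum_a, SCORE_NO))  # yes
print(pick_token(sum_b, SCORE_NO))  # no
```

The difference between the two sums is in the last bit of the number, yet it is enough to flip the greedy choice when two candidates are nearly tied.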
For an everyday illustration, a 600-billion-parameter model asked how many times the letter D appears in "DEEPSEEK" has returned 2, 3, and on some runs 6 or 7 across repeated identical tries (Computerworld, 2025). If a model cannot deterministically count letters in a word, it is worth asking whether it should be the thing deterministically reconciling your ledger.
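For contrast, the same question in ordinary code is a one-liner that gives the same answer on every run. A trivial Python sketch:

```python
word = "DEEPSEEK"

# Ask the same question 1,000 times and collect the distinct answers.
counts = {word.count("D") for _ in range(1000)}
print(counts)  # {1}
```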
The math that turns 85% into one in five
Non-determinism would be survivable if each step were nearly perfect. The trouble is that agents chain steps, and reliability multiplies. If every step in a workflow is independently 95% reliable, a ten-step chain is 0.95 to the tenth power, which is about 60%. At 90% per step, ten steps land near 35%. Take ten steps at 85% each and you are at roughly 20%, a system that fails four times out of five.
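The multiplication is easy to check yourself. This is a minimal sketch that assumes every step succeeds or fails independently of the others; the function name is just a label for the illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


for per_step, steps in [(0.95, 10), (0.90, 10), (0.85, 10)]:
    rate = chain_success(per_step, steps)
    print(f"{steps} steps at {per_step:.0%} each: {rate:.1%} end to end")

# 10 steps at 95% each: 59.9% end to end
# 10 steps at 90% each: 34.9% end to end
# 10 steps at 85% each: 19.7% end to end
```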
This is arithmetic, not a measured statistic, but the mechanism behind it is documented. Research on where agents fail identifies error propagation as the primary bottleneck, with an early mistake cascading through the rest of the run (arXiv, 2025). It can be worse than independent multiplication, because once a model's context contains its own earlier error, it becomes more likely to err again. Five impressive agents do not add up to one impressive system. They multiply down to an unreliable one.
Hallucination is not a bug you patch
Hallucination gets talked about as a defect that the next model will fix. The evidence does not support that hope. OpenAI's own o3 and o4-mini reasoning models hallucinated on 33% and 48% of questions in its PersonQA benchmark, at least double the rate of the earlier o1 (TechCrunch, 2025). The newer, more capable models hallucinated more, not less. A 2025 paper from OpenAI researchers argues the root cause sits in how models are trained and graded, which rewards confident guessing over admitting uncertainty (OpenAI, 2025). Even on the narrow, well-suited task of summarizing a document faithfully, the best models on Vectara's public leaderboard still hallucinate a couple of percent of the time, and many do far worse (Vectara). A couple of percent is fine for a first draft a human will read. It is not fine for a number that goes on a regulatory filing.
The model changes and nobody tells you
Even if you got an agent stable, you do not own the model underneath it. Providers retire and retune models on their own schedule. OpenAI announced in January 2026 that it was retiring six models, including GPT-4o and GPT-4.1, with a cutoff weeks later (OpenAI deprecations). Anthropic gives a minimum of 60 days' notice before retiring a released model (Anthropic deprecations). When the model behind your workflow changes, prompts that were tuned for the old one can quietly regress, structured output can shift just enough to break a JSON parser, and you find out from a downstream failure rather than a changelog. You are building on ground that moves.
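Here is a small sketch of how little drift it takes to break a parser. The reply shape, the field name total, and the sample replies are all hypothetical; the point is only that a strict check turns a silent format shift into an immediate, visible error.

```python
import json
from decimal import Decimal


def parse_total(raw: str) -> Decimal:
    """Parse a reply that is expected to look like {"total": 1204.5}."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or "total" not in payload:
        raise ValueError("reply is missing the 'total' field")
    value = payload["total"]
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"'total' should be a number, got {type(value).__name__}")
    return Decimal(str(value))


# Hypothetical replies before and after an unannounced model update.
old_reply = '{"total": 1204.5}'
drifted_replies = [
    '{"total": "1,204.50"}',                  # number became a formatted string
    'Here is the result: {"total": 1204.5}',  # prose wrapped around the JSON
    '{"total_amount": 1204.5}',               # field renamed
]

print(parse_total(old_reply))
for reply in drifted_replies:
    try:
        parse_total(reply)
    except ValueError as err:
        print("rejected:", err)
```

A check like this does not stop the model from changing. It only moves the discovery from a downstream report to the point where the reply is read.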
Where runtime AI actually belongs
None of this means AI is the problem. It means the variability that makes a model wonderful for some jobs makes it a liability for others. For open-ended work, that variability is a feature. Drafting copy, brainstorming options, summarizing a document for a person who will read it critically, exploring a messy dataset: a human is in the loop, and a fresh phrasing each time is welcome.
For a recurring process that produces a number someone has to sign, the same variability is a defect. As Anthropic's head of financial services put it, in financial services people do not have the luxury of inconsistent outputs, and regulators expect that an action can be reproduced with the same outcome years later (via GAO, 2025). A reconciliation that lands on a different answer depending on server load is not automation you can defend in an audit. The honest rule is to match the tool to who consumes the output. Creativity for people, determinism for ledgers.
Compile the intelligence once, then run code
You do not have to keep the model in the runtime loop to get the benefit of it, though most teams never try taking it out. You can use AI where it shines, at design time, and run something deterministic in production.
Think about a compiler. You do not ship the compiler to every customer and recompile your application on every click. You compile once, ship the deterministic binary, and run it a billion times identically. Infrastructure-as-code works the same way: you express intent once, generate a validated artifact, and run it the same across environments. Text-to-SQL is the everyday version, where a model writes the query once and the database executes it deterministically forever. A 2026 research paper on "Compiled AI" describes exactly this trade. After a compile phase, workflows execute deterministically with no further model calls, which the authors frame as trading runtime flexibility for output you can predict, audit, and run cheaply (arXiv, 2026).
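The text-to-SQL case is small enough to sketch in Python with the standard library's sqlite3 module. The table, the rows, and the query are invented for illustration; imagine the query string was drafted once by a model, reviewed by a person, and then stored as a constant.

```python
import sqlite3

# Hypothetical query, written once at design time and then frozen.
FROZEN_QUERY = """
SELECT account, SUM(amount_cents) AS balance_cents
FROM ledger
GROUP BY account
ORDER BY account
"""


def run_report(conn: sqlite3.Connection) -> list:
    """Runtime path: execute the stored query. No model call happens here."""
    return conn.execute(FROZEN_QUERY).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ledger (account TEXT, amount_cents INTEGER)")
    conn.executemany(
        "INSERT INTO ledger VALUES (?, ?)",
        [("cash", 10000), ("cash", -2550), ("payables", -4000)],
    )
    first = run_report(conn)
    repeats = [run_report(conn) for _ in range(1000)]
    conn.close()
    # True only if every repeat matched the first result exactly.
    return first, all(result == first for result in repeats)


if __name__ == "__main__":
    print(demo())
```

Amounts are stored as integer cents so that the sums are exact, and the ORDER BY clause pins the row order so that repeated runs return identical lists.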
That is the bet Dittah is built on. You describe a workflow in plain English, AI generates Python you can read and test against real data, and publishing freezes that code into a versioned, immutable artifact. Production runs execute the frozen code: same input, same output, no model in the loop, and no per-run token bill. The intelligence went into building the workflow. It does not have to be re-summoned, with all its variance, every time the workflow runs. You can watch that happen in the demo, and we worked through the cost side of the same idea separately.
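As a rough picture of what "freezing" can mean, here is a minimal Python sketch. It is not Dittah's implementation; PublishedWorkflow, publish, run, and the generated reconcile function are hypothetical names, and a real system would need sandboxing and storage that this leaves out.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical code produced at design time, then reviewed and tested by a person.
GENERATED_SOURCE = '''
def reconcile(rows):
    """Sum integer cents per account. Plain arithmetic, no model call."""
    totals = {}
    for account, cents in rows:
        totals[account] = totals.get(account, 0) + cents
    return dict(sorted(totals.items()))
'''


@dataclass(frozen=True)  # fields cannot be reassigned after creation
class PublishedWorkflow:
    version: int
    source: str
    sha256: str


def _digest(source: str) -> str:
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def publish(source: str, version: int) -> PublishedWorkflow:
    """Freeze the source together with a content hash that identifies it."""
    return PublishedWorkflow(version=version, source=source, sha256=_digest(source))


def run(workflow: PublishedWorkflow, rows):
    """Execute the frozen code after checking it still matches its hash."""
    if _digest(workflow.source) != workflow.sha256:
        raise RuntimeError("published source does not match its recorded hash")
    namespace = {}
    exec(workflow.source, namespace)  # sketch only: trusted, reviewed code
    return namespace["reconcile"](rows)


workflow_v1 = publish(GENERATED_SOURCE, version=1)
rows = [("cash", 10000), ("payables", -4000), ("cash", -2550)]
print(run(workflow_v1, rows))  # {'cash': 7450, 'payables': -4000}
```

The hash gives every run something to point at: an auditor can ask which version produced a number and rerun exactly that code on the same input.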
Bottom line
Agents fail in production for reasons you cannot prompt your way out of: outputs vary on identical inputs, errors compound across steps, hallucination is structural, and the model itself moves under you. For conversational and creative work, that is a price worth paying. For reports, reconciliations, and anything an auditor will read, it is not. The way through is not a better agent. It is to let AI do the creative work once, at design time, and let deterministic code do the running. If your workflow produces a number someone signs, test the frozen-code approach against your messiest recurring process and see whether it holds where the demo did not.
Sources are linked inline and reflect material available as of February 2026. Reliability figures from surveys and forecasts are presented as such; the step-reliability math is an illustration, not a measured result.