We ran a benchmark. Five real enterprise data tasks, correct answers computed in advance by independent code, every result graded to the cent. On one side, frontier models like Claude and Gemini over a cloud API. On the other, an open-source model running on a single desktop machine, no GPU cluster, no internet required.
The headline numbers: the open model passed four of the five tasks on the first try, five of five inside a verification loop, and turned a raw CSV into a verified four-page executive report in about a minute, on hardware that never sent a byte outside the building. And yet the frontier models kept an advantage the open model could not close on its own, one that shows up exactly where production systems break.
Prefer to watch? Here is the short version in about three and a half minutes. The full write-up continues below.
This article is the full write-up: the tasks, the results, the one failure that repeated across twenty attempts, what each model class is actually worth paying for, and the architecture that ends the either-or. We ran the comparison inside our own harness, which supports local and cloud models behind the same workflow, so switching engines is a configuration change rather than a rewrite. Any harness with that property will let you reproduce this. Spoiler for the impatient: the honest answer is not a winner. It is an architecture.
The test: five tasks every data team recognizes
- Messy CSV cleanup. Three date formats, three currency formats, region names typed five different ways, duplicates, blanks.
- Aggregation. A monthly revenue summary by region from 200 orders.
- Join plus business logic. Customers and orders in separate files, churn flags, customer segments, including the customers with zero orders that trip up naive joins.
- Nested JSON flattening. An API export with empty arrays and missing branches, flattened to line items.
- Structured workflow generation. Produce a valid JSON pipeline spec, read to filter to join to aggregate to write, from a plain-English request.
Passing means matching the independently computed answer exactly. Not "looks right." Exactly.
Finding one: open models have gotten genuinely good, fast
We ran two generations of open-weight models, both on a single desktop machine, no GPU cluster: a 2024-era 14B model and a 2026-era 30B coding model, both quantized, both served locally.
| Open model, 2024 generation | Open model, 2026 generation | |
|---|---|---|
| Passed on the first try | 3 of 5 | 4 of 5 |
| Passed with a verification loop | 5 of 5 | 5 of 5 |
| Typical generation time | 10–24 seconds | 4–8 seconds |
Two generations apart, first-shot accuracy improved, speed tripled, and, this is the part that matters, both reached five out of five once a verification loop was wrapped around them: execute the generated code, feed the error or the output difference back, retry. The 2026 model also passed a production-style run end to end: raw CSV in, a four-page executive PDF report out, every figure verified against the source data to the dollar, generated in about a minute on local hardware.
Finding two: what open models still cannot do alone
The hardest task, the messy CSV, exposed a consistent ceiling, and it is worth being precise about it, because it is exactly where frontier models earn their keep.
The failure was not syntax. It was diagnosing subtle library behavior from evidence. One example: the most popular data library in the world silently converts the string "NA", as in North America, into a missing value when it reads a file. Every open model we tested, across more than twenty attempts in multiple configurations, stared at the evidence of this bug without ever finding the root cause. Told explicitly what to fix, they fixed it instantly. The coding ability was there. The diagnostic knowledge was not.
We also tested the fashionable idea of agentic self-improvement: the local model critiquing its own failed code and rewriting its own instructions in a loop. It never converged. Eight rounds of fixing symptoms, regressing on earlier fixes, and drifting off protocol. With a frontier model writing that same repair feedback instead, the local worker converged in three rounds. The frontier model's advantage was not in writing the code. It was in knowing why the code was wrong.
Two harness changes mattered more than any model change, and both are free:
- Keep the conversation history. A stateless retry loop oscillated forever. The identical model, with its full history of attempts and errors in context, converged in four rounds. If your pipeline retries model calls, send the history.
- Prescriptive prompts beat clever feedback. A prompt template that bakes in house rules, explicit format strings, explicit mapping tables, fail loudly on unexpected values, no silent type inference, turned silent data corruption into self-diagnosing errors, and turned the retry loop into a straight line.
The criteria that actually decide it
Privacy and data residency. This one is binary, not a spectrum. If the data legally cannot leave your network, claims data, patient records, customer financials, the frontier API is off the table for that workload regardless of quality, and a local model that passes verification beats a better model you are not allowed to use. If the data can leave, the frontier providers offer enterprise controls, retention policies, no-training commitments, regional processing, that satisfy many compliance teams. We covered the sharper edges of that in keeping sensitive data out of third-party LLMs.
Accuracy. Frontier models are meaningfully stronger at first-shot correctness on hard, novel problems, and at diagnosis, the "why is this failing" reasoning the open models lack. But for the common majority of data work, our tests say the gap closes to zero inside a verification harness: both classes of model passed the same checks. The corollary cuts both ways. Unverified output from either class is not enterprise-grade, and verified output from either class is.
Cost per token. Current published prices per million tokens, as of this writing:
| Model | Input | Output |
|---|---|---|
| Claude Opus 4.8 | $5.00 | $25.00 |
| Claude Sonnet 5 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
| Gemini 2.5 Flash | $0.30 | $2.50 |
| Open model on your hardware | $0 marginal | $0 marginal |
The honest framing: per-token prices are small, and for low-volume design work the frontier bill is trivial. Our entire test campaign would have cost a few dollars. Local is not free either: you pay in hardware, electricity, and operations attention. The economics flip at high volume, thousands of runs a day, or at per-seat scale, where zero marginal cost compounds. And they flip hardest when a model runs on every request instead of once at design time, which is the trap we worked through in the real cost of AI at runtime.
Latency and operations. The frontier API wins on burst throughput and needs no capacity planning. A local model on modest hardware served our tasks in four to eight seconds each, more than adequate for design-time work, but someone has to own the server. Local inference has become dramatically easier, an afternoon rather than a quarter, but it is not zero.
Licensing. If you go local, prefer permissively licensed weights, Apache 2.0 among others. When procurement asks whether you can use this commercially, on-premise, with no strings, that is the clean answer.
The conclusion: determinism gets you both worlds
Here is the pattern that dissolved the either-or for us. Most enterprise data work does not need a model at runtime. It needs a model once, at design time, to write the transformation, the report generator, the pipeline. After that, the correct move is to freeze the verified code and run it deterministically forever: same input, same output, no model in the path, no per-run token cost, nothing to drift or hallucinate on run ten thousand. That is the same argument we made for reporting in are BI tools dead, and it applies to every task in our test suite.
Once you adopt that architecture, frontier versus local stops being an identity and becomes a per-task routing decision:
- Frontier model at design time when the problem is hard or novel, when you want that diagnostic reasoning, and when the task description and sample data are shareable. You pay cents, once.
- Local model at design time when the data cannot leave the building. Our results say that with history-preserving retries and prescriptive prompts, it handles the common cases to the same verified standard.
- Neither at runtime. The frozen code runs on your infrastructure either way.
The verification harness is what makes this safe, and it is model-agnostic by construction. Ours caught the same bugs whether the code came from a 2024 open model, a 2026 open model, or a frontier API. That is also your insurance policy against a fast-moving market: when a better model ships next quarter, you swap the engine and keep the guardrails, the workflows, and the audit trail. This is how Build Studio is built, and it is why the model dropdown is a configuration detail there rather than a commitment.
Bottom line
Do not pick a side. Pick a verification-first architecture, route each task to the cheapest model that passes verification under your data constraints, and freeze what passes. The models will keep changing. That decision framework will not.
If you want to see the pattern on your own data, build your first workflow against a task you already know the right answer to. That last part is the whole trick: always test AI against work you can check.
Methodology notes: five tasks with independently computed ground truth; generated code executed and graded automatically; local models run quantized on a desktop-class machine; retry loops capped at eight rounds; "pass" means exact agreement with precomputed answers. Pricing as published by Anthropic and Google at the time of writing; check the current pages before budgeting.