Resources / Insights

Claude, Gemini, or an Open Model on Your Own Hardware?

Scoreboard comparing a frontier model via cloud API against an open-source model on your own hardware across the same five enterprise data tasks: the frontier side is strongest on hard tasks and diagnoses subtle bugs but data leaves your network and tasks cost cents, the local side is strong on common tasks with zero marginal cost and data never leaves your network but needs help diagnosing subtle bugs. Both finish five out of five verified inside the same verification harness, which freezes approved code so no model runs at runtime

We ran a benchmark. Five real enterprise data tasks, correct answers computed in advance by independent code, every result graded to the cent. On one side, frontier models like Claude and Gemini over a cloud API. On the other, an open-source model running on a single desktop machine, no GPU cluster, no internet required.

The headline numbers: the open model passed four of the five tasks on the first try, five of five inside a verification loop, and turned a raw CSV into a verified four-page executive report in about a minute, on hardware that never sent a byte outside the building. And yet the frontier models kept an advantage the open model could not close on its own, one that shows up exactly where production systems break.

Prefer to watch? Here is the short version in about three and a half minutes. The full write-up continues below.

This article is the full write-up: the tasks, the results, the one failure that repeated across twenty attempts, what each model class is actually worth paying for, and the architecture that ends the either-or. We ran the comparison inside our own harness, which supports local and cloud models behind the same workflow, so switching engines is a configuration change rather than a rewrite. Any harness with that property will let you reproduce this. Spoiler for the impatient: the honest answer is not a winner. It is an architecture.

The test: five tasks every data team recognizes

Passing means matching the independently computed answer exactly. Not "looks right." Exactly.

Finding one: open models have gotten genuinely good, fast

We ran two generations of open-weight models, both on a single desktop machine, no GPU cluster: a 2024-era 14B model and a 2026-era 30B coding model, both quantized, both served locally.

Open model, 2024 generation Open model, 2026 generation
Passed on the first try 3 of 5 4 of 5
Passed with a verification loop 5 of 5 5 of 5
Typical generation time 10–24 seconds 4–8 seconds

Two generations apart, first-shot accuracy improved, speed tripled, and, this is the part that matters, both reached five out of five once a verification loop was wrapped around them: execute the generated code, feed the error or the output difference back, retry. The 2026 model also passed a production-style run end to end: raw CSV in, a four-page executive PDF report out, every figure verified against the source data to the dollar, generated in about a minute on local hardware.

Finding two: what open models still cannot do alone

The hardest task, the messy CSV, exposed a consistent ceiling, and it is worth being precise about it, because it is exactly where frontier models earn their keep.

The failure was not syntax. It was diagnosing subtle library behavior from evidence. One example: the most popular data library in the world silently converts the string "NA", as in North America, into a missing value when it reads a file. Every open model we tested, across more than twenty attempts in multiple configurations, stared at the evidence of this bug without ever finding the root cause. Told explicitly what to fix, they fixed it instantly. The coding ability was there. The diagnostic knowledge was not.

We also tested the fashionable idea of agentic self-improvement: the local model critiquing its own failed code and rewriting its own instructions in a loop. It never converged. Eight rounds of fixing symptoms, regressing on earlier fixes, and drifting off protocol. With a frontier model writing that same repair feedback instead, the local worker converged in three rounds. The frontier model's advantage was not in writing the code. It was in knowing why the code was wrong.

Two harness changes mattered more than any model change, and both are free:

The criteria that actually decide it

Privacy and data residency. This one is binary, not a spectrum. If the data legally cannot leave your network, claims data, patient records, customer financials, the frontier API is off the table for that workload regardless of quality, and a local model that passes verification beats a better model you are not allowed to use. If the data can leave, the frontier providers offer enterprise controls, retention policies, no-training commitments, regional processing, that satisfy many compliance teams. We covered the sharper edges of that in keeping sensitive data out of third-party LLMs.

Accuracy. Frontier models are meaningfully stronger at first-shot correctness on hard, novel problems, and at diagnosis, the "why is this failing" reasoning the open models lack. But for the common majority of data work, our tests say the gap closes to zero inside a verification harness: both classes of model passed the same checks. The corollary cuts both ways. Unverified output from either class is not enterprise-grade, and verified output from either class is.

Cost per token. Current published prices per million tokens, as of this writing:

Model Input Output
Claude Opus 4.8$5.00$25.00
Claude Sonnet 5$3.00$15.00
Claude Haiku 4.5$1.00$5.00
Gemini 2.5 Pro$1.25$10.00
Gemini 2.5 Flash$0.30$2.50
Open model on your hardware$0 marginal$0 marginal

The honest framing: per-token prices are small, and for low-volume design work the frontier bill is trivial. Our entire test campaign would have cost a few dollars. Local is not free either: you pay in hardware, electricity, and operations attention. The economics flip at high volume, thousands of runs a day, or at per-seat scale, where zero marginal cost compounds. And they flip hardest when a model runs on every request instead of once at design time, which is the trap we worked through in the real cost of AI at runtime.

Latency and operations. The frontier API wins on burst throughput and needs no capacity planning. A local model on modest hardware served our tasks in four to eight seconds each, more than adequate for design-time work, but someone has to own the server. Local inference has become dramatically easier, an afternoon rather than a quarter, but it is not zero.

Licensing. If you go local, prefer permissively licensed weights, Apache 2.0 among others. When procurement asks whether you can use this commercially, on-premise, with no strings, that is the clean answer.

The conclusion: determinism gets you both worlds

Here is the pattern that dissolved the either-or for us. Most enterprise data work does not need a model at runtime. It needs a model once, at design time, to write the transformation, the report generator, the pipeline. After that, the correct move is to freeze the verified code and run it deterministically forever: same input, same output, no model in the path, no per-run token cost, nothing to drift or hallucinate on run ten thousand. That is the same argument we made for reporting in are BI tools dead, and it applies to every task in our test suite.

Once you adopt that architecture, frontier versus local stops being an identity and becomes a per-task routing decision:

The verification harness is what makes this safe, and it is model-agnostic by construction. Ours caught the same bugs whether the code came from a 2024 open model, a 2026 open model, or a frontier API. That is also your insurance policy against a fast-moving market: when a better model ships next quarter, you swap the engine and keep the guardrails, the workflows, and the audit trail. This is how Build Studio is built, and it is why the model dropdown is a configuration detail there rather than a commitment.

Bottom line

Do not pick a side. Pick a verification-first architecture, route each task to the cheapest model that passes verification under your data constraints, and freeze what passes. The models will keep changing. That decision framework will not.

If you want to see the pattern on your own data, build your first workflow against a task you already know the right answer to. That last part is the whole trick: always test AI against work you can check.

Methodology notes: five tasks with independently computed ground truth; generated code executed and graded automatically; local models run quantized on a desktop-class machine; retry loops capped at eight rounds; "pass" means exact agreement with precomputed answers. Pricing as published by Anthropic and Google at the time of writing; check the current pages before budgeting.

Back to Resources