Don't trust the cloud: when local-first AI is the right answer

Most AI consultancies build you a wrapper around an OpenAI key. Six months in, you're locked in, your data lives in their logs, and your costs scale with every user you add. Here's when local-first is the right call, when it isn't, and the cost math that surprised us when we worked it out properly.

A friend of mine runs operations at a 200-person law firm. Last year they signed a $12,000-per-month contract with an AI vendor for a "secure private cloud" version of a popular legal-AI product. This year their account manager called to let them know that, due to model-provider pricing changes, the same usage would now cost $19,500 per month. Take it or leave it.

That's the problem with renting your AI. Eventually the rent goes up.

The pitch you've heard a thousand times

Most AI consultancies will sell you a system that works like this. They wire your app to OpenAI (or Anthropic, or Google). They put a thin abstraction over the API call so the model is "swappable." They use one of three popular vector databases. They deploy on AWS or GCP. The whole thing is up in six weeks. It works pretty well at first.

Then a few things happen, on a timeline that varies but always arrives:

The model provider changes their pricing. Not always up (sometimes down, in fact), but usually with a structural shift that breaks your forecasted unit economics. The "swappable" abstraction turns out to be leaky: switching providers requires re-tuning prompts, re-validating outputs, re-running evals, and the engineering team you don't have.

The model provider deprecates the model you've built on. They give you nine months. You spend three of them in denial, three more re-evaluating, and three frantically migrating before the cutoff date.

Your usage grows in a way that surprises everyone. What started as "a hundred queries a day" becomes "twenty thousand queries a day" because the feature became popular. Your bill grows in proportion. Sometimes more than in proportion, because heavier users tend to write longer prompts.

Compliance starts asking questions. "Where does this data go? Whose servers? What's the retention policy? Can you prove deletion? What happens if there's a breach at the model provider?" Each of these is answerable, but the answers depend on someone else's policies, not yours.

None of these are reasons to never use a hosted model. They are reasons to think about it carefully before you build your foundation on rented land.

The local-first alternative, plainly

"Local-first AI" means: the model that does the work runs on hardware you control — your servers, your client's servers, on-premise, in a colocation facility, anywhere that isn't a model vendor's API endpoint. The model itself is open-weight (Llama, Mistral, Qwen, BGE, others) so you can run it without anyone's permission and you can keep running the same version forever if you want.

The data never leaves your network. The model can't be deprecated underneath you. Your unit economics are dominated by capital cost, not variable cost.
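
As a sketch of what that looks like in practice: the snippet below assumes an Ollama server running on your own hardware with an open-weight model already pulled (the port is Ollama's default; the model name is an illustrative choice, not the only option). The point is the shape of the dependency: a local port, no API key, no third party in the request path.

```python
import json
import urllib.request

# Assumes a local Ollama server with an open-weight model pulled.
# No API key, no vendor SDK, no data leaving the network.
OLLAMA_URL = "http://localhost:11434/api/generate"

def local_generate(prompt: str, model: str = "llama3.3") -> str:
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(local_generate("Summarise this clause in one sentence: ..."))
```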

This is not a new idea. It's how most software was built before SaaS won the platform wars. We're not arguing against the cloud — we're arguing for choosing the right deployment model for the workload.

When local-first is the right answer

In our experience, the local-first decision becomes obvious in five situations:

1. The data is sensitive in a regulated way.

Legal documents, medical records, financial data, government records, anything covered by HIPAA, SOC 2, GDPR data residency, or industry-specific equivalents. Once compliance is in the conversation, the cost of getting a vendor approved often exceeds the cost of building the thing yourself. We've seen 12-week vendor approval cycles that ended in "no." Twelve weeks is most of a build.

2. Your unit economics make per-token pricing prohibitive.

If your product makes ten LLM calls per active user per day, and you have a hundred thousand active users, you're at a million LLM calls a day. Pick your favorite frontier model and do the math. For most B2B SaaS, this is the moment local-first goes from "nice idea" to "the only sustainable answer." The provider's bill is non-negotiable. The price your customers will pay you is.
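
A back-of-the-envelope version of that math, as a sketch; every number below is an illustrative assumption, not any provider's actual price sheet:

```python
# Per-token serving cost at scale. All inputs are placeholders;
# substitute your provider's current rates and your real traffic.
users = 100_000
calls_per_user_per_day = 10
tokens_per_call = 1_500            # prompt + completion, assumed average
price_per_1k_tokens = 0.005        # USD, placeholder blended rate

calls_per_day = users * calls_per_user_per_day        # 1,000,000
tokens_per_day = calls_per_day * tokens_per_call
daily = tokens_per_day / 1_000 * price_per_1k_tokens
print(f"{calls_per_day:,} calls/day -> ${daily:,.0f}/day, ${daily * 365:,.0f}/year")
# 1,000,000 calls/day -> $7,500/day, $2,737,500/year
```

Even if the placeholder rate is off by an order of magnitude in your favor, the bill scales linearly with users forever.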

3. You need version stability.

Some products — legal tools, medical decision support, regulated reporting — need the model to behave the same way next year as it does this year. You cannot have the AI vendor silently update your "GPT-4" to a new variant that subtly changes outputs. Open-weight models you've pinned to a specific commit don't have this problem.
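
One way to get that stability, sketched with the Hugging Face hub's revision pinning (the repo id is a real open-weight model; the commit hash is a placeholder you would record at validation time):

```python
from huggingface_hub import snapshot_download

# Download an immutable snapshot of the weights. Pinning `revision`
# to an exact commit hash means nobody can change the model under you.
weights_dir = snapshot_download(
    repo_id="BAAI/bge-m3",
    revision="abc123...",  # placeholder; use the exact hash you validated
)
```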

4. The data is large and the queries are batchy.

Document ingestion, log analysis, periodic reporting — anything that generates big bursts of LLM traffic against your own data. Per-token pricing for batch workloads gets ugly fast. We did the math on the CBIC RAG ingest: at frontier-model embedding prices, the one-time ingest of a national-scale legal corpus would have cost more than the entire rig.
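
The shape of that math, with placeholder numbers rather than the actual CBIC figures:

```python
# One-time embedding cost of a large-corpus ingest at hosted prices.
# Both numbers are illustrative placeholders, not the CBIC figures.
corpus_tokens = 50_000_000_000       # assumed corpus size
price_per_1m_tokens = 0.13           # USD, placeholder hosted embedding rate

print(f"one ingest pass: ${corpus_tokens / 1e6 * price_per_1m_tokens:,.0f}")
# one ingest pass: $6,500 -- and ingest is rarely one-time: every new
# chunker or embedder re-runs it. Locally, a re-run is electricity.
```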

5. You want to actually understand what's happening.

This one's softer but real. With your own model on your own hardware, you can profile, trace, evaluate, and improve in ways that an HTTP API simply doesn't permit. The system is observable end-to-end. You can swap retrievers, tune chunkers, fine-tune the model on your domain. You own the stack.

When it's not the right answer

To be honest about it: local-first is the wrong call in three situations.

You need cutting-edge frontier capability. If your product literally requires GPT-5-level reasoning today, no open-weight model will match it. Build on a frontier API for that capability, but isolate the dependency so you can swap when the open-weight ecosystem catches up (which it does, predictably, about 12-18 months later).

The traffic is genuinely sporadic and small. If you have a hundred queries a day, the operational cost of running your own infrastructure exceeds the API bill. Use the API.

You don't have anyone who can operate the infrastructure. Local-first means owning the operational burden. If your team isn't equipped for it (and doesn't want to learn), the API is the right call regardless of cost. The raw economics may look worse, but the alternative is a system nobody can keep running.

The cost math that surprised us

Here's the comparison we ran for our own studio inference workload — the kind of mix we'd expect a mid-size B2B AI feature to have. Five thousand embedding calls per day, fifteen hundred chat completions per day, sustained for one year.

| Item | Cloud (frontier APIs) | Local-first (our rig) |
|---|---|---|
| Hardware (one-time) | near-zero | low-thousands |
| Embedding calls (year) | low-thousands | near-zero |
| Chat completions (year) | mid-four-figures | near-zero |
| Electricity (year) | near-zero | low-three-figures |
| Operational time (estimate) | low-four-figures | low-four-figures |
| Year 1 total | mid-four-figures | mid-four-figures |
| Year 2 total | mid-four-figures | low-four-figures |

Two things stand out. First, the local-first option wins in year one even at modest scale — not by much, but it wins. Second, year two is where local-first separates from the pack: hardware is paid off, only running costs remain. By year three, the cumulative cost gap is roughly 2×.

And these numbers understate the local-first advantage at scale. We assumed flat usage. Real businesses grow. Cloud bills grow with usage. Hardware bills don't.

Caveats on the math: we're using our actual costs for our actual hardware, in our region, with our power prices. Your numbers will differ. But the shape of the curve — capital cost up front, then near-zero marginal cost — is the same regardless of region.

What to do about it

If you're building or commissioning an AI feature in 2026, three concrete suggestions:

Run the math on your own usage assumptions, including growth. Don't trust vendor calculators. Build a spreadsheet with cells for "queries per user per day," "users in years 1 through 3," and "model cost per 1k tokens," and fill them with your own assumptions. See where local-first crosses cloud for your numbers. It might be earlier than you think.
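
A minimal version of that spreadsheet as a script; every input is an assumption to replace with your own numbers:

```python
# Cumulative cloud vs local cost under growth. All inputs are placeholders.
queries_per_user_per_day = 10
tokens_per_query = 1_500
cloud_price_per_1k_tokens = 0.005    # USD, placeholder
hardware_one_time = 15_000           # USD, placeholder rig cost
local_running_per_year = 3_000       # electricity + ops, placeholder
users_by_year = [500, 2_000, 8_000]  # placeholder growth curve

cloud_cum, local_cum = 0.0, float(hardware_one_time)
for year, users in enumerate(users_by_year, start=1):
    tokens = users * queries_per_user_per_day * tokens_per_query * 365
    cloud_cum += tokens / 1_000 * cloud_price_per_1k_tokens
    local_cum += local_running_per_year
    print(f"year {year}: cloud ${cloud_cum:,.0f} vs local ${local_cum:,.0f}")
# year 1: cloud $13,688 vs local $18,000
# year 2: cloud $68,438 vs local $21,000
# year 3: cloud $287,438 vs local $24,000
```

With these placeholders the crossover lands in year two; with your numbers it may land earlier or never. That's the point of running it.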

Architect for portability from day one. Whatever you build first, make sure the model is behind an interface that can swap. Even if you start with a hosted API for speed, design so a local model can drop in later. The cost is small at the start; the cost of not doing this is large six months in.
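
One way to cut that seam, as a minimal sketch. The hosted call shape follows the OpenAI v1 Python SDK; the local side accepts any function that maps prompt to text (for instance the Ollama sketch above). Class and function names are illustrative:

```python
from typing import Callable, Protocol

class ChatModel(Protocol):
    """The single seam the rest of the product is allowed to touch."""
    def complete(self, prompt: str) -> str: ...

class HostedChat:
    """Day one: a thin wrapper over a hosted API (OpenAI v1 SDK shape)."""
    def __init__(self, client, model: str):
        self._client, self._model = client, model
    def complete(self, prompt: str) -> str:
        resp = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

class LocalChat:
    """Later: the same interface, weights on your own hardware."""
    def __init__(self, generate_fn: Callable[[str], str]):
        self._generate = generate_fn
    def complete(self, prompt: str) -> str:
        return self._generate(prompt)

def answer(model: ChatModel, question: str) -> str:
    # Product code depends only on the Protocol, never on a vendor SDK.
    return model.complete(question)
```

The discipline that matters is the last function: nothing outside the adapters imports a vendor SDK.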

Get serious about open-weight quality. The open-weight ecosystem is much further along than the marketing of frontier vendors would have you believe. BGE-M3 is a world-class embedder. Qwen3, Llama 3.3, Mistral Large — these models are production-grade for most tasks that don't require the absolute frontier of capability. Try them on your workload before you assume you need a hosted model.
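
Trying them is cheap. A smoke test for the embedder, assuming the sentence-transformers package is installed (BGE-M3's dense vectors are 1024-dimensional):

```python
from sentence_transformers import SentenceTransformer

# Load the open-weight embedder and embed a domain query locally.
# First run downloads the weights; pin a revision for production.
model = SentenceTransformer("BAAI/bge-m3")
vecs = model.encode(
    ["notice period for terminating a commercial lease"],
    normalize_embeddings=True,
)
print(vecs.shape)  # (1, 1024): dense vectors ready for your vector store
```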

The studio's own choice

This is what we run: the seven-GPU rig in our basement, an open-weight embedder, and an open-weight synthesis LLM. Open weights, on hardware we own, with no per-token bill and no vendor whose price increase determines our margin.

It's not because we're ideologues. It's because we ran the math, and the math said this was the right answer for the work we do. For some clients we still recommend cloud APIs — when the workload is small, the data isn't sensitive, and the team can't operate infrastructure. For most of the rest, local-first wins.

If you're sitting with a quote from an AI consultancy and the recommended architecture is "we'll wire you to the OpenAI API" — ask them what happens to your monthly bill at 10× usage. Ask them what happens if the model is deprecated. Ask them whether the data leaves your network. Then come back here and we can have the conversation about the alternative.


If you'd like a clear-eyed read on whether local-first is right for your situation — or whether the cloud is genuinely the better answer for you — that's the kind of conversation our 30-min calls are for. Book one.