Picking an inference runtime for production

The runtime that's optimised for the demo is not the runtime you should ship with. A short note on why we left the friendly default behind, and what we learned about matching the tool to the workload.

Most local-AI tutorials start the same way. Install one tool, pull a model, type a question, watch the answer stream back. Five minutes, working chatbot, you're sold.

The trouble is what happens next. You commit. You write the integration. You build a real workload on top of the friendly default — and somewhere between the proof-of-concept and the production run, the runtime that made the demo so easy starts costing you in places the demo never touched.

/ The thesis

A runtime tuned for the chat-window demo is not the same runtime you should put under a batch pipeline. They look the same on day one. They diverge violently the moment the workload changes shape.

This is a quick field note about what that divergence looks like, and the mode-shift we made when we hit it.

Two shapes of work

It helps to draw the line clearly. There are two very different things people mean when they say “we're running an LLM in production.”

/ Shape A

Interactive

One human, one question, one answer. Latency to the first token is everything because somebody is staring at the screen. Throughput is almost irrelevant — there's only ever one query in flight per user. Most demos live here. Most product surfaces live here.

/ Shape B

Batch

Nobody is waiting. The job is to push thousands of documents through an embedder, score a corpus against a reranker, or run an evaluation sweep overnight. Throughput is everything. First-token latency on any single request is irrelevant — what matters is total time to finish the queue. The right answer is high concurrency, predictable batching, and a runtime that doesn't lie about whether the GPU is actually busy.

An interactive runtime is built around a chat session: keep a model warm, hold a conversation, stream tokens to a UI. Batch is the opposite — you want to hand the runtime a queue and have it stay busy. The same model weights can serve either; the engine wrapped around them cannot.

If you commit to an interactive runtime because the chat demo was friendly, then ask it to absorb a corpus-ingest workload, you discover that “friendly” was not the same word as “fast.” Convenience features that make sense for a chat session — model auto-unloading, single-process serialisation, hidden defaults that prefer responsiveness over throughput — quietly become the bottleneck. None of that shows up in the demo. All of it shows up in the batch.
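
To make that concrete, here is a rough sketch of the serialised pattern, written against a generic HTTP embedding endpoint. The URL, payload shape, and model name are illustrative assumptions rather than any specific runtime's API; the point is only the shape of the loop: one document, one round trip, one request in flight.

```python
import requests

# Illustrative assumptions: a local OpenAI-style embeddings endpoint and a
# placeholder model name. Swap in whatever your runtime actually exposes.
URL = "http://localhost:8000/v1/embeddings"
MODEL = "some-embedder"

docs = ["doc one ...", "doc two ...", "doc three ..."]  # stand-in for the corpus

embeddings = []
for doc in docs:
    # One HTTP round trip per document: only one request is ever in flight,
    # so the GPU sits idle between calls and total time is dominated by
    # per-request overhead, not by compute.
    resp = requests.post(URL, json={"model": MODEL, "input": doc}, timeout=60)
    resp.raise_for_status()
    embeddings.append(resp.json()["data"][0]["embedding"])  # response shape assumed
```

Multiply that loop by a corpus and the throughput collapse stops being mysterious.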

The mode-shift

The friendly runtime was the obvious choice when we were prototyping. The first time we put a real workload through it — corpus-scale, no human in the loop, every document needing a fingerprint — we watched throughput collapse and the GPU utilisation hover at numbers that made no sense for the size of the queue we were feeding it.

So we left.

The fix wasn't a different vendor or a different model. It was dropping a layer. We stopped talking to a server-with-a-chat-UI-on-top and started loading the same open-weight models in-process, with the runtime configured for the workload we actually had: deterministic batches, controlled concurrency, predictable memory residency, no auto-management, no helpful defaults we hadn't asked for.
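
A minimal sketch of what “dropping a layer” can look like, assuming an embedding workload. The library (sentence-transformers) and the model name are stand-ins, not the exact stack from this note; the shape is what matters: load the weights once, keep them resident, and hand the whole queue to one call with an explicit batch size.

```python
from sentence_transformers import SentenceTransformer

docs = ["doc one ...", "doc two ...", "doc three ..."]  # stand-in for the corpus

# Load once and keep the model resident for the whole run: no auto-unloading,
# no per-request model lookup. Use device="cpu" if there is no GPU available.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")

# One call over the whole queue, with a batch size chosen for the hardware
# rather than for a chat session. The runtime packs documents into GPU-sized
# batches and stays busy until the corpus is done.
embeddings = model.encode(
    docs,
    batch_size=256,
    convert_to_numpy=True,
    show_progress_bar=True,
)
```

The weights could be identical in both sketches; the difference is that nothing in this path is trying to be helpful about unloading, routing, or per-request responsiveness.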

/ The shift

From a runtime optimised for “one user, one chat, one model in memory” to a runtime optimised for “a queue, a fleet, and nobody waiting.” Same weights. Different engine. Order-of-magnitude difference.

The interactive runtime is still the right tool — for the workload it was designed for. We use it in the places where a single human is on the other end of the connection. We just don't use it as the substrate for batch work, because that was never what it was built to be good at.

The runtime that makes the demo look effortless is rarely the runtime that makes production look effortless. They optimise for different days.

The lesson, abstracted

This isn't really a story about one runtime versus another. It's a story about a class of mistake that's almost too easy to make: picking a tool because the introduction was smooth, then resenting the tool when it turns out to be a bad fit for the work you ended up doing.

/ The rule

Match the tool to the workload, not to the demo.

It generalises. The most popular database is not always the right database for a write-heavy ledger. The most quoted message bus is not always the right one for low-latency control traffic. The most beloved web framework is not always the right one for a streaming-heavy service. In every one of these cases, the default is popular for a reason — usually because it makes the easiest version of the problem easy. The trouble is that the version of the problem you actually have is rarely the easiest one.

The discipline is to ask, before you commit, what the workload looks like in steady state — not in the first ten minutes. Single user or many? Latency-bound or throughput-bound? One process or a fleet? Predictable load or spiky? Then pick the tool that's good at that, even if its onboarding is rougher.

It usually is rougher. Tools that are honest about their workload tend to be less ingratiating on day one. They expect you to know what you're doing. That's a feature, in production. It's a bug only in the demo.


If you're picking an inference stack for a workload that doesn't look like a chat demo — corpus ingest, periodic batch evals, a fleet of small models behind one router — that's the kind of design conversation our 30-min calls exist for. Book one.