Most AI products we get asked to inherit have one thing in common: an LLM call where there should be a function. Somebody, at some point, decided to ask the model to score a candidate, or rank a search result, or compute a tax estimate, or pick a workflow branch. The product worked, sort of. Then it went to production and the same input started returning different answers on different days, and the support tickets started, and nobody could quite explain what was happening.
This is a solved problem. The solution is unsexy: don't put AI in the place where you needed code. The discipline that makes this easy to do consistently is the one we want to talk about here.
The shape of an honest AI system
Look at any of the production systems the studio runs and you will find the same shape. The AI is at the start of the pipeline, where the input is messy — a free-form document, a half-structured conversation, a paragraph someone typed at midnight. The AI's job is to turn that into structured data. After that, the structured data flows through deterministic code: scoring, ranking, deduping, formatting, dispatching.
The job agent is one example. The candidate uploads a CV. An LLM extracts the structured fields — companies, durations, claims, key skills. From that point on, every decision the system makes — does this job match, what's the score, which jobs to surface, what to write in the cover-letter scaffold — is deterministic, versioned, repeatable. The same CV against the same job listing scores the same number every time, no matter who runs it or when.
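A minimal sketch of that boundary in Python. The field names and the trivial scorer are illustrative, not the agent's actual schema; the point is which function is allowed to be non-deterministic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateProfile:
    skills: frozenset[str]
    years_experience: float

@dataclass(frozen=True)
class JobPosting:
    required_skills: frozenset[str]
    min_years: float

def extract_profile(cv_text: str) -> CandidateProfile:
    """The AI step, and the only one: free-form CV text in, structured fields out.
    In practice this wraps an LLM call that returns JSON matching the schema."""
    raise NotImplementedError("the LLM lives here, and nowhere downstream")

def score(profile: CandidateProfile, job: JobPosting) -> float:
    """Deterministic from here on: same profile, same job, same score, every run."""
    overlap = len(profile.skills & job.required_skills) / max(len(job.required_skills), 1)
    experience = min(profile.years_experience / job.min_years, 1.0) if job.min_years else 1.0
    return round(0.7 * overlap + 0.3 * experience, 3)
```

Everything below `extract_profile` can be unit-tested, versioned, and replayed.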
The QA harness is another. The fixture state is loaded by deterministic code. The verification checks that follow each test step are deterministic — element exists, value equals, network call returned 200. The AI's job is upstream: read a story, generate the test scenarios that exercise it. The scenarios are then frozen as code.
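The deterministic floor looks roughly like this in code; the helper names and the checkout example are made up for illustration, not the harness's real API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str = ""

def value_equals(name: str, actual: object, expected: object) -> CheckResult:
    return CheckResult(name, actual == expected, f"expected {expected!r}, got {actual!r}")

def returned_200(name: str, status_code: int) -> CheckResult:
    return CheckResult(name, status_code == 200, f"status {status_code}")

# A scenario an LLM proposed and a human approved, now frozen as plain code.
# `observed` stands in for whatever state the harness captured after the step.
def run_checkout_checks(observed: dict) -> list[CheckResult]:
    return [
        value_equals("order total", observed["total"], "$42.00"),
        returned_200("payment call", observed["payment_status"]),
    ]
```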
The shop-booking flow is the same shape. The customer scans a QR code; the routing is deterministic. The mechanic's screen shows the next car; that ordering is deterministic. The assistant the mechanic can ask "what's the labour estimate on a 2018 Civic clutch replacement" — that's an AI call, because the input is a free-form question. The thing the mechanic types into a structured field is not.
The obvious point
Deterministic systems are auditable. AI systems are not — at least, not in the same way. A function that takes a candidate record and returns a score is something a human can read, reason about, change deliberately, and version. A prompt that takes the same record and returns a number is a black box that has feelings about the weather.
This matters because most useful software exists to make a defensible decision. The score that surfaces the right job to the right person. The match that pairs the customer with the right mechanic. The retrieval that returns the right paragraph from the right document. If you cannot explain — to your user, to a regulator, to yourself at three in the morning — why the system produced the answer it did, you have built something fragile.
The honest version of this argument has a corollary that the AI-everywhere crowd doesn't always say out loud: most of the time, the LLM is not actually adding value. It's adding the appearance of intelligence to a step that already worked, and pricing risk into a step that didn't need it.
The subtle point
The subtle point is the one that took us longer to internalise, and it's the part that determines whether your architecture stays clean over time.
AI is most valuable where the input is genuinely ambiguous. That's a tighter rule than "where the work is hard." Lots of work is hard but unambiguous. Computing tax interest on a notice is hard, but the inputs are structured and the formula is in a statute. You don't want an LLM near it. The work is hard for humans because the formula is intricate; for code, it's a function.
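For contrast, here is what the hard-but-unambiguous case looks like as a function. The rate and day-count rule below are placeholders, not any real statute; the shape is what matters: structured inputs in, one number out, the same number every time.

```python
from datetime import date

# Illustrative only: 8% simple annual interest on a 365-day year is an
# assumption for the sketch, not a real statutory rule.
DAILY_RATE = 0.08 / 365

def interest_due(principal: float, assessed_on: date, paid_on: date) -> float:
    days_late = max((paid_on - assessed_on).days, 0)
    return round(principal * DAILY_RATE * days_late, 2)

# interest_due(10_000.00, date(2024, 1, 31), date(2024, 7, 31)) -> 398.90
```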
On the other hand, reading the body of an unstructured customer email and figuring out which of seventeen issue categories it falls into — that's actually ambiguous. There is no formula. Two people would disagree on the boundary cases. The input is genuinely shaped like natural language, and an LLM will outperform any rule-based classifier you could write in a quarter.
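And here is the genuinely ambiguous case, sketched against an OpenAI-style client. The model name, the category list, and the JSON handling are illustrative; the one thing we would insist on is the guard at the end, so the model can only ever emit one of your fixed categories.

```python
import json
from openai import OpenAI

# Illustrative categories; the real system has seventeen.
CATEGORIES = ["refund_request", "delivery_issue", "account_access", "other"]

client = OpenAI()

def classify_email(body: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Classify the customer email into exactly one of these categories: "
                + ", ".join(CATEGORIES)
                + '. Reply as JSON: {"category": "<one of the list>"}.'
            )},
            {"role": "user", "content": body},
        ],
    )
    category = json.loads(response.choices[0].message.content).get("category", "other")
    # Guard: the model is never allowed to invent a new bucket.
    return category if category in CATEGORIES else "other"
```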
The discipline, then, is to ask of every step in the pipeline: is the ambiguity in the input, or am I just being lazy about writing code? If the answer is honest — "the ambiguity is real, the input genuinely is free-form text, no rule-based system would do this" — then AI is the right tool. If the answer is "I just don't want to write the rules", then AI is the wrong tool, and you'll find out the day it returns a different score on the same input.
"AI is for genuinely ambiguous inputs. Everything else is deterministic, or you're paying for a stochastic model where a function would do."
Where this shows up in our work
We can name a few specific decisions, abstracted enough to be portable:
The job-agent scorer is not an LLM. The LLM extracts the candidate profile and the job posting into structured fields. The scorer that combines them is a rubric in code, with named factors and weights. It is versioned. The same candidate-job pair returns the same score this week as next week. When a hiring manager asks "why did you surface this job," there is an answer with names and numbers.
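Concretely, "a rubric in code" means something like the sketch below: named factors, explicit weights, a version string, and an explanation you can hand to the hiring manager. The factor names and weights are invented for illustration.

```python
RUBRIC_VERSION = "2024-06-v3"  # illustrative version tag

WEIGHTS = {
    "skill_overlap": 0.5,
    "seniority_fit": 0.3,
    "location_fit": 0.2,
}

def score_match(factors: dict[str, float]) -> dict:
    """factors: each named factor already normalised to 0..1 by deterministic code."""
    contributions = {name: round(WEIGHTS[name] * factors[name], 3) for name in WEIGHTS}
    return {
        "rubric_version": RUBRIC_VERSION,
        "score": round(sum(contributions.values()), 3),
        "why": contributions,  # the answer to "why did you surface this job"
    }
```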
The retrieval results aren't ranked by an LLM. The pipeline uses an embedder, a deterministic similarity calculation, a cross-encoder reranker, and a final groundedness check by an LLM. The judgment moment — "is this evidence actually about the question?" — is where the LLM lives. The ranking arithmetic upstream is not.
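The upstream arithmetic fits in a dozen lines of numpy. Assuming pre-computed embeddings, it looks roughly like this; the cross-encoder rerank step is omitted for brevity, and the only model judgment is the groundedness check stubbed at the end.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 10) -> list[int]:
    """Deterministic ranking: same vectors in, same order out."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return list(np.argsort(-sims)[:k])

def is_grounded(question: str, passage: str) -> bool:
    """The one judgment call: is this evidence actually about the question?
    In the real system this is an LLM call; it is the only place a model
    belongs in this pipeline, so it is stubbed here."""
    raise NotImplementedError
```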
The shop's daily ordering is deterministic. Which car gets worked on next, which mechanic is on which bay, which customer is overdue for a follow-up — these are all the kind of decisions you'd hate to explain as "the model said so." They're the kind of decisions a manager wants to look at in a table and overrule when needed. So the table exists, and it's built by code.
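What "built by code" means in practice: an explicit sort key the manager can read, argue with, and overrule. The field names here are illustrative.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class WorkOrder:
    car: str
    promised_by: date
    waiting_customer: bool    # customer is in the lobby
    manual_priority: int = 0  # manager override wins over everything else

def daily_order(orders: list[WorkOrder], today: date) -> list[WorkOrder]:
    return sorted(
        orders,
        key=lambda o: (
            -o.manual_priority,            # overrides first
            not o.waiting_customer,        # waiting customers next
            (o.promised_by - today).days,  # then by how soon the car was promised
        ),
    )
```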
The QA scenarios are generated, then frozen. An LLM proposes the test scenarios for a story. A human reviews them. From the moment they're approved, they're code — they run the same way every time, against the same fixtures, with the same assertions. The AI did its job upstream; the floor is deterministic.
None of this is glamorous. None of it is the kind of thing an AI vendor's marketing site would feature. It's also why these systems work in production for users who have real problems and limited patience for surprises.
What the structured-vs-ambiguous test feels like in practice
If you want a quick gut check, the test is: could you write the rules? Not "would it be easy" — could you, given enough time, write a function that takes the input and returns the right answer? If yes, the right tool is the function. If "no, the input is too varied / too natural / too contextual" — the right tool is AI.
"Too varied" is the most common honest answer when AI is genuinely the right call. There are five hundred ways a customer can phrase a refund request. There are infinite ways someone can describe a symptom to their doctor. There are no two CVs structured the same. In those cases, no rule system will keep up. AI's job is to absorb the variation and emit something structured.
"I just don't want to write the rules" is the most common dishonest answer when AI is the wrong call. We've caught ourselves doing it. Refactoring AI out of a system, after the fact, is one of the more common improvements we make on inherited codebases.
The cost-and-failure consequence
This boundary isn't aesthetic. It has direct consequences for cost, speed, and what happens when the system breaks.
Cost. An LLM call is many orders of magnitude more expensive than a function call. If you're paying per token to do something a `for` loop could do, you're paying a tax for the appearance of intelligence. At scale, that tax dominates your unit economics.
Speed. An LLM call is also many orders of magnitude slower than a function call. Code that runs in a millisecond is replaced by an API round-trip that runs in a second, plus the model's own inference latency. For systems where humans are waiting, that's the difference between a tool people use and a tool they don't.
Debuggability. When a deterministic system gets the wrong answer, you can trace it. The bug is in a line of code; you fix it; you write a test; it stays fixed. When an AI system gets the wrong answer, you can change the prompt and hope. We have spent more debugging hours on prompt regressions than on any other category of bug, and we're not unique.
What we tell clients
If a vendor or a consultancy is showing you an architecture diagram that is mostly LLM calls — every box on the diagram is a different prompt — ask them where the deterministic layer is. If they say "we put everything through the model for flexibility," that's the answer that sounds smart and isn't. The flexibility you actually want is the kind that comes from owning the structure: changing a rubric weight is easy and explainable; changing a prompt is a dice roll.
What you're looking for, instead, is an architecture where the AI calls are clearly marked, clearly few, clearly justified by a genuinely-ambiguous input — and where everything around them is code you can read. That's what the studio's systems look like. It's the shape that survives contact with production.
If you're inheriting an AI system and the architecture diagram has too many model calls in it — that's the kind of conversation our 30-min calls are for. Book one.