The previous era's mining rigs are this era's AI rigs.

A stack of used graphics cards in a basement runs production AI for everything we ship. Bought during the crypto-mining boom, written off long before the transformer wave. Old by GPU standards. Productive by AI standards. Here's what we learned getting them there.

An editorial sketch of a small home-built GPU rig: a few cards in a frame with cabling visible.
A small home-built rig in the studio's voice.

A few years ago, AMD's high-end consumer graphics cards were going for $700 to $1,000 each, and the people buying them at that price weren't gamers. They were buying by the rack, hooking them up on mining frames with riser cables fanning out, earning back the price in cryptocurrency on a timeline that — at the time — made the math work.

Then the math stopped working. Mining returns collapsed. The cards, in basements and garages everywhere, mostly stopped paying for themselves and started gathering dust. Ours didn't. They got rerouted onto a different problem.

"What was bought as a depreciating mining asset is now running production AI for legal-grade retrieval, daily-run job-search agents, and our other client builds. The cards are old by GPU standards. They're still doing real work, and the depreciation is mostly behind them."

This is the story of the rerouting. The hardware bet, the runtime choice, the six operational lessons it took us months to learn — and why the same pattern can work for any team with a regulated workload that can't go to a third-party API.

What the rig actually does today

An on-prem fleet of cards, each playing one of three roles (embed pool, reranker, synthesis), coordinated by a custom shell.
Many cards. One shell. Three roles.
A terminal-style rig status surface showing live throughput, per-card load, the model loaded on each card, power draw and rolling logs.
The studio's control surface over the fleet.

It's a stack of AMD consumer graphics cards from the Navi-10 era, plus one bigger card with more VRAM, mounted on a mining frame, hooked into a Linux box, sitting on a fixed address on the studio LAN. From the outside it's an HTTP endpoint. From the inside it does three jobs that every retrieval-augmented system needs.

/ role 01

Embedder pool

A multi-card cluster that turns documents and questions into numerical fingerprints, so a question can be matched semantically to the passages most likely to answer it. This is the bulk workload during ingestion — and the line item that dominates the AI bill of any organisation working at corpus scale.
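
To make that concrete, here is a minimal sketch of the embed-and-match step. The endpoint address, payload shape, and helper names are placeholders for illustration, not the rig's actual interface.

```python
# Sketch only: the URL and JSON fields are assumptions, not the real service.
import numpy as np
import requests

EMBED_URL = "http://rig.lan:8080/embed"  # hypothetical embedder-pool endpoint

def embed(texts: list[str]) -> np.ndarray:
    """Ask the embedder pool for one vector per input text."""
    resp = requests.post(EMBED_URL, json={"texts": texts}, timeout=30)
    resp.raise_for_status()
    return np.asarray(resp.json()["vectors"], dtype=np.float32)

def top_k(question: str, passages: list[str], k: int = 5) -> list[str]:
    """Rank passages by cosine similarity to the question."""
    vecs = embed([question] + passages)
    q, p = vecs[0], vecs[1:]
    q = q / np.linalg.norm(q)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    scores = p @ q
    order = np.argsort(-scores)[:k]
    return [passages[i] for i in order]
```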

/ role 02

Reranker

A smaller, sharper model whose only job is to look at a question paired with each candidate passage and decide which passages are actually the right ones for this specific question. Runs on its own card so it doesn't fight the embedder for resources.

/ role 03

Synthesis & classifier

The largest model in the fleet. Composes the answer once the right passages have been retrieved, classifies queries into routing categories, and judges whether a question is answerable from the retrieved evidence — refusing cleanly when it isn't. Lives on the one card with enough VRAM to hold it.

A request from any client project enters as an HTTP call, fans out to the right card for each step, and returns an answer with citations — usually in a few seconds, all of it on hardware in a single home setup. CBIC RAG, the Pharma CI Job Agent, OpenClaw, the studio's own concierge — none of them has ever sent a query to a third-party LLM vendor.
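
In code, the fan-out looks roughly like this. Everything below is an illustrative sketch: the URLs, payload shapes, field names, and thresholds are assumptions, not the shell the studio actually runs.

```python
# Sketch of how one request fans out across the three roles.
import requests

RIG = "http://rig.lan:8080"  # stands in for the rig's fixed LAN address

def answer(question: str) -> dict:
    # 1. Embedder pool: turn the question into a vector, pull candidate passages.
    hits = requests.post(f"{RIG}/search",
                         json={"query": question, "k": 20}, timeout=30).json()

    # 2. Reranker card: keep only the passages that answer *this* question.
    reranked = requests.post(f"{RIG}/rerank",
                             json={"query": question, "passages": hits["passages"]},
                             timeout=30).json()
    evidence = reranked["passages"][:5]

    # 3. Synthesis card: judge answerability first, then compose with citations.
    synth = requests.post(f"{RIG}/synthesize",
                          json={"query": question, "evidence": evidence},
                          timeout=120).json()
    if not synth["answerable"]:
        return {"answer": None, "refusal": "Not supported by the retrieved evidence."}
    return {"answer": synth["answer"], "citations": synth["citations"]}
```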

Match the runtime to the card

The most counter-intuitive thing about running a fleet of mixed-era consumer GPUs is that the official, vendor-blessed AI stack is not the right answer for all of them.

The bigger card runs cleanly on the AMD-supported AI stack — it's close enough to the lineage of the cards the vendor sells to data centres that the official path works. We use it.

The older cards are a different generation. On those cards the official AI stack either failed to install cleanly against the kernel driver, or installed and then silently fell back to the CPU under load — producing shape-correct outputs that looked fine in logs while the indexes they were writing were useless. We caught it because we were checking GPU memory residency, not because the logs ever complained.

The fix wasn't a different vendor. It was descending one layer further: running inference through the underlying graphics API — the one designed for video games — as the compute backend instead. We set a Mesa driver flag in the systemd unit, picked an open-source inference runtime that talks to that API directly, and the cards have been productive ever since.

"Different cards, different runtimes. The bigger card is happy on the AI stack. The older cards run cleanly on the graphics-API path. We treat each card as its own thing, keep a per-card configuration table, and stop fighting the runtime to fit one stack."

Six lessons from running this rig

An idle, working, idle lifecycle for the fleet: cards rest until needed, warm explicitly, take traffic, drain, and rest again.
Idle. Working. Idle. The mechanical discipline that pays the rig back.

Every one of these started as a bug, an outage, or a moment of "wait, that can't be right." Each one is now codified — with a mechanical guard somewhere that prevents us from forgetting.

Match the runtime to the card

When a vendor's "supported" stack and a community's "we tested this on the actual hardware" stack disagree, the second one is often more honest. On heterogeneous consumer cards, "supported" means "supported on the hardware the vendor sold to data centres." Treat each card as its own thing, keep a per-card tuning table, let the routing layer read from it.
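
A per-card table can be as plain as a dictionary the router reads at startup. A sketch, with card IDs, runtime names, model names, and the environment flag all standing in as hypothetical placeholders:

```python
# Illustrative per-card configuration table. Every value is a placeholder;
# the point is that routing reads from data, not from one hard-coded stack.
from dataclasses import dataclass, field

@dataclass
class CardConfig:
    role: str                                 # "embed", "rerank", or "synthesis"
    runtime: str                              # which inference backend this card uses
    model: str                                # model the card is pinned to
    env: dict = field(default_factory=dict)   # runtime-specific env overrides

FLEET = {
    "card0": CardConfig("synthesis", "rocm",   "synth-model-13b"),
    "card1": CardConfig("embed",     "vulkan", "embed-model-small",
                        env={"EXAMPLE_MESA_FLAG": "1"}),  # hypothetical flag
    "card2": CardConfig("embed",     "vulkan", "embed-model-small",
                        env={"EXAMPLE_MESA_FLAG": "1"}),
    "card3": CardConfig("rerank",    "vulkan", "rerank-model-mini"),
}

def cards_for(role: str) -> list[str]:
    """Let the routing layer pick cards by role instead of by assumption."""
    return [name for name, cfg in FLEET.items() if cfg.role == role]
```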

Codified is not implemented

A rule written into a doc is a rule until the next session, the next assistant, or the next refactor silently undoes it. The fix is mechanical: a sentinel value at the top of a file that aborts pre-flight if missing. A configuration profile that pre-flight refuses to deviate from. A startup script that exits if the wrong number of cards is detected. The doc says what should be true; the script enforces it. There is no other reliable way to keep a long-running build honest.
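
A guard like the card-count check fits in a dozen lines. This is a sketch under assumptions: the expected count, the sentinel string, and the profile path are invented for illustration.

```python
# Mechanical pre-flight guard: refuse to start if the fleet doesn't match the doc.
import sys
from pathlib import Path

EXPECTED_CARDS = 7
SENTINEL = "# DO-NOT-EDIT: fleet-profile-v3"    # hypothetical sentinel line
PROFILE = Path("/etc/rig/fleet_profile.conf")   # hypothetical config path

def count_gpus() -> int:
    # On Linux, each visible GPU exposes a DRM render node under /dev/dri.
    return len(list(Path("/dev/dri").glob("renderD*")))

def main() -> None:
    found = count_gpus()
    if found != EXPECTED_CARDS:
        sys.exit(f"pre-flight: expected {EXPECTED_CARDS} cards, found {found}")
    if SENTINEL not in PROFILE.read_text():
        sys.exit("pre-flight: fleet profile missing its sentinel line; refusing to start")

if __name__ == "__main__":
    main()
```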

Multi-GPU is throughput, not latency

The single most common confusion when founders evaluate a rig. Hearing "we have a seven-GPU fleet" makes them assume single-query answers come back seven times faster. They don't. More cards means more parallel work — more queries served per second across the fleet. A single answer still depends on one card running one model end-to-end. If single-query latency is the bottleneck, the fix is a smaller model, a shorter output, or a different architecture. Not more silicon. We re-learned this three separate times before codifying it.
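
The arithmetic, with made-up numbers:

```python
# Throughput scales with cards; single-query latency does not.
# All numbers are illustrative placeholders.
cards = 7
latency_per_query_s = 4.0                              # one card, one model, end to end

single_query_latency_s = latency_per_query_s           # still ~4 s with 7 cards
fleet_throughput_qps = cards / latency_per_query_s     # ~1.75 queries per second

print(single_query_latency_s, fleet_throughput_qps)
```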

Pre-flight checks beat green logs

The most expensive class of bug on this rig has been the "everything reports success and the output is silently wrong" class. The most vivid case: an inference server came up healthy, returned shape-correct vectors with the right dimensions, logged 200 OK on every request — and was producing all-near-zero vectors because it had silently fallen back to CPU. Logs lie when the failure mode is shape-correct, content-broken. Every stage now asserts what it produced, not just that it ran. Dimension and norm checks on embeddings. Content checks against a known-good fixture. A GPU-memory check that confirms the model is actually resident in VRAM. Trust the assertion, not the log.
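
A pre-flight assertion along those lines, sketched with assumed dimensions, thresholds, and fixture text:

```python
# "Assert what it produced, not that it ran." Values below are assumptions.
import numpy as np

EXPECTED_DIM = 768    # whatever the embedding model is supposed to emit
MIN_NORM = 0.1        # all-near-zero vectors suggest a silent CPU/driver fallback
FIXTURE = "The quick brown fox jumps over the lazy dog."

def check_embedder(embed) -> None:
    """Run the known-good fixture through the embedder before it takes traffic."""
    vec = np.asarray(embed([FIXTURE])[0], dtype=np.float32)
    assert vec.shape == (EXPECTED_DIM,), f"wrong dimension: {vec.shape}"
    assert np.linalg.norm(vec) > MIN_NORM, "near-zero vector: model likely not on the GPU"
    assert np.isfinite(vec).all(), "NaN/Inf in embedding output"
    # A separate pre-flight step also confirms the model is resident in VRAM
    # via the vendor's GPU tooling; that check lives outside this sketch.
```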

Cost math is real but operational overhead is real too

For sporadic, unpredictable workloads, frontier APIs are the correct answer. For steady-state corpus inference — embedding a regulatory corpus, ingesting a document store, running periodic batch reports — per-token pricing becomes the dominant cost line within months, and a rig of depreciated mining cards pays for itself faster than most spreadsheets predict. The naïve "$X / month on cloud vs. $Y for hardware" comparison misses the other side of the ledger: cold-load discipline, per-card tuning, fleet management. The math still works at corpus scale; it just requires honesty about the operational overhead too.
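
The shape of the comparison, with every number a deliberately made-up placeholder rather than a quote from any vendor or any real invoice:

```python
# Back-of-envelope only. The structure of the comparison is the point.
tokens_per_month = 10_000_000_000        # steady-state corpus embedding + batch runs
api_price_per_million_tokens = 0.10      # hypothetical per-token pricing
hardware_cost = 3_000.0                  # depreciated used cards + frame + PSU
power_and_ops_per_month = 150.0          # electricity plus the operational overhead

api_cost_per_month = tokens_per_month / 1_000_000 * api_price_per_million_tokens
months_to_break_even = hardware_cost / (api_cost_per_month - power_and_ops_per_month)
print(api_cost_per_month, months_to_break_even)
```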

The question never leaves your network

Even if cost were equal, there is a structural reason to keep LLMs local. Every query you send to a third-party AI vendor is a query they could log, train on, expose in a breach, or be subpoenaed for. For a tax practitioner asking about a specific client, a recruiter scoring a candidate's resume, a law firm looking up a regulation on a live matter — the questions themselves are sensitive, not just the answers. Local-first is the only architecture where the question never leaves the network it originated on. We run this way for that reason, and it's what we help clients set up for their own confidentiality-bound workloads.

Why we don't sell rigs — we help you build yours

This rig is the studio's capability proof, not a thing we sell. We're not in the hardware business. We don't ship racks. We don't take a margin on graphics cards.

What we do is help teams who've decided — for the data-sovereignty reason, the cost reason, or both — set up their AI on hardware they own, in a way that actually works in production. The cards are commodity. The runtime is open-source. The operational discipline — the per-card tuning table, the mechanical pre-flight guards, the role-pinning, the dashboard — is what's hard to copy, and it's the same whether you're running a tax-law system, a clinical-evidence retriever, a confidential recruiting tool, or any other regulated workflow. We've collected the bruises. If your team would rather not collect them too, that's the conversation worth having.