How I Turned an Old Mining Rig into a Local LLM Server with llama.cpp, Vulkan, and Three RX 5700 XTs
A lot of local AI content online assumes you are starting with a shiny workstation, a giant NVIDIA card, and a power supply that could probably run a space heater.
That was not my situation.
I picked up an older mining rig with three AMD RX 5700 XTs, a riser setup, and just 16 GB of RAM. It was cheap, weird, and exactly the kind of machine that makes you think, "This is either going to be awesome or a complete waste of time."
Naturally, I chose awesome.
The goal was simple: turn this old mining box into a local LLM server that could handle extraction and summarization workloads without sending anything to a hosted provider. I wanted to run llama.cpp, use Vulkan so the AMD cards would actually matter, expose an OpenAI-compatible API, and eventually front the whole thing with a load balancer that could queue requests when all GPUs were busy.
The path from "cheap mining rig" to "working inference box" was not simple.
Starting with the Hardware I Had, Not the Hardware I Wanted
The machine came with three RX 5700 XTs, which is not the most common path for local inference builds. Most guides assume CUDA. Most examples assume NVIDIA. Most "easy mode" tooling assumes the same thing.
But the cards were there, the price was right, and I wanted to see how far I could push AMD hardware before giving up.
The first problem was obvious almost immediately: the machine only had 16 GB of RAM.
That was enough to boot Linux, build software, and convince myself I was making progress. It was not enough to comfortably run multiple model servers, Docker containers, larger context windows, and a whole queueing layer without the box starting to feel unstable.
So before the software architecture was even settled, I already had my first real infrastructure lesson:
GPU VRAM is only part of the story. Host RAM still matters a lot.
Compiling llama.cpp with Vulkan
Since I was working with AMD cards, Vulkan was the path that made the most sense.
I built llama.cpp with Vulkan support and started testing models from the CLI. Once that worked, I moved to llama-server because I needed an API surface my other applications could call.
That part felt promising right away. llama-server gave me what I wanted:
- a local API
- OpenAI-compatible routes
- the ability to expose a single endpoint to other services
- a path to multi-GPU experimentation
At this point, the dream was alive.
I had a machine with three GPUs and a model server that could run on AMD without begging for CUDA.
The First Confusing Failure: Context Size
My first real stumble was not GPU-related at all. It was context.
I was testing Qwen models and hit an error where the request token count exceeded the available context size. At first, it looked ridiculous because the prompt I was sending was tiny.
It turned out the problem was that I was forcing a smaller context window than the model expected. Between the model's chat template and the way requests were being wrapped, my "small prompt" was not as small as I thought.
That was a useful reminder that with llama.cpp, the request you see is not always the request that actually gets tokenized.
So I bumped the context size and kept going.
The Second Confusing Failure: The Server Was Running, But I Could Not Reach It
Then I hit networking.
I started llama-server, tried to access it from another machine, and nothing worked. I checked ports, checked firewalls, checked Docker, checked UFW, and kept moving in circles.
The issue ended up being embarrassingly simple: I was curling the wrong IP.
The server was bound correctly, but I was aiming requests at an address that did not actually belong to the box. Once I checked the real LAN IP and bound the server to 0.0.0.0, everything made a lot more sense.
That led to another lesson that sounds obvious in retrospect:
A correctly running service is still unreachable if you are talking to the wrong machine.
Docker Was Convenient—Until It Wasn't
Once I had the server working on the host, I wanted it to survive terminal disconnects and be easy to manage. That led me into Docker and Docker Compose.
In theory, this was a perfect use case:
- containerize the server
- mount the model directory
- pin GPUs
- bring everything up declaratively
In practice, the first few attempts were rough.
I ran into model download issues inside containers, cache behavior I did not fully trust, and some networking confusion while trying to make the containers visible to other services. I also learned quickly that just because a container is "up" does not mean the model inside it has actually finished loading and is ready to answer requests.
The cleanest improvement I made was to stop relying on model downloads at startup.
Instead of pointing the server at a Hugging Face repo every time, I moved to a local GGUF file and mounted that into the container. That removed a whole class of avoidable startup failures.
That one change made the system feel much more like infrastructure and much less like an experiment.
Incident #1: Splitting One Bigger Model Across All Three GPUs
My first serious design was: one larger model, all three cards, one server.
On paper, it sounded elegant.
In practice, it was disappointing.
The model did load across multiple GPUs, but active compute was still heavily biased toward one card at a time. The other GPUs often looked like they were mostly acting as memory holders rather than equal participants in the work. I could see model data in VRAM on multiple cards, but that did not translate into the balanced utilization I expected.
The machine was doing multi-GPU inference.
It just was not doing the kind of multi-GPU inference I had imagined.
That was the point where I stopped thinking in terms of "one server using three GPUs" and started thinking in terms of "three independent servers, one per GPU."
Incident #2: The RAM Ceiling
With only 16 GB of RAM, the box started to get ugly whenever I tried to run too many services at once.
SSH became unreliable. Containers would come up and make the machine feel half-dead. I spent time wondering whether Docker networking had somehow broken the host before realizing the more likely explanation was much simpler:
I was just running out of memory.
This was the real turning point in the project.
I upgraded the machine's memory from the original 16 GB configuration to a larger, faster kit, and suddenly the whole system stopped fighting me.
After the upgrade, the machine had enough breathing room for:
- multiple model servers
- Docker Compose
- larger context sizes
- HAProxy
- monitoring
- and basic system sanity
The upgrade did not magically double token speed, but it absolutely transformed stability.
Looking back, that RAM upgrade mattered more than almost anything else I did.
Why I Moved to Three Smaller Independent Servers
Once I stopped trying to force one larger model across all three GPUs, the architecture got much cleaner.
Instead of one model spanning the whole box, I ran three separate llama-server instances, each pinned to a single RX 5700 XT.
That gave me:
- simple mental model
- simple failure domains
- predictable resource usage
- one active inference per card
- easy routing
I experimented with different Qwen variants and quantizations, and eventually landed on a smaller model family that fit the hardware much more naturally. Rather than chasing theoretical maximum quality with a model that pushed every limit at once, I wanted something that could run reliably, repeatedly, and without making the host feel like it was about to collapse.
That was the difference between a demo and a usable system.
Choosing the Quant: Why Smaller Was Better
One of the most practical lessons in this build was that picking a model is not enough. You also have to pick the right quant.
For this machine, the quant choice had direct consequences for:
- VRAM headroom
- context length
- system stability
- startup behavior
- and how many replicas I could run comfortably
I looked at higher-quality variants, but the sweet spot ended up being the one that let me keep enough headroom for real runtime behavior instead of just barely fitting weights into memory.
That is a pattern I would recommend to anyone building on older consumer GPUs:
Do not optimize for "largest model that technically fits." Optimize for "model that still leaves room for everything else."
Adding HAProxy in Front of the Model Servers
Once I had three backends, I needed a front door.
I did not want my application choosing ports manually. I did not want client code trying to figure out which GPU-backed server was busy. I wanted a single endpoint and simple behavior:
- if a card is free, use it
- if all cards are busy, wait
- do not overload a backend
- do not start multiple inferences on one card at the same time
HAProxy turned out to be a great fit for that.
I configured each backend with a connection limit of one active request and put HAProxy in front to distribute traffic. That gave me a single API endpoint for the application while preserving one-inference-per-card behavior behind the scenes.
The best part was the queueing.
When all three backends were busy, HAProxy did not throw the request away. It queued it. When a backend freed up, the next request moved forward.
That was exactly what I wanted.
Testing the Queue
Once the queueing layer was in place, I tested it the only way that really matters: by sending more work than the system could execute immediately.
The results were exactly what I hoped for.
HAProxy kept the front door open, the backends stayed healthy, and requests waited instead of piling onto a single GPU. The stats clearly showed queue time, total request time, and distribution across the three backends.
That was the moment the build stopped feeling like a pile of individually working pieces and started feeling like an actual service.
Incident #3: Thinking Mode
There was one more thing I had to solve before the setup felt production-worthy: reasoning mode.
Qwen 3.5 was technically answering correctly, but it was also returning reasoning_content in the API response. That meant even a simple prompt like "Reply with only: ok" could come back with a chain-of-thought-style trace attached.
I did not want that behavior in downstream extraction pipelines.
The fix was straightforward once I found the right knob: include chat template settings in the request and disable thinking explicitly.
After that, the responses got much smaller, much cleaner, and much more appropriate for structured automation.
That mattered more than it sounds. When you are feeding another application, "correct but verbose" is often worse than "short and controlled."
What the Final System Looks Like
By the time everything settled down, the architecture looked like this:
- older mining rig chassis
- three RX 5700 XTs on risers
- upgraded system memory
- Linux host
llama.cppbuilt with Vulkan- three separate
llama-servercontainers - one GPU per server
- HAProxy in front
- one public endpoint
- queued traffic when all cards are busy
- thinking disabled for clean API responses
It was not the simplest path to a local LLM setup.
But it was cheap, fun, flexible, and fast enough to be useful.
What I Learned
If I had to boil the whole project down into a few lessons, it would be these:
1. AMD can absolutely work
It is not the most documented path, but llama.cpp with Vulkan made this build viable.
2. Host RAM matters more than people admit
GPU VRAM gets all the attention, but host memory can make or break the system.
3. Bigger is not always better
A smaller model that runs cleanly across three independent servers can be more useful than a bigger one awkwardly stretched across the entire box.
4. Queueing matters
A model server is not just about raw throughput. It is also about what happens when demand exceeds capacity.
5. Reliability beats theoretical maximums
The best setup was not the one that looked most impressive on paper. It was the one that stayed up, answered cleanly, and behaved predictably under load.
Conclusion
I started with an old mining rig, three AMD cards, and a lot of uncertainty about whether the whole idea was even worth the trouble.
Now I have a local inference box that:
- runs real models
- exposes a clean API
- distributes work across GPUs
- queues requests intelligently
- and does it all on hardware that used to live a completely different life
That is one of the things I like most about this kind of project.
You do not always need the newest machine. Sometimes you just need enough patience to figure out what the machine you already have is actually good at.