I've been kicking this around in my head for a while. If I want to run LLMs locally, a decent GPU is really the only important thing. At that point, the question becomes, roughly, what is the cheapest computer to tack on the side of the GPU? Of course, that assumes that everything does in fact work; unlike OP I am barely in a position to understand e.g. BAR problems, let alone try to fix them, so what I actually did was build a cheap-ish x86 box with a half-decent GPU and call it a day :) But it is still stuck in my brain: there must be a more efficient way to do this, especially if all you need is just enough computer to shuffle data to and from the GPU and serve that over a network connection.
We're not yet to the point where a single PCIe device will get you anything meaningful; IMO 128 GB of ram available to the GPU is essential.
So while you don't need a ton of compute on the CPU, you do need the ability to address multiple PCIe lanes. A relatively low-spec AMD EPYC processor is fine if the motherboard exposes enough lanes.
There is plenty that can run within 32/64/96gb VRAM.
IMO models like Phi-4 are underrated for many simple tasks.
Some quantized Gemma 3 are quite good as well.
There are larger/better models as well, but those tend to really push the limits of 96gb.
FWIW when you start pushing into 128gb+, the ~500gb models really start to become attractive because at that point you’re probably wanting just a bit more out of everything.
IDK all of my personal and professional projects involve pushing the SOTA to the absolute limit. Using anything other than the latest OpenAI or Anthropic model is out of the question.
Smaller open source models are a bit like 3d printing in the early days; fun to experiment with but really not that valuable for anything other than making toys.
Text summarization, maybe? But even then I want a model that understands the complete context and does a good job. Even things like "generate one sentence about the action we're performing" I usually find I can just incorporate it into the output schema of a larger request instead of making a separate request to a smaller model.
It seems to me like the use case for local GPUs is almost entirely privacy.
If you buy a 15k AUD rtx 6000 96GB, that card will _never_ pay for itself on a gpt-oss:120b workload vs just using openrouter - no matter how many tokens you push through it - because the cost of residential power in Australia means you cannot generate tokens cheaper than the cloud even if the card were free.
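For anyone who wants to plug in their own numbers, here is a rough sketch of the "even if the card were free" comparison. It only looks at electricity cost per million tokens. Every parameter (power draw, tariff, throughput, cloud price) is a placeholder you would have to measure or look up yourself; none of the values come from this thread.

```python
def energy_cost_per_million_tokens(watts, price_per_kwh, tokens_per_second):
    """Electricity cost of generating one million tokens locally.

    All inputs are assumptions supplied by the caller: average wall power
    in watts, tariff in currency per kWh, sustained generation throughput.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    kwh_per_second = watts / 1000 / 3600
    cost_per_token = kwh_per_second * price_per_kwh / tokens_per_second
    return cost_per_token * 1_000_000


def local_beats_cloud(watts, price_per_kwh, tokens_per_second,
                      cloud_price_per_million):
    """True if electricity alone is cheaper than the cloud price.

    Ignores hardware cost entirely, i.e. treats the card as free.
    """
    local = energy_cost_per_million_tokens(watts, price_per_kwh,
                                           tokens_per_second)
    return local < cloud_price_per_million
```

Note that throughput dominates the result: batching many requests raises tokens per second far more than it raises power draw, so the answer can change a lot between single-user and batched workloads.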
> because the cost of residential power in Australia
This doesn't really matter to your overall point, which I agree with, but:
The rise of rooftop solar and home battery energy storage flips this a bit now in Australia, IMO. At least where I live, every house has a solar panel on it.
Not worth it just for local LLM usage, but an interesting change to energy economics IMO!
- You can use the GPU for training and run your own fine tuned models
- You can have much higher generation speeds
- You can sell the GPU on the used market in ~2 years time for a significant portion of its value
- You can run other types of models like image, audio or video generation that are not available via an API, or cost significantly more
- Psychologically, you don’t feel like you have to constrain your token spending and you can, for instance, just leave an agent to run for hours or overnight without feeling bad that you just “wasted” $20
- You won’t be running the GPU at max power constantly
This is simply not true. Your heuristic is broken.
The recent Gemma 3 models, which are produced by Google (a little startup - heard of em?) outperform the last several OpenAI releases.
Closed does not necessarily mean better. Plus the local ones can be finetuned to whatever use case you may have, won't have any inputs blocked by censorship functionality, and you can optimize them by distilling to whatever spec you need.
Anyway all that is extraneous detail - the important thing is to decouple "open" and "small" from "worse" in your mind. The most recent Gemma 3 model specifically is incredible, and it makes sense, given that Google has access to many times more data than OpenAI for training (something like a factor of 10 at least). Which is of course a very straightforward idea to wrap your head around: Google was scraping the internet for decades before OpenAI even entered the scene.
So just because their Gemma model is released in an open-source (open weights) way, doesn't mean it should be discounted. There's no magic voodoo happening behind the scenes at OpenAI or Anthropic; the models are essentially of the same type. But Google releases theirs to undercut the profitability of their competitors.
This problem was already solved 10 years ago - crypto mining motherboards, which have a large number of PCIe slots, a CPU socket, one memory slot, and not much else.
> Asus made a crypto-mining motherboard that supports up to 20 GPUs
Those only gave each GPU a single PCIe lane though, since crypto mining barely needed to move any data around. If your application doesn't fit that mould then you'll need a much, much more expensive platform.
In theory, it’s only sufficient for pipeline parallel due to limited lanes and interconnect bandwidth.
Generally, scalability on consumer GPUs falls off between 4-8 GPUs for most.
Those running more GPUs are typically using a higher quantity of smaller GPUs for cost effectiveness.
Not sure what was unexpected about the multi GPU part.
It's very well known that most LLM frameworks, including llama.cpp, split models by layers, which creates a sequential dependency, and so multi GPU setups are completely stalled unless there are n_gpu users/tasks running in parallel. It's also known that some GPUs are faster at "prompt processing" and some at "token generation", so combining Radeon and NVIDIA cards reportedly helps sometimes. Reportedly the inter-layer transfer sizes are in kilobyte ranges and PCIe x1 is plenty or something.
To get around that, it takes appropriate backends with "tensor parallel" mode support, which splits the neural network parallel to the direction of data flow, and which also obviously benefits substantially from a good interconnect between GPUs, like PCIe x16 or NVLink/Infinity Fabric bridge cables, and/or inter-GPU DMA over PCIe (called GPU P2P or GPUDirect or some lingo like that).
Absent those, I've read somewhere that people can sometimes see GPU utilization spikes walking over GPUs on nvtop-style tools.
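A toy model of that "walking" pattern, in case it helps. This is an idealized sketch, not how llama.cpp or any real backend schedules work: it assumes the layers are split evenly, each GPU takes one tick per token, and transfers are free. The function names are made up for illustration.

```python
def active_gpu_timeline(n_gpus, n_tokens):
    """Which GPU is busy at each tick for ONE request under a layer split.

    Each token has to pass through GPU 0, then GPU 1, and so on, so only
    one GPU works at a time and the activity "walks" across the devices.
    """
    return [gpu for _ in range(n_tokens) for gpu in range(n_gpus)]


def layer_split_utilization(n_gpus, concurrent_requests):
    """Idealized average GPU utilization with a layer (pipeline) split.

    With fewer in-flight requests than GPUs, the remaining GPUs sit idle.
    """
    if n_gpus < 1 or concurrent_requests < 0:
        raise ValueError("need n_gpus >= 1 and concurrent_requests >= 0")
    return min(n_gpus, concurrent_requests) / n_gpus
```

Under these assumptions a single user on a 4-GPU layer split keeps the GPUs a quarter busy on average, which is the stall described above; it takes about as many parallel requests as GPUs to fill the pipeline.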
Looking for a way to break up tasks for LLMs so that there will be multiple tasks to run concurrently would be interesting, maybe like creating one "manager" and few "delegated engineers" personalities. Or simulating multiple different domains of brain such as speech center, visual cortex, language center, etc. communicating in tokens might be interesting in working around this problem.
> Reportedly the inter-layer transfer sizes are in kilobyte ranges and PCIe x1 is plenty or something.
Not an expert, but napkin math tells me that more often than not this will be on the order of megabytes, not kilobytes, since it scales with sequence length.
Example: Qwen3 30B has a hidden state size of 5120, even if quantized to 8 bits that's 5120 bytes per token. It would pass the MB boundary with just a little over 200 tokens. Still not much of an issue when a single PCIe lane is ~2GB/s.
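The same napkin math as a small sketch, so the numbers are easy to vary. The defaults mirror the figures in the comment above (1 byte per value, roughly 2 GB/s for a single lane); real activations are often 16-bit, which would double the sizes, and this ignores per-transfer latency entirely.

```python
def activation_bytes(hidden_size, n_tokens, bytes_per_value=1):
    """Size of the hidden state handed from one layer group to the next."""
    return hidden_size * n_tokens * bytes_per_value


def transfer_seconds(n_bytes, link_bytes_per_second=2e9):
    """Time to push n_bytes over the link, bandwidth only (no latency)."""
    return n_bytes / link_bytes_per_second


# Prefill moves the hidden state for every prompt token at once,
# while token generation moves one token's worth per step.
prefill = activation_bytes(5120, 2000)
decode_step = activation_bytes(5120, 1)
print(prefill, transfer_seconds(prefill))
print(decode_step, transfer_seconds(decode_step))
```

So the kilobyte figure matches a single generation step and the megabyte figure matches prefill over a long prompt; either way the bandwidth term stays small, which is consistent with latency being the more likely bottleneck.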
I think device to device latency is more of an issue here, but I don't know enough to assert that with confidence.
> Looking for a way to break up tasks for LLMs so that there will be multiple tasks to run concurrently would be interesting, maybe like creating one "manager" and few "delegated engineers" personalities.
This is pretty much what "agents" are for. The manager model constructs prompts and contexts that the delegated models can work on in parallel, returning results when they're done.
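A minimal sketch of that manager/delegate shape using asyncio. The `delegate` function is a stand-in: a real version would call whatever inference server you run, and the role names and the way the task is split are purely hypothetical.

```python
import asyncio


async def delegate(role, subtask):
    """Stand-in for one delegated model call (hypothetical)."""
    await asyncio.sleep(0)  # a real version would await an inference request
    return f"{role}: done {subtask}"


async def manager(task, roles):
    """Split a task into one subtask per role and run them concurrently."""
    subtasks = [f"{task} / part {i}" for i, _ in enumerate(roles, start=1)]
    calls = (delegate(role, sub) for role, sub in zip(roles, subtasks))
    # gather() runs the calls concurrently and returns results in input order
    return await asyncio.gather(*calls)


results = asyncio.run(manager("refactor parser", ["engineer-a", "engineer-b"]))
print(results)
```

The point for multi-GPU setups is just that the concurrent calls give a layer-split backend several requests in flight at once, instead of one request walking through the GPUs alone.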
Datapoints like this really make me reconsider my daily driver. I should be running one of those $300 mini PCs at <20W. With ~flat CPU performance gains, would be fine for the next 10 years. Just remote into my beefy workstation when I actually need to do real work. Browsing the web, watching videos, even playing some games is easily within their wheelhouse.
As an experiment, I decided to try using a Proxmox VM, with an eGPU and a USB bus passed through to it, as my main PC for browsing and working on hobby projects.
It’s just 1 vCPU with 4 GB of RAM, and you know what? It’s more than enough for these needs. I think hardware manufacturers falsely convinced us that every professional needs a beefy laptop to be productive.
Switching from my 8-core ryzen minipc to an 8-core ryzen desktop makes my unit tests run way faster. TDP limits can tip you off to very different performance envelopes in otherwise similar spec CPUs.
I really would have liked to see gaming performance, although I realize it might be difficult to find a AAA game that supports ARM. (Forcing the Pi to emulate x86 with FEX doesn't seem entirely fair.)
Now compare batched training performance. Or batched inference.
Of course prefill is going to be GPU bound. You only send a few thousand bytes to it, and don't really ask it to return much. But after prefill is done, unless you use batched mode, you aren't really using your GPU for anything more than its VRAM bandwidth.
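A back-of-the-envelope version of that bandwidth argument. This is an idealized upper bound that assumes a dense model whose weights are all read once per decoding step, ignores KV-cache reads, and ignores the point where large batches become compute bound; the function name is just for illustration.

```python
def decode_tokens_per_second(weight_bytes, vram_bytes_per_second, batch_size=1):
    """Rough ceiling on generation throughput when decoding is memory bound.

    Each decoding step streams the weights from VRAM once, and that one
    pass produces one new token for every sequence in the batch.
    """
    if weight_bytes <= 0 or batch_size < 1:
        raise ValueError("need weight_bytes > 0 and batch_size >= 1")
    steps_per_second = vram_bytes_per_second / weight_bytes
    return steps_per_second * batch_size
```

In this model the step rate is fixed by bandwidth alone, so batching is close to free throughput until compute becomes the limit, which is why batched inference and training are the more interesting comparison.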
Source code: https://github.com/BinSquare/inferbench
* https://jabberjabberjabber.github.io/Local-AI-Guide/
> Asus made a crypto-mining motherboard that supports up to 20 GPUs
https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...
For LLMs you'll probably want a different setup, with some memory too, some m.2 storage.
Keeps the desk nice and tidy while “the beasts” roar in a soundproofed closet.
That would help in latency-constrained workloads, but I don't think it would make much of a difference for AI or most HPC applications.