A 10 year old Xeon is all you need

(point.free)

61 points | by cafkafk 2 hours ago

12 comments

cafkafk 2 hours ago
Hi HN. I wrote this post after getting frustrated by how few ways there are to run the new Gemma 4 Drafter models, and by mainstream tools not prioritizing this while hiding all the performance levers.
I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.
I've also linked the quants at the end, but they won't run unless you use the ik_llama-cpp fork I mention; see the other posts for more details.
I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions I can try to answer them on a best-effort basis.
- fragmede 2 hours ago
  (purple on black is really hard to read)
  You say it runs "at reading speed". Have you benchmarked it?
  - cafkafk 2 hours ago
    > (purple on black is really hard to read)
    Noted, and I agree (it also looks like a link that has already been clicked, which I dislike). I honestly need to redo the themes.
    > You say it runs "at reading speed". Have you benchmarked it?
    At some point a few weeks ago, yes, I think so, but I didn't write it down for some reason... so I'll have to find a time when the system isn't busy and do it again without the noise. Right now the system is noisy, but that said, running it like this:
    llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128
    Gives:
```
  llama_print_timings:        load time =   83911.65 ms
  llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
  llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
  llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
  llama_print_timings:       total time =   11114.98 ms /   134 tokens
```
    So 11.94 tokens per second while it's also playing binary cache and CI builder.
    When I do it properly, I'll add it to the blog as well!
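    For anyone who wants to sanity-check the figures, here is a minimal Python sketch that recomputes tokens per second from the timing lines above. The regular expression and the function name are illustrative assumptions, not part of llama.cpp or ik_llama.cpp.

```python
import re

# Timing lines copied from the log above.
LOG = """\
llama_print_timings:        load time =   83911.65 ms
llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
llama_print_timings:       total time =   11114.98 ms /   134 tokens
"""

# Matches "<name> time = <ms> ms / <count> runs|tokens".
# The load-time line has no count, so it is skipped.
PATTERN = re.compile(
    r"llama_print_timings:\s+(.+?) time =\s+([\d.]+) ms /\s+(\d+) (?:runs|tokens)"
)


def tokens_per_second(log: str) -> dict[str, float]:
    """Return tokens per second for each timing phase that reports a count."""
    rates = {}
    for name, ms, count in PATTERN.findall(log):
        rates[name] = int(count) / (float(ms) / 1000.0)
    return rates


rates = tokens_per_second(LOG)
for name, rate in rates.items():
    print(f"{name:>12}: {rate:8.2f} tokens/s")
```

    The "eval" entry is the generation rate, and "prompt eval" is prompt processing, which here covers only 7 tokens and so says little about longer prompts.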
    - anon-3988 16 minutes ago
      I am pretty sure llama.cpp has its own benchmarking binary (llama-bench) that you can use.
gigatexal 1 minute ago
What kind of tokens per second did the OP get? I saw nothing written about this.
vhaudiquet 45 minutes ago
The E5 2620-v4 only supports DDR4.
NSUserDefaults 40 minutes ago
How about the iMac Pro? Would that work? I was able to put 128 GB in it (not as easy as the regular iMac, but possible).
- wazoox 14 minutes ago
  I've been running various models on a Mac Pro 2013 (8 cores, 32 GB RAM) at about 8 to 10 t/s for months. It's not fast, but it's more than enough for many actual tasks, in particular background tasks. An iMac Pro will do just as well, I suppose.
asimovDev 1 hour ago
I have an ancient DDR3 Xeon that doesn't support any AVX (dual x5690 and 96GB 1333 MHz RAM). You reckon it would even build / run at all?
- qwertox 44 minutes ago
  CPU (2012)
```
  Model name: Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz
```
  Mainboard
```
  Product Name: P8Z77 WS
```
  GPU
```
  05:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
  05:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)
```
  Memory: 32GB
  This works.
- cafkafk 59 minutes ago
  Loading will take some minutes, but at 96 GB you can squeeze the model in and have roughly 10 GB of headroom, although depending on the Xeon, you may have to downgrade to E4B instead. Should still work, though.
- tgtweak 59 minutes ago
  It may work - depending on your ram speeds it might not even be that much slower.
hypfer 2 minutes ago
> The argument for speculative decoding is stronger on CPU than on GPU.
Uh. Uuuh.
No?
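For anyone weighing the quoted claim, one rough way to reason about speculative decoding on any hardware is the expected number of tokens produced per verification pass of the large model. The sketch below uses the standard simplification that each drafted token is accepted independently with the same probability; the acceptance rates are hypothetical, not measurements from the post, and a draft length of 3 is chosen only because it matches `--draft-max 3` in the command posted elsewhere in this thread.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes every drafted token is accepted independently with probability
    `accept_rate` (a simplification), and counts the one token the target
    model always contributes itself.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical acceptance rates, for illustration only.
for rate in (0.5, 0.7, 0.9):
    print(f"accept={rate:.1f}: {expected_tokens_per_step(rate, 3):.2f} tokens/pass")
```

This ignores the cost of running the drafter and the extra work of verifying several tokens at once, and those two costs are where CPU and GPU setups would differ.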
potus_kushner 2 hours ago
@cafkafk got a recommendation for a good model that fits into 64 GB and leaves a couple of GB free for other tasks?
- cafkafk 1 hour ago
  Honestly, at this point you're probably looking at a smaller model. For the Gemma series I'd go with Gemma 4 E4B with drafters, but that's just a hunch from using it on my laptop (where I do have an RTX 4060 M and 96 GB of RAM).
  So you'd change the invocation slightly here, but you can potentially reuse a lot of it.
  That said, the Gemma 4 E4B models have so far in my experience been... not great when it comes to long context, but they are very passable for basic tasks, and even seem surprisingly okay at tool calls.
bflesch 16 minutes ago
Might consider going for even older CPUs which don't have the Intel ME ring -3 thing which is full of backdoors
hparadiz 23 minutes ago
I'm now staring at a 10 year old 4U with 256 GB of DDR4 and thinking hmmmmm
christkv 1 hour ago
Makes you wonder if it's possible to squeeze more tps out of a Strix Halo system using the 16 Zen 5 cores as well as the GPU.
- Havoc 41 minutes ago
  In general you're memory-bandwidth constrained, so CPU vs GPU often ends up similar on APUs.
- cafkafk 1 hour ago
  If you get the inference engine to route the heavy matrix math to the GPU and the speculative drafting to the CPU without choking on latency it's probably gonna be very fast.
  Would love to see the benchmarks if someone actually pulls something like that off.
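  To put the memory-bandwidth point above into rough numbers: during generation every active weight has to be read from memory once per token, so bandwidth divided by bytes read per token gives a ceiling on tokens per second. The Python sketch below assumes about 4B active parameters (inferred from the "A4B" in the model name) and Q8_0 weights. The bandwidth figures are hypothetical round numbers, not measurements of any system mentioned here.

```python
# ggml's Q8_0 stores 32 int8 weights plus one fp16 scale per block: 34 bytes / 32 weights.
Q8_0_BYTES_PER_WEIGHT = 34 / 32


def decode_tps_ceiling(
    bandwidth_gb_s: float,
    active_params_billion: float,
    bytes_per_weight: float = Q8_0_BYTES_PER_WEIGHT,
) -> float:
    """Upper bound on generated tokens/s if weight reads are the only cost.

    Ignores KV-cache traffic, compute time, and speculative decoding, so
    real throughput without a drafter should land below this number.
    """
    gb_read_per_token = active_params_billion * bytes_per_weight
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical memory bandwidths in GB/s, for illustration only.
for bw in (50, 100, 250):
    print(f"{bw:>4} GB/s -> at most {decode_tps_ceiling(bw, 4.0):.1f} tokens/s")
```

  The ceiling depends only on bandwidth and model size, not on which unit does the math, which is why CPU and GPU on a shared memory bus tend to land close together.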
Eonexus 2 hours ago
I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?
- cafkafk 1 hour ago
  That is a very fair point! I just ran a not very scientific benchmark with the system under load, and posted the raw logs in a sibling comment above, but the short answer is that it's hitting 11.94 tokens per second for generation - while it's also being a binary cache and CI build server.
  Totally just vibes based, I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.
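  The reading-speed conversion can be sketched in a few lines of Python. The ratio of 1.3 tokens per English word is a rule-of-thumb assumption, not a measured property of the Gemma tokenizer, and the function name is made up for illustration.

```python
def reading_speed_tps(words_per_minute: float, tokens_per_word: float = 1.3) -> float:
    """Convert a reading speed in words per minute to tokens per second.

    tokens_per_word is an assumed average; it varies by tokenizer and text.
    """
    return words_per_minute / 60.0 * tokens_per_word


reading_tps = reading_speed_tps(250)
measured_tps = 11.94  # generation rate from the log posted in this thread

print(f"250 wpm is about {reading_tps:.2f} tokens/s")
print(f"measured rate is about {measured_tps / reading_tps:.1f}x that")
```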
  - Eonexus 1 hour ago
    Huh, that's actually not bad at all! Sure, it's not at the speed of a GPU, but still, 20 tps is cromulent for a CPU.
nurettin 27 minutes ago
I also run a Qwen 3.6 MoE A4B on old hardware. I set it up with
numactl --membind=1
so its memory allocations are confined to a single NUMA node (node 1), which speeds up token generation a little.