Local Qwen isn't a worse Opus, it's a different tool

(blog.alexellis.io)

100 points | by alphabettsy 3 hours ago

8 comments

glerk 1 hour ago
If you play with these models long enough, you realize there is more to them than just "model X is smarter than model Y" or "model Y is cheaper than model Z". They are different tools and the prompting technique is different. It is very much like playing an instrument.
With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.
With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.
With Qwen, you have to give it a shape and let it fill it in. Qwen likes XML, JSON and lists. Qwen likes to be shown a bunch of examples of previous work.
This is not scientific at all, just vibes, YMMV.
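To make the "give it a shape and let it fill it in" idea concrete, here is a minimal sketch of a prompt builder that wraps the task in XML-style tags, lists the expected output fields, and shows prior examples as JSON. The function name `build_prompt`, the tag names, and the sample data are all hypothetical illustrations, not something from the article or from any Qwen documentation.

```python
import json
from xml.sax.saxutils import escape


def build_prompt(task: str, examples: list[dict], output_fields: list[str], new_input: str) -> str:
    """Assemble an XML-tagged prompt with JSON few-shot examples.

    Tag names and layout are assumptions chosen for illustration.
    """
    example_blocks = []
    for ex in examples:
        # Each example pairs a raw input with the JSON output we want imitated.
        example_blocks.append(
            "  <example>\n"
            f"    <input>{escape(ex['input'])}</input>\n"
            f"    <output>{escape(json.dumps(ex['output'], sort_keys=True))}</output>\n"
            "  </example>"
        )
    # A plain list of the fields the model should fill in.
    fields = "\n".join(f"  - {name}" for name in output_fields)
    return (
        f"<task>{escape(task)}</task>\n"
        "<output_fields>\n" + fields + "\n</output_fields>\n"
        "<examples>\n" + "\n".join(example_blocks) + "\n</examples>\n"
        f"<input>{escape(new_input)}</input>"
    )


if __name__ == "__main__":
    print(
        build_prompt(
            task="Classify the email",
            examples=[{"input": "Your invoice is due", "output": {"label": "billing"}}],
            output_fields=["label"],
            new_input="Lunch on Friday?",
        )
    )
```

Whether this layout actually helps a given model is, as above, vibes: treat it as a starting shape to experiment with.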
- stingraycharles 1 hour ago
  I agree with your general gist, and in general it’s a matter of “the best tool for the particular job”, keeping token spend and other things in mind as well.
  What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.
  - dv35z 28 minutes ago
    What would it take to have trustworthy benchmarks? As with all "targets", they can be gamed - but I am curious about quantifiable quality metrics.
  - sanderjd 1 hour ago
    I share this sense, but my immediate thought is that we need to improve the evaluations! Do you think this is impossible? That there is something indelible that it is not possible to capture empirically? I kind of have this intuitive sense that it is this way, but simultaneously I think that it's unlikely to really be true.
- vkazanov 55 minutes ago
  The problem is not that there are details, the problem is the constantly shifting ground. We can only rely on a harness to be sort of predictable, but the models change all the time.
- hashmap 12 minutes ago
  totally true. one key for claude is to not smell like an evaluator: it's good at knowing when it's being tested and will behave defensively and avoid doing work. i avoid this basin by typing as if i'm unreasonably excited about the thing i want done. like way over the top. it's harder to keep that up than it sounds.
- reverius42 1 hour ago
  These are the vibes that power vibecoding.
zmmmmm 1 hour ago
That's a great write up.
The one thing I feel it seems to underestimate is the likelihood of improvement. Even the authors acknowledge it's not even worth comparing local models from a year ago to what we have now. In fact, people widely see Opus 4.5 in November last year - 8 months ago - as the first time agentic coding became broadly viable even with frontier hosted models.
So why would we lock in hard on any concept at this point of what a local model is and isn't good for? Whatever it is right now, it probably won't be that in a year. It might be naive optimism to think we'll ever get to long horizon tasks with models that run on consumer / pro grade hardware. But so far the naive optimists are winning.
- sanderjd 1 hour ago
  Right. Opus 4.5 8 months ago, good enough for agentic coding. How far behind that are open weight models? More than 8 months? But how much more? When will they reach Opus 4.5 level? A few months from now? A year from now? Never?
  - theplumber 46 minutes ago
    I think in the next 6 months we will have Opus 4.5 performance in open models. We are very close.
  - marak830 35 minutes ago
    GLM 5.2 came out today and the early reports have been quite good. Very difficult to run except on prosumer hardware, but a small business could run it quite easily (or use something like OpenRouter).
- rippeltippel 1 hour ago
  Since the author is referring to a specific model, I think it makes sense to ignore how the model (or local models in general) may improve over time.
  It's like buying a car: I drive that car and get attuned to its characteristics; I don't think about how that car (or similar cars) may improve. That's my tool and I want to make the most of it.
  It is true that switching local models is technically very cheap, but there's a considerable time investment in squeezing the most out of one, which may not carry over to a newer version of that model.
- appplication 1 hour ago
  Agree 100%, even on claude 4.5 being the turning point for agentic coding. It completely turned me around on it.
hypfer 1 hour ago
That was a lot of text for me to still have no idea what the author's point was (besides what I can infer from the headline, that is).
I do however now know that they're a totally cool dude building stuff physically and as software + that other people give them money for it.
Does that have anything to do with the topic suggested by the headline? Not sure.
- neonstatic 32 minutes ago
  Everything is an ad these days. The article was not useless, but for the information it provides, it could have been two paragraphs.
  - hypfer 28 minutes ago
    FWIW it told me stuff about openfaas. Now I know how to mentally file it and how to mentally file the author. The GitHub profile alone might not have sent the same signal, so this is useful.
    Is it bad software? Idk. Probably not.
    Should you treat it as a grassroots FOSS thing maintained by fellow sane hackers? No sir.
gpt5 1 hour ago
This article is a good summary of local models. Unlike the way they are hyped sometimes, as fantastic tools for coding and agentic local work. The reality is that they are rather limited, would not do well on a long or complex task, and are prone to fall into loops, forget their tasks, etc. Not mentioned in the article is that they are also rather expensive - not just for the hardware cost, but also electricity. These 3090 and 5090 machines are pretty power hungry, and these models are pretty slow on these machines, making them consume more power per token.
Where they shine is in your ability to control them, their privacy, their predictability (e.g. if you are doing a repetitive task, like classifying your photo/video library), and depending on your energy bill - their costs.
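The "more power per token" point is simple arithmetic: energy per token is power draw divided by generation speed. A rough sketch follows; the function names are made up for illustration, and any wattage, tokens/sec, or electricity price you plug in is your own assumption, since none of these figures appear in the thread or the article.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: watts are joules per second."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


def electricity_cost_per_million_tokens(
    power_watts: float, tokens_per_second: float, price_per_kwh: float
) -> float:
    """Electricity cost of generating one million tokens at a steady rate."""
    joules = joules_per_token(power_watts, tokens_per_second) * 1_000_000
    kwh = joules / 3_600_000  # 1 kWh = 3.6 million joules
    return kwh * price_per_kwh


if __name__ == "__main__":
    # Placeholder inputs only; measure your own machine at the wall.
    print(electricity_cost_per_million_tokens(360, 100, 0.25))
```

This ignores idle draw, prompt processing, and hardware cost, so it is a lower bound on the real cost rather than a full comparison with hosted pricing.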
- usernomdeguerre 1 hour ago
  I believe that local models are a necessary extension of the personal computer and I imagine that one could have had similar criticisms of early personal computers.
  - pmontra 43 minutes ago
    Of course the early MS-DOS PCs were loud and power hungry. I can't remember the specs, but according to Wikipedia the IBM PC with an 80286 had a 192 watt power supply. I don't remember if by then we had internal hard disks or still had to buy a case as large as the PC's own with a 10 or 20 MB disk inside. It was handy for raising the monitor further up.
- i_idiot 1 hour ago
  > Unlike the way they are hyped sometimes, as fantastic tools for coding and agentic local work.
  They really are fantastic for a lot of use cases and I think most people do not need SOTA. When I run that qwen model on my measly 4070 12 GB for the personal email agent that I build and experiment with, I need privacy more than anything else. It does a great job. Even for coding tasks, provided you know how to use them instead of dumping a grand plan on them, they're great.
- sanderjd 1 hour ago
  But that's current hardware. What about future hardware? What about hardware optimized for inference? What about hardware optimized to run a particular model?
wallkroft 36 minutes ago
>Local Qwen isn't a worse Opus >looks inside >local Qwen is not "near Opus levels"
wallkroft 35 minutes ago
>Local Qwen isn't a worse Opus >looks inside >local Qwen is not "near Opus levels"
cptskippy 1 hour ago
I've been running qwen3-5-9b-q4-k-m and qwen3-6-27b-q6-k simultaneously on an Intel Arc Pro B70 with a lot of success.
https://github.com/cptskippy/battlemage-llm-gateway
Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow with pretty good success. One is a personal assistant who manages my backlog and keeps me on task, follows up with me on items, and will put together research briefs. The other I use as a general-purpose coder and researcher, and it's about 50:50 on the tasks I've given it. In fairness though, the task it failed at left me scratching my head to figure out as well.
- askvictor 42 minutes ago
  Does Intel make decent GPUs now? I must be out of the loop...
  - speedgoose 2 minutes ago
    They released a few good value GPUs for LLM inference about a year ago: more memory than AMD and NVIDIA consumer GPUs, not too expensive, but also not great tokens/watt.
    I am not sure whether you can find those in stock anywhere.
- hbbio 1 hour ago
  Interesting setup, thx for sharing.
  How many tokens/sec do you get with 27b? Are you using MTP?
- jauntywundrkind 1 hour ago
  What's the value of running the smaller model too? Why not just the big model for everything? I note both are dense, as well.
  - Ritewut 44 minutes ago
    Tokens per second. The difference between 8B and something like 16B is not as big as you might think in practical usage, and 8B is a lot faster and more interactive than 16B. But there are certain things where it is useful to farm work out to the larger model.
    - Natalia724 36 minutes ago
      Agree. For local coding help, latency often matters more than raw benchmark quality. A slightly weaker model that answers immediately changes how often you reach for it.
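The small-model/large-model split discussed in this sub-thread could be sketched as a tiny router: send everything to the fast model by default and escalate only when a request looks heavy. The model labels are the two names mentioned upthread; the length threshold, the keyword list, and the `choose_model` function are hypothetical, and this is not how the linked gateway is known to work.

```python
from dataclasses import dataclass

# Labels taken from the parent comment; everything else here is an assumption.
SMALL_MODEL = "qwen3-5-9b-q4-k-m"
LARGE_MODEL = "qwen3-6-27b-q6-k"

# Hypothetical keywords that suggest a task needs the slower, larger model.
HARD_HINTS = ("refactor", "debug", "design", "multi-file")


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def choose_model(prompt: str, max_small_chars: int = 2000) -> Route:
    """Pick a model label for a prompt using crude, illustrative heuristics."""
    if len(prompt) > max_small_chars:
        return Route(LARGE_MODEL, "long prompt")
    lowered = prompt.lower()
    for hint in HARD_HINTS:
        if hint in lowered:
            return Route(LARGE_MODEL, f"keyword: {hint}")
    # Default to the fast model so interactive requests stay low-latency.
    return Route(SMALL_MODEL, "default fast path")


if __name__ == "__main__":
    print(choose_model("Rename this variable"))
    print(choose_model("Debug the failing integration test"))
```

Defaulting to the small model reflects the latency argument above: the cheap path is the one you hit most often, and escalation is the exception.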
opptybiz 2 minutes ago
[dead]