> Open models are served via various means, some by the companies that released them and some by third parties like OpenRouter. Unfortunately, both of these routes are dodgier in terms of privacy and data sharing, and I would not feel the same comfort sending API calls containing client or confidential data to them.
That's why I'm using eurouter.ai with the following routing rule for all my requests:
Sure, it's quite expensive, but at least on a legal side data privacy is ensured. I trust them more than e.g. Anthropic, OpenAI or OpenRouter.
Personally, I find it morally unacceptable to use U.S. AI tools, because I do not want to support them financially and thus support the crimes they are involved in[1].
Why use EU specifically? I get not trusting the US, of course, but surely the EU isn't far behind in its desire to spy on its own citizens. Do you not live there?
From all the large governmental institutions, the EU is the one currently holding up traditional western values. That gives it street cred in this subject.
With all the issues in the US and generally wrong direction, I can’t remember them ever arresting people for mean tweets in the way that Germany and the UK have. They all seem to be running full speed towards a surveillance state.
> With all the issues in the US and generally wrong direction, I can’t remember them ever arresting people for mean tweets in the way that Germany and the UK have.
Then you haven't been paying attention. The constitution prevents citizens from being convicted, but that doesn't stop arrests or being turned away at the border (even for permanent residents who've lived in the US for decades), and US citizens don't seem to care, so it's cold comfort for many of us.
You are talking about something different (in bad faith). Please share a single instance of a US citizen being arrested for an offensive social media post.
The UK can arrest you for hate speech. You can disagree with that policy on free speech terms if you want, and that’s really a maximal free speech position. It’s a very strange position to hold if you’re claiming that the U.S. is better when it comes to free speech. The U.S. administration is engaged in active smear campaigns against anyone who speaks loudly against them, threatened to revoke licenses of media companies, they’re suing people and corporations to silence them and pressure them into conformance, they’re threatening to deport people who are simply expressing anti-Israel views, threatened to remove funding from universities, deployed the military in cities they don’t like for no other reason than intimidation of political rivals. This is just off the top of my head.
There’s just no comparison really. You must really be inhaling some nonsense X propaganda if you think government overreach is worse in Western Europe.
> The US federal government forced Paramount to take Colbert off the air.
Not really; the Ellisons are quite close to Trump. Nobody was forced to do anything. Had the FCC actually revoked their license, and had Paramount actually been willing to fight, they could have sued. It's not easy to force anyone that rich to do anything; the state works on behalf of capital. It seems like europe is more aware of the meaningless bluster than the actual crimes being committed
There are much better things to point to to illustrate the deterioration of the rule of law, like blatantly illegal deportation of citizens without due process. Or raping children in concentration camps under the guise of cracking down on crime. We may never even know who was seized and what happened to them and there's little incentive for our very pro-corporate media to report on it.
I'm honestly not really sure what "traditional western values" have to do with where to store data. What does that even refer to—individualism? Christianity? Representation in court by lawyers? How does this intersect with the topic at hand?
I think it's interesting that people write off open weight models because they're "a few months behind" proprietary models.
I know LLMs move at the speed of light (especially these past few quarters), but if Opus and GPT "a few months ago" were really like open weight models, then there's really no reason to not switch, especially for those who were using these models a few months ago.
Your codebase didn't change, so use the open weight model. Don't move the goalposts.
Every new proprietary model is "groundbreaking" and "look, it just solved task X that no other model could solve," only to be referred to as "that crappy previous-generation model" a month later.
So yeah, I'm totally fine using Kimi-2.7, GLM-5.2 or Deepseek-v4. I think we've already hit the ceiling and most improvements now seem to be from harness improvements and slightly better RL to improve reasoning/tool calling.
There's at least the possibility that they intentionally degrade the models as time passes. We can't really verify that we're getting what we're paying for all of the time. All the more reason to invest in local inference.
What if the new model is exactly as good as the last model on launch day but better than the last model was on the new model's launch day because it was degraded? Every single time?
The thought has definitely crossed my mind. I don't think it's true because there's definitely an improvement when new models are released.
Maybe the truth is the newest models aren't actually as impressive as we thought. Maybe our perception of progress is being manipulated via months of gradual, silent and unverifiable degradation.
People talk about this a lot. What I have never seen is a discussion of methods they might employ to degrade the models.
Let’s say I’m a bad faith LLM operator, and I want to degrade my model so the next release looks better and people want to switch to the more expensive one. How would I do that?
They would quantize the model. That'd make it cheaper to run, and have slightly worse output but it would still generate outputs with a similar feel, derived from a compressed version of the same knowledge base etc.
They wouldn't even need to do this uniformly, quantized versions of the model could be routed only a subset of the requests. They could do this to nerf the old model, or more likely just to give themselves more hardware to run the new one on by handling more requests on less hardware. Or to handle increased request volume as traffic ramps up faster than hardware can be provisioned.
Playing with local models at various quants, the degradation can be hard to spot. Sometimes it's only noticeable in aggregate. And even then, you never really know if you just got unlucky with a bad response due to RNG.
I've had Opus 4.6 fall into some weirdly incoherent loops that I rarely see from even Sonnet, that felt like the kind of thing I got frequently with Qwen3.5 9B on local. And the above applies... Was that just bad RNG? Or was my request to Opus routed to some lower quality variant? There's no great way for me to tell for any given request, nor any way to guarantee Anthropic _didn't_ do that.
Unless what you're getting is really explicitly spelled out in a contract, you should flatly assume that they're doing whatever they like whenever they like.
Current prices are insane but at this point I'm starting to feel like it's an existential issue. I'm not a US citizen. At any point the USA could come up with some arbitrary export controls. Not having a computer capable of running at least Qwen is starting to actually seem risky to me.
At least it's going to be usable as a very high end gaming PC.
Why would you buy and build everything before the low probability catastrophe strikes, though? You don’t get any benefit from switching early and you pay a big opportunity cost.
Also, there's a nontrivial learning curve involved in running your own inference server, once you move past the casual-goofing-around-with-llama-server stage. If you care about not being a sharecropper on Sam's or Dario's plantation, you should consider learning the ropes. Even if you don't put these skills to immediate use in your day job.
I didn't appreciate this until I started down that road myself.
> If you care about not being a sharecropper on Sam's or Dario's plantation
Couldn't have put it better myself. That's what all this comes down to. Owning the hardware, owning the inference. Not perpetually renting them out on a meter like in the dystopian future they're envisioning.
There is also a low probability that someone enters peace negotiations solely to threaten the negotiators with death, yet here we are. With these guys it is: Better safe than sorry.
I’m an LLM fan, but from an engineering perspective the idea of building atop services that palpably fluctuate in capacity, performance, and capability is nutty.
Even with minor automation I feel like I can watch OpenAI and Anthropic engineers fiddling in real-time. Tuesdays behaviour changes by Thursday, 10AMs production isn’t possible at 11:30AM. Nutty.
I chilled significantly on using Google for anything to do with business due to API (and offering) stability. (Still use Google for personal things.) But AI models seem orders of magnitude more fluid, so to my risk-averse eye, they're nothing I'd base my own business on.
> I think it's interesting that people write off open weight models because they're "a few months behind" proprietary models
I experiment a lot with the open models and I’m getting tired of this trope. I’m not yet convinced that even the best open weight models are equal to Opus from “a few months” ago.
I know what the benchmarks say. I had higher hopes. My real experience just doesn’t match the benchmarks.
I also do a lot of work that even Opus 4.8 struggles with. When even the cutting edge LLMs aren’t all the way there yet, my motivation to switch to something even further behind just isn’t there.
Have you found anything specific that the full-precision quant of GLM 5.2 can't do that Opus 4.8 can? I haven't, so far.
5.2 lives up to the hype. I don't find it to be the best at anything except coding. But for coding... yeah, it lives up to the hype. Not quite Opus 4.8-level, but I would feel comfortable comparing it to 4.5, at least if it had vision capabilities.
We have a provider with Deepseek V4 flash at our work. It can handle 95% of the "actually functional" workload at a tenth of the cost. I still pull up beefier ones sometimes, but that's after some consideration.
The moat is so flat, it only gives +1 food and +1 production. +1 gold with a road.
Intelligence is maybe a few months behind. But cost sadly is further behind. GLM-5.2 has a deceptively high cost during day-to-day usage for e.g. coding because 1) it has to think a ton more than GPT-5.5/Opus-4.8 to get to competitive results; 2) many providers are still figuring out caching; and 3) API pricing for Codex/Claude can be as high as 40x more than subscription pricing, which distorts the market.
For that matter, the new models are shit. If I’m using Opus 4.6 anyway to get anything actually done, then great, we’re actually entirely caught up then.
> I think it's interesting that people write off open weight models because they're "a few months behind" proprietary models.
The really interesting thing is that it's typically those very same accounts who were explaining, a few months ago, that thanks to their commercial model they were gaining so much time and producing so much fantastic code.
A few months passes and suddenly the open-source model have caught up with the models that were gaining them so much time and that produced amazing code (in production everywhere for sure btw) but... It's impossible to work with these models.
Rinse and repeat.
The current models, according to them, are basically AGI and they can go fishing while paid subscriptions solve the world's problems.
But when it six months there shall be new closed, pricey, models and when the open ones shall have reach the level of Fable, we'll hear how it's impossible to work in late 2026 on a model that is "only at the level of Fable".
These people should have been snake-oil salesmen (and it could be what they actually are).
My most charitable interpretation that there's some honeymoon effect for each release, and people genuinely feel very productive and useful for 2-3 months. By the time the next big model release happens they've seen some issues or run into something that makes them feel like the new model will fix all that and improve their flow so much, etc.
Not unusual in the tech space, but this has been basically constantly happening for two years now? I can't imagine the improvements are more than incremental at this point.
They are generally referred to as the Kool-Aid drinkers. There's always something holding them back from open models. It's no different than the argument in the article. I've been daily driving Linux for well over 20 years at this point and while things have gotten easier they haven't gotten that much easier. There's always been a distro that's focused on new users or ease of use. I used to take for granted the Linux distro ecosystem but now worry how Microsoft, Apple and others will continue to try and legislate compute into a corner. I can appreciate good engineering, but when I look at OS X and Windows they're both failing end users in different ways.
Just like the OS ecosystem I think we'll see a similar trajectory with OAI, Anthropic and Google but on a much accelerated time scale. I think the lobbying has begun to lock in their fate for revenue - because none of them give a shit about their users. I do hope, however, that Anthropic continues to over rotate and continue to gimp their models into uselessness. I just asked Opus 4.8 the other day to look at some code as an adversary and summarize areas that should be addressed. Nothing specific and it shut down the conversation. However starting a new prompt and prodding the model from a different angle yielded the results I asked for directly. Pick a lane. Or, don't and continue to lose industry respect and consideration.
Even just one of the smaller models is good enough for the grunt work I use them for 90% of the time. Currently doing most of my home hobby projects with OpenCode Go and Qwen 3.7 Plus, it's not great at diagnosing issues in the code, but if I can clearly articulate a test suite or boilerplate refactoring it works fine.
The headline says one thing, then the article text says this:
> I’m hoping it’s going to be minimal.
I have multiple subscriptions and I pay per token to try out different LLM providers through OpenRouter. I also run open weight models locally.
I just can’t agree yet. The models from Anthropic and OpenAI really are that much better than anything else. The open weight models must be universally benchmaxxed across the board because my real world experience with them is very different than what the benchmarks imply. I get downvoted a lot for speaking about my experience because I don’t think it’s the reality that people want to hear right now, but it’s true for complex work.
I do think there are a lot of easier tasks that can be handled appropriately by the open weight models in the hands of a skilled operator. If an entire job is simple enough that you wouldn’t hesitate to hand it off to a junior with a little supervision then any model will do. However for a lot of the work I do, even Opus 4.8 on Max requires a lot of attention and extra steering and review to keep it on track. Fable did, too, though to a lesser degree. When I try to use the big open weight models (hosted, because they’re not running at reasonable speeds locally at a quantization I can tolerate) it feels like I spend more time waiting while they burn tokens for output that I probably have to reject anyway, at least for the bigger tasks. I wish they were there, but that’s not the case yet.
Claude started becoming useful for my coding purposes after it hit version 4.6. After that sure some nice to have additions but I think if I had 4.6 sonnet & opus as open weights, I would not need something more.
Having played a bit with Fable, reinforced the above.
One big advantage I’ve found — people get attached to models (including me). With open models if you find one that works perfectly for you but the next version doesn’t, you can run the old one forever (or someone will for you)
But… the models will fall behind. As libraries and languages and tool calling updates or the world knowledge changes, the models decay.
Personally, I don’t like the change, but it’s just how technology works so I’d rather move with the flow than try to stick my foot down and freeze time.
the pricing page doesn't seem to call it out anymore, but the claim on z.ai coding plan used to be 3x the usage of the equivalent-price claude plan. whether that's accurate i don't know, but just based on api pricing GLM is way cheaper.
I’ve been wanting to get better acquainted with local inference but I don’t have the hardware, which has made me think about something I haven’t seen discussed, which is local collaboratives. The economics makes it seem like a group of people joining together to run good hardware and an open model might make sense, but I haven’t seen anything like this mentioned. Have I been missing it?
I think it would be pretty neat to launch a service helping people who wanted to participate in something like that locate one another.
There are plenty of providers of open models that offer very affordable rates. Generally, I recommend looking at OpenRouter since they track various metrics for the various providers.
TL;DR: Running GLM 5.2 is going to cost about $20K minimum, and that's going to be painfully slow compared to the cloud hosted versions. Even the estimates where the server is computing tokens 24/7 you can't break even for several years.
The only reason to run locally is if complete data privacy is your top concern. You pay a high premium for that.
Have you read about Opencode Go? They are great provider for open model, like GLM 5.2, Deepseek v4 Pro, Kimi 2.7 Code. You should give it shot to them :-)
Open source models are still not good enough for now, but with the current speed of one new SOTA every two months, by this time next year we will definitely have cheap open source models at least as good as Fable :)
I don't think we will. The open model labs are too resource constrained to approach Fable or even Opus on the general case and I don't see that changing within a year.
Right now, due to profound shortfalls in both data and hardware compared to the US labs, the OSS models are IMO basically technology demonstrators that in practise are even more jagged than the US labs' efforts. The high points of the jaggedness are close - but number of happy paths is many times fewer, and their behaviour inside the harness is far less refined. Barring some incredible breakthrough I don't think that is changing without a much higher level of resources - which seems impossible given the current hardware environment.
I have no reason to think that Anthropic or OpenAI are in possession of some secret sauce that the Chinese labs can't duplicate given the right resources, but the fact remains that absent those resources they'll remain behind. Barring some incredible bombshell reveal from Huawei I don't think this asymmetry resolves in a year. In three years it may well be a different story.
I think the frontier will command premium for sometime just as slight better software developers were 10x's vs their peers as their architecture & development strategies and code approach compounded quickly. One less error per block of work compounds quickly.
Sure, there may be some cases and reasons for local models and industry is so large they will continue to make progress and gather economic value and users for specific use case; but frontier will command vast majority of the economic value distinct from Linux and open source where the model created better than proriatary economic incentives around development
Ultimately its a financial game. Open source is far cheaper so it already has an upper-hand. Frontier models have to justify financially why they are worth the additional spend.
10x developers were not slightly better than their peers, they were vastly superior and faster. OTOH, the lead of frontier llms is diminishing as training is getting diminishing returns.
Also, on that note. Not every company needs 10x developers, just as not every task needs frontier llms. Ultimately, operating costs will be the largest contributing factor.
I am absolutely pro local and true open source models.
Personally I haven't seen any productivity gain since Opus 4.5 times.
But: I can't fully get behind the opinion that (so called) "open source models" are simply superior and will be in the future, because when I asked some models who they are, they answered with "I am Claude from Anthropic", which could mean they have been trained by exfiltrating Claude.
I have NO moral objection to this, as Anthropic and "Open""AI".also trained their models on anything they could get their hands on.
It's more about the question: can and will these models be updated, even if Anthropic et al fail. Who's gonna pay for training then? What's their incentive? Have we reached a plateau?
yeah, on a 96GB Mac Studio and Gemma+Qwen, it's definitely fully doable. fully doable but not really for coding on 16GB. but svelter models and cheaper (eventually) hardware are coming!
"cheaper (eventually) hardware"
Best case 2-3 years from now. Otherwise it will take a major global recession to get us anywhere near last year's prices.
I suspect hosted and local will converge when hardware prices come down and API prices go up. The massive rate of datacenter build out will be unsustainable. Right now the hosted models are massively cheaper than buying the hardware and running it yourself which signals that hosted is very subsidized.
They use the GPU but an Apple Silicon GPU has the same high speed access to all the RAM on the machine as the CPU does, rather than having its own walled-off maybe 16 GB VRAM in mainstream gaming GPUs or 24 GB in RTX 4090 or RTX 5090 (MSRP $1999 but in practice $3000-$4000 at the moment). Nvidia A100 (80GB VRAM) apparently cost $15,000 or so.
Not only does Apple's unified memory give the GPU more RAM to use, but it also eliminates copying things between CPU RAM and GPU RAM.
A Mac Mini with 48 GB RAM costs $1799. A Mac Studio with 96 GB RAM is $3999 — until March you could get a Mac Studio with 512 GB RAM for $3999, all of which could be used for your AI model.
If you don't have that hardware thr math of buying a depreciating computer is challenging if you are satisfied with the $100/month plans ($1200/year). A 96GB Mac Studio is ~$4k. I think if you have the hardware already as a sunk cost then yes it makes sense. But I'm not sure it is worth spending $4k for today's hardware vs waiting for newer hardware in a few years.
I think once the hardware process comes down and these mini DGXs become cheaper, and by then open models still be smaller and better, there is going to be less and less reason to use the providers.
CEOs are already complaining that they are costing too much. There are also large organisations like Banks which can't use external services and are already looking at internal housing.
it's a good thing so the big AI companies just went IPO as once the self hosting trend kicks in they are going bust.
>There was a time not too long ago when using Linux entailed some professional risk1. First there was compatibility: you may not have been able to render a Word document or PowerPoint correctly, and you might have had to trust Open Office’s export capability to render docs the way you wanted
For a while during this era, I used to port my laptops windows installation into a virtual machine that can run on Linux. It took a bit of hacking away but I could usually do it in a day or two. Then its all Linux with the windows vm being used for the microsoft stuff.
I know open models have gotten quite good in many tasks such as coding or composition, but are there any that can access the internet and retrieve data like ChatGPT, Claude, etc can?
I do have to admit I have recently begun wishing I could pay five dollars a month for a "just answer the fucking question" plan that would give me results without the guardrails and without the constant simpering and ego-stroking. I keep finding myself going a quick evaluation of "is it faster for me to skim search results myself or to construct an elaborate narrative to make an AI give me a real answer".
That's why I like qwen3.6 27B, it has 0 ego, it knows that it doesn't have complete world knowledge, so when it sees a web_search tool it searches all the time. Even qwen3.5 9B is mostly search-eager (but given the size, it's weaker on reasoning on the results if that's needed). I use a stock pi harness with only web_search and web_fetch (cleans up the html to only keep text) tools defined.
I have given up on making Opus actually retrieve online information for me. At this point I only query it side by side with qwen to laugh at how it didn't even attempt to search properly, and how a small local model is beating it every time. Gemini is very fast for searching, but somehow miss-sources all the time.
> I know open models have gotten quite good in many tasks such as coding or composition, but are there any that can access the internet and retrieve data like ChatGPT, Claude, etc can?
The things you describe are just tool calling, they're a feature of whatever harness you use. Use OpenCode, pi.dev, or maki.sh with any of the open models.
> I do have to admit I have recently begun wishing I could pay five dollars a month for a "just answer the fucking question" plan that would give me results without the guardrails and without the constant simpering and ego-stroking. I keep finding myself going a quick evaluation of "is it faster for me to skim search results myself or to construct an elaborate narrative to make an AI give me a real answer".
You can do most of this with some system prompts added to whatever agent you're using. You can do it from the settings on the claude/chatgpt websites too. (minus the no-guardrails thing)
You can let the AI solve it itself, and then it will provide two solutions: implement a local search service (easily blocked), or purchase a Web Search API service
As someone that has pretty powerful desktop that I've been using with local open weight models, people are far exaggerating the quality of them. Some of them are now useful. They don't compare yet to the online models of ChatGPT, Claude, Gemini, etc. They are still about 18 months behind. I have accomplished useful work with them, like image classification on Gemma4, but they are much much slower, much much more expensive and they don't scale at all.
A $10,000 RTX 6000 Blackwell card will pay for 500 months of Claude or Codex, which is 40 years worth of compute. Obviously they are going to raise their prices, my prediction being to $200-500/month, but that still makes them at least years of compute and they scale very well with more traffic. Single GPUs do not, they are pegged at 100% and good luck getting it to answer multiple queries at the same time.
That's why I'm using eurouter.ai with the following routing rule for all my requests:
Sure, it's quite expensive, but at least on a legal side data privacy is ensured. I trust them more than e.g. Anthropic, OpenAI or OpenRouter.Personally, I find it morally unacceptable to use U.S. AI tools, because I do not want to support them financially and thus support the crimes they are involved in[1].
[1]: https://news.ycombinator.com/item?id=48512339
- The prices are ridiculous (15 % markup for free account).
- They have a rate limit of 1000 requests per month, unless you pay 40€ per month for ... what exactly is their value proposition?
- They have a single provider (TensorX) for DeepSeek-V4-Pro, with a cache read cost that is over 100 times higher than DeepSeek ($0.44 vs $0.003625).
Then you haven't been paying attention. The constitution prevents citizens from being convicted, but that doesn't stop arrests or being turned away at the border (even for permanent residents who've lived in the US for decades), and US citizens don't seem to care, so it's cold comfort for many of us.
This seems tautological because Europe is pretty weak on the values that people in the US might care about (freedom of speech, limited govt, etc).
What values specifically are you optimizing for here?
There’s just no comparison really. You must really be inhaling some nonsense X propaganda if you think government overreach is worse in Western Europe.
The US federal government forced Paramount to take Colbert off the air. Seems that people in the US don’t actually value these things.
> What values specifically are you optimizing for here?
Probably not being fascist.
Not really; the Ellisons are quite close to Trump. Nobody was forced to do anything. Had the FCC actually revoked their license, and had Paramount actually been willing to fight, they could have sued. It's not easy to force anyone that rich to do anything; the state works on behalf of capital. It seems like europe is more aware of the meaningless bluster than the actual crimes being committed
There are much better things to point to to illustrate the deterioration of the rule of law, like blatantly illegal deportation of citizens without due process. Or raping children in concentration camps under the guise of cracking down on crime. We may never even know who was seized and what happened to them and there's little incentive for our very pro-corporate media to report on it.
Maybe it was funny to you, but designing data platforms that respect GDPR and involve LLMs is a thing.
I know LLMs move at the speed of light (especially these past few quarters), but if Opus and GPT "a few months ago" were really like open weight models, then there's really no reason to not switch, especially for those who were using these models a few months ago.
Your codebase didn't change, so use the open weight model. Don't move the goalposts.
So yeah, I'm totally fine using Kimi-2.7, GLM-5.2 or Deepseek-v4. I think we've already hit the ceiling and most improvements now seem to be from harness improvements and slightly better RL to improve reasoning/tool calling.
Maybe the truth is the newest models aren't actually as impressive as we thought. Maybe our perception of progress is being manipulated via months of gradual, silent and unverifiable degradation.
Let’s say I’m a bad faith LLM operator, and I want to degrade my model so the next release looks better and people want to switch to the more expensive one. How would I do that?
They wouldn't even need to do this uniformly, quantized versions of the model could be routed only a subset of the requests. They could do this to nerf the old model, or more likely just to give themselves more hardware to run the new one on by handling more requests on less hardware. Or to handle increased request volume as traffic ramps up faster than hardware can be provisioned.
Playing with local models at various quants, the degradation can be hard to spot. Sometimes it's only noticeable in aggregate. And even then, you never really know if you just got unlucky with a bad response due to RNG.
I've had Opus 4.6 fall into some weirdly incoherent loops that I rarely see from even Sonnet, that felt like the kind of thing I got frequently with Qwen3.5 9B on local. And the above applies... Was that just bad RNG? Or was my request to Opus routed to some lower quality variant? There's no great way for me to tell for any given request, nor any way to guarantee Anthropic _didn't_ do that.
At least it's going to be usable as a very high end gaming PC.
I didn't appreciate this until I started down that road myself.
Couldn't have put it better myself. That's what all this comes down to. Owning the hardware, owning the inference. Not perpetually renting them out on a meter like in the dystopian future they're envisioning.
There is also a low probability that someone enters peace negotiations solely to threaten the negotiators with death, yet here we are. With these guys it is: Better safe than sorry.
Long term predictability ought to far outweigh a few more cycles of performance.
The top models also seem to have inconsistent performance depending on the time of day and how far we are from the next release.
Even with minor automation I feel like I can watch OpenAI and Anthropic engineers fiddling in real-time. Tuesdays behaviour changes by Thursday, 10AMs production isn’t possible at 11:30AM. Nutty.
https://marginlab.ai/trackers/claude-code-historical-perform...
There were at least a couple of these degradation trackers.
I experiment a lot with the open models and I’m getting tired of this trope. I’m not yet convinced that even the best open weight models are equal to Opus from “a few months” ago.
I know what the benchmarks say. I had higher hopes. My real experience just doesn’t match the benchmarks.
I also do a lot of work that even Opus 4.8 struggles with. When even the cutting edge LLMs aren’t all the way there yet, my motivation to switch to something even further behind just isn’t there.
5.2 lives up to the hype. I don't find it to be the best at anything except coding. But for coding... yeah, it lives up to the hype. Not quite Opus 4.8-level, but I would feel comfortable comparing it to 4.5, at least if it had vision capabilities.
That's exactly the problem I have... with Anthropic and "Open""AI"
The moat is so flat, it only gives +1 food and +1 production. +1 gold with a road.
The really interesting thing is that it's typically those very same accounts who were explaining, a few months ago, that thanks to their commercial model they were gaining so much time and producing so much fantastic code.
A few months passes and suddenly the open-source model have caught up with the models that were gaining them so much time and that produced amazing code (in production everywhere for sure btw) but... It's impossible to work with these models.
Rinse and repeat.
The current models, according to them, are basically AGI and they can go fishing while paid subscriptions solve the world's problems.
But when it six months there shall be new closed, pricey, models and when the open ones shall have reach the level of Fable, we'll hear how it's impossible to work in late 2026 on a model that is "only at the level of Fable".
These people should have been snake-oil salesmen (and it could be what they actually are).
Not unusual in the tech space, but this has been basically constantly happening for two years now? I can't imagine the improvements are more than incremental at this point.
Just like the OS ecosystem I think we'll see a similar trajectory with OAI, Anthropic and Google but on a much accelerated time scale. I think the lobbying has begun to lock in their fate for revenue - because none of them give a shit about their users. I do hope, however, that Anthropic continues to over rotate and continue to gimp their models into uselessness. I just asked Opus 4.8 the other day to look at some code as an adversary and summarize areas that should be addressed. Nothing specific and it shut down the conversation. However starting a new prompt and prodding the model from a different angle yielded the results I asked for directly. Pick a lane. Or, don't and continue to lose industry respect and consideration.
not all of us are doing noob shit lol
> I’m hoping it’s going to be minimal.
I have multiple subscriptions and I pay per token to try out different LLM providers through OpenRouter. I also run open weight models locally.
I just can’t agree yet. The models from Anthropic and OpenAI really are that much better than anything else. The open weight models must be universally benchmaxxed across the board because my real world experience with them is very different than what the benchmarks imply. I get downvoted a lot for speaking about my experience because I don’t think it’s the reality that people want to hear right now, but it’s true for complex work.
I do think there are a lot of easier tasks that can be handled appropriately by the open weight models in the hands of a skilled operator. If an entire job is simple enough that you wouldn’t hesitate to hand it off to a junior with a little supervision then any model will do. However for a lot of the work I do, even Opus 4.8 on Max requires a lot of attention and extra steering and review to keep it on track. Fable did, too, though to a lesser degree. When I try to use the big open weight models (hosted, because they’re not running at reasonable speeds locally at a quantization I can tolerate) it feels like I spend more time waiting while they burn tokens for output that I probably have to reject anyway, at least for the bigger tasks. I wish they were there, but that’s not the case yet.
Having played a bit with Fable, reinforced the above.
Personally, I don’t like the change, but it’s just how technology works so I’d rather move with the flow than try to stick my foot down and freeze time.
I think it would be pretty neat to launch a service helping people who wanted to participate in something like that locate one another.
There's a post at the top of /r/localllama about this exact math right now: https://www.reddit.com/r/LocalLLaMA/comments/1ubrcwj/tokenom...
TL;DR: Running GLM 5.2 is going to cost about $20K minimum, and that's going to be painfully slow compared to the cloud hosted versions. Even the estimates where the server is computing tokens 24/7 you can't break even for several years.
The only reason to run locally is if complete data privacy is your top concern. You pay a high premium for that.
I like the Linux analogy, I struggled with Linux way back.
$10 a month gets you generous usage with the best open weight models and they claim to have zero retention and not to train on your usage.
It’s unclear to me what the advantages of openrouter are but it seems to be a default I see many people talking about here.
Right now, due to profound shortfalls in both data and hardware compared to the US labs, the OSS models are IMO basically technology demonstrators that in practise are even more jagged than the US labs' efforts. The high points of the jaggedness are close - but number of happy paths is many times fewer, and their behaviour inside the harness is far less refined. Barring some incredible breakthrough I don't think that is changing without a much higher level of resources - which seems impossible given the current hardware environment.
I have no reason to think that Anthropic or OpenAI are in possession of some secret sauce that the Chinese labs can't duplicate given the right resources, but the fact remains that absent those resources they'll remain behind. Barring some incredible bombshell reveal from Huawei I don't think this asymmetry resolves in a year. In three years it may well be a different story.
Sure, there may be some cases and reasons for local models and industry is so large they will continue to make progress and gather economic value and users for specific use case; but frontier will command vast majority of the economic value distinct from Linux and open source where the model created better than proriatary economic incentives around development
Ultimately its a financial game. Open source is far cheaper so it already has an upper-hand. Frontier models have to justify financially why they are worth the additional spend.
Also, on that note. Not every company needs 10x developers, just as not every task needs frontier llms. Ultimately, operating costs will be the largest contributing factor.
Personally I haven't seen any productivity gain since Opus 4.5 times.
But: I can't fully get behind the opinion that (so called) "open source models" are simply superior and will be in the future, because when I asked some models who they are, they answered with "I am Claude from Anthropic", which could mean they have been trained by exfiltrating Claude.
I have NO moral objection to this, as Anthropic and "Open""AI".also trained their models on anything they could get their hands on.
It's more about the question: can and will these models be updated, even if Anthropic et al fail. Who's gonna pay for training then? What's their incentive? Have we reached a plateau?
I enjoyed the first part though
and what hardware are you using?
Not only does Apple's unified memory give the GPU more RAM to use, but it also eliminates copying things between CPU RAM and GPU RAM.
A Mac Mini with 48 GB RAM costs $1799. A Mac Studio with 96 GB RAM is $3999 — until March you could get a Mac Studio with 512 GB RAM for $3999, all of which could be used for your AI model.
https://www.tomshardware.com/tech-industry/apple-pulls-512-m...
Some are coming up used at silly prices.
https://www.trademe.co.nz/a/marketplace/computers/desktops/a...
NB NZ$44,999 is "only" US$25,772.
For a while during this era, I used to port my laptops windows installation into a virtual machine that can run on Linux. It took a bit of hacking away but I could usually do it in a day or two. Then its all Linux with the windows vm being used for the microsoft stuff.
I do have to admit I have recently begun wishing I could pay five dollars a month for a "just answer the fucking question" plan that would give me results without the guardrails and without the constant simpering and ego-stroking. I keep finding myself going a quick evaluation of "is it faster for me to skim search results myself or to construct an elaborate narrative to make an AI give me a real answer".
I have given up on making Opus actually retrieve online information for me. At this point I only query it side by side with qwen to laugh at how it didn't even attempt to search properly, and how a small local model is beating it every time. Gemini is very fast for searching, but somehow miss-sources all the time.
First time I did this I realized in 5 seconds that the big players weren’t going to be carving up the market between them.
The things you describe are just tool calling, they're a feature of whatever harness you use. Use OpenCode, pi.dev, or maki.sh with any of the open models.
> I do have to admit I have recently begun wishing I could pay five dollars a month for a "just answer the fucking question" plan that would give me results without the guardrails and without the constant simpering and ego-stroking. I keep finding myself going a quick evaluation of "is it faster for me to skim search results myself or to construct an elaborate narrative to make an AI give me a real answer".
You can do most of this with some system prompts added to whatever agent you're using. You can do it from the settings on the claude/chatgpt websites too. (minus the no-guardrails thing)
A $10,000 RTX 6000 Blackwell card will pay for 500 months of Claude or Codex, which is 40 years worth of compute. Obviously they are going to raise their prices, my prediction being to $200-500/month, but that still makes them at least years of compute and they scale very well with more traffic. Single GPUs do not, they are pegged at 100% and good luck getting it to answer multiple queries at the same time.