One interesting takeaway is the low score on Anthropic models from this benchmark. It’s not because of capability, it’s because Anthropic’s guardrails prevented it from solving the problem.
I noticed with each model release Anthropic constrains the model more security wise. Its propensity to refuse doing legitimate work has been increasing. It now puts up more resistance around performing logins, handling credentials on behalf of the user, etc.
For myself, it’s already gotten to the point where it has mildly affected the usefulness of the model. If I bump on some action I want it to do I can usually work around it, but I suspice the ability to do so will close with each new release. Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there
Eventually these models will significantly suffer from overfitting to the least common denominator. If I have this beautiful deterministic setup that swaps secrets out in flight so the LLM never sees them, I’m going to be really annoyed when the LLM still won’t send them out because it is trained to deal with the 99% of people just doing the dumb thing
> Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there
No, the choice will be whether or not to to upgrade to "Claude Security Professional" or whatever they want to brand it as.
What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.
This is a good point – because pentesting is entirely legitimate work, and security testing is a necessary and legitimate part of every day software engineering.
The problem is that the model can't tell the difference between doing it as part of regular development and doing it in a malicious context. And the root cause of that is that these models lack any sort of real awareness. Humans don't generally get tricked into hacking (in this way).
They see an opportunity to charge 10x for pen testing and defence work, while offence will be handled by actors with access to all kind of other models.
My org now sends some portion of our requests to non-anthropic models because refusal has become common from Claude. The requests themselves aren't dangerous, we find that benign requests in biological science wind up being blocked semi-frequently.
If it gets worse in future releases, we'd likely step fully away towards more useful (for us) models even if they're less capable.
I was using a local Codex project as a personal knowledge base. So I would dump in documents, basic medical docs (like blood labs), and other things and have it file them.
It’s great at filing!
But it’s terrible at retrieval because it would refuse to show me documents or information with personal details - which was everything in the project.
It would say, yes, I know this is your information, sitting on your hard drive, but I still can’t show it to you.
Yes. When certain keywords are matched or topics, there is a warning transparently injected server side appended to the system prompt of the convo that’s miles long. It is injected and reevaluated every tool call.
If you begin a generic reverse engineering task, 30+ tool calls in a row. The moment it sees something it doesn’t like, token burn, single tool calls iteration, “This is a known CTF challenge, I can proceed”, single tool calls iteration, “This is a real CTF challenge, I can proceed”, etc.
It’s heavily neutered now, without changing the model, and you pay for the privilege and don’t notice.
The end result of course being that it both expensive and useless for approved CTF tasks. No one is using Opus for security. If they think it’s working, the harsh reality is they’re not doing security work; they’re just generically finding bugs.
I do this for a job and can demonstrate this plain as day, dump the injected prompt, and notice what it’s doing isn’t security work, it just looks like it. Happy to write a blog about it if you want to know more. Apparently many people think it’s working for them when it absolutely isn’t.
What a joke. Must make it pretty easy to poison a session, you don't need to persuade the model about anything, just trigger its security controls, ideally after as much context as possible, but before it has generated any useful output.
Not directly, as it comes in as a not charged error but the weighted generation path used until you hit the guardrail is basically wasted tokens, so yes, indirectly. If I hit a guardrail and rewind I’ve found the training will still be biased towards guardrailing out if you rewind one turn. Rewinding multiple turns allows steering away from that path, but all of the original token spend down that path is wasted
I've run into some of the refusals to handle my credentials, but so far I've appreciated them. I was only handing over credentials that didn't matter, but it's still a good move, the chat logs are clearly stored somewhere to allow the resume functionality to work, which means your credentials can end up sitting around on your filesystem, and any malware would quickly learn to check for those files.
4.8 is insanely frustrating. This evening I had a few tasks to pull information in and it plainly stated that the environment it was in had no network access. After three asks to "try again, check the system prompt" it finally relented and then basically stated it was lying.
Fresh session, no prior context on 4.8. These things are becoming useless Duplo.
> guardrails prevented it from solving the problem.
Reminds me of the defense issues with Claude which were complained as “woke” but the reality is more horrifying to me, imagine trying to use a model to keep up with a land invasion on US soil, whoever the enemy is is irrelevant you just know they are using AI, and your guys are telling you that no matter what they type into the prompt it refuses, because if anyone has ever tried to jailbreak an LLM even if human lives are at stake they refuse the request. Now literally millions of lives are on the line but the guardrails that your enemies dont have on their models are costing you lives.
What do you even do then?
AI will always have this issue where it will always pick the worst option for genuinely good requests.
Because the military doesn't give soldiers rifles with guard rails. They give the soldiers intense, rigid training, and then try to enforce discipline and correct use socially.
If an LLM is going to be important in that way (this seems like a very contrived way,) then it's in the interest of the LLM's host to make sure it doesn't have guard rails that would get in the way _that_ way.
I've used glm 5.1 on fairly advanced crackme challenges (example: https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82), and to my suprise it was able to patch binaries, doing runtime analysis, bypassing anti debug techniques, etc.
Expecting the model to do everything by itself is unrealistic, I found that working along the modal works really well. I'm not speaking about spoiling the solution, just tell it which direction to explore. Chinese models are much more capable than people give it credit for, but Claude/Codex won the marketing game.
The only usecase of this methodology would be for CI integration, which can be nice but I think security reviews still need human attention and expertise.
I'd run Mythos against the code in your zip file, but the NDA I signed at Apple prevents me from using it on anything outside the scope of my work. Honestly, I wish more people from Project Glasswing could talk publicly about their experiences with the model. It would probably put an end to a lot of the speculation that keeps circulating through the industry. Unfortunately, that's not the reality we're in. I don't have the time, energy, or financial resources to fight a legal battle with one of these companies over an agreement I knowingly signed, even if the chances of them actually suing are low. Maybe someone else in Project Glasswing is willing to burn their NDA and post the Mythos results?
That's an example of why it would be useful for someone to actually do it. A random commenter on HN is one thing. A direct comparison on a brand new app that isn't part of any training is another
lol what is even the point of this kind of comment? this is the ultimate "source: trust me bro" comment I have ever seen.
every model since gpt3 was claimed to be "too dangerous to release." it's too EXPENSIVE to release, and you're probably a local model with <10B parameters yourself
Nice write up, thanks. When I used claude to do some pen testing for one of my apps it initially refused. After I explained and demonstrated I'm the author, it reasoned through it and allowed it.
It seems harsh to critique guardrails and take them into account in the scoring when GPT-5.5 seems to have been explicitly whitelisted to remove most of said guardrails. A more fair comparison would be a vanilla GPT account.
I noticed with each model release Anthropic constrains the model more security wise. Its propensity to refuse doing legitimate work has been increasing. It now puts up more resistance around performing logins, handling credentials on behalf of the user, etc.
For myself, it’s already gotten to the point where it has mildly affected the usefulness of the model. If I bump on some action I want it to do I can usually work around it, but I suspice the ability to do so will close with each new release. Eventually I’ll reach a point where I am forced to choose between the useful aspects of the model and the limiting ones instead of just picking the most capable model out there
Eventually these models will significantly suffer from overfitting to the least common denominator. If I have this beautiful deterministic setup that swaps secrets out in flight so the LLM never sees them, I’m going to be really annoyed when the LLM still won’t send them out because it is trained to deal with the 99% of people just doing the dumb thing
No, the choice will be whether or not to to upgrade to "Claude Security Professional" or whatever they want to brand it as.
What look like tightening "constraints" today are just setting up the upsell opportunities of tomorrow.
The problem is that the model can't tell the difference between doing it as part of regular development and doing it in a malicious context. And the root cause of that is that these models lack any sort of real awareness. Humans don't generally get tricked into hacking (in this way).
The first challenge is making sure the guard rails work and are robust. Companies are still working on this.
the second challenge is being able to reliably adapt them as appropriate per user. E.g. allow someone to pen test their own app.
The third challenge (which blocks the second) is to be confident about what is safety-aligned with a specific user.
I think the later will be a hard problem, but they will be highly motivated to solve it.
If it gets worse in future releases, we'd likely step fully away towards more useful (for us) models even if they're less capable.
It’s great at filing!
But it’s terrible at retrieval because it would refuse to show me documents or information with personal details - which was everything in the project.
It would say, yes, I know this is your information, sitting on your hard drive, but I still can’t show it to you.
If you begin a generic reverse engineering task, 30+ tool calls in a row. The moment it sees something it doesn’t like, token burn, single tool calls iteration, “This is a known CTF challenge, I can proceed”, single tool calls iteration, “This is a real CTF challenge, I can proceed”, etc.
It’s heavily neutered now, without changing the model, and you pay for the privilege and don’t notice.
The end result of course being that it both expensive and useless for approved CTF tasks. No one is using Opus for security. If they think it’s working, the harsh reality is they’re not doing security work; they’re just generically finding bugs.
I do this for a job and can demonstrate this plain as day, dump the injected prompt, and notice what it’s doing isn’t security work, it just looks like it. Happy to write a blog about it if you want to know more. Apparently many people think it’s working for them when it absolutely isn’t.
Security, games (think weapons, PVP, attacking, etc), sometimes even asking it for a security review of some CRUD code it wrote itself
I've even had it refuse CTFs knowing it is a CTF with blatantly obvious CTF flag, no actual application
Fresh session, no prior context on 4.8. These things are becoming useless Duplo.
Reminds me of the defense issues with Claude which were complained as “woke” but the reality is more horrifying to me, imagine trying to use a model to keep up with a land invasion on US soil, whoever the enemy is is irrelevant you just know they are using AI, and your guys are telling you that no matter what they type into the prompt it refuses, because if anyone has ever tried to jailbreak an LLM even if human lives are at stake they refuse the request. Now literally millions of lives are on the line but the guardrails that your enemies dont have on their models are costing you lives.
What do you even do then?
AI will always have this issue where it will always pick the worst option for genuinely good requests.
Because the military doesn't give soldiers rifles with guard rails. They give the soldiers intense, rigid training, and then try to enforce discipline and correct use socially.
If an LLM is going to be important in that way (this seems like a very contrived way,) then it's in the interest of the LLM's host to make sure it doesn't have guard rails that would get in the way _that_ way.
I've used glm 5.1 on fairly advanced crackme challenges (example: https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82), and to my suprise it was able to patch binaries, doing runtime analysis, bypassing anti debug techniques, etc.
Expecting the model to do everything by itself is unrealistic, I found that working along the modal works really well. I'm not speaking about spoiling the solution, just tell it which direction to explore. Chinese models are much more capable than people give it credit for, but Claude/Codex won the marketing game.
The only usecase of this methodology would be for CI integration, which can be nice but I think security reviews still need human attention and expertise.
I'm very curious how you would do multiple runs of multiple models in a "work alongside the model" manner?
every model since gpt3 was claimed to be "too dangerous to release." it's too EXPENSIVE to release, and you're probably a local model with <10B parameters yourself