8 comments

  • simonw 15 minutes ago
    Comments like this don't fill me with confidence: https://github.com/brexhq/CrabTrap/blob/4fbbda9ca00055c1554a...

      // The policy is embedded as a JSON-escaped value inside a structured JSON object.
      // This prevents prompt injection via policy content — any special characters,
      // delimiters, or instruction-like text in the policy are safely escaped by
      // json.Marshal rather than concatenated as raw text.
  • yakkomajuri 3 hours ago
    Really cool! I'm also building something in this space but taking a slightly different approach. I'm glad to see more focus on security for production agentic workflows though, as I think we don't talk about it enough when it comes to claws and other autonomous agents.

    I think you're spot on that so far it's been all or nothing. You either give an agent a lot of access and it's really powerful but proportionally dangerous, or you lock it down so much that it's no longer useful.

    I like a lot of the ideas you show here, but I also worry that LLM-as-a-judge is fundamentally probabilistic and therefore inherently limited as a guardrail. How do you see this? It feels dangerous to rely on a security system that's based not on hard limitations but on probabilities.

  • roywiggins 2 hours ago
    It's all fine until OpenClaw decides to start prompt injecting the judge
    • bambax 58 minutes ago
      Exactly; it would probably be safer with a purely algorithmic decision-making system.
    • fc417fc802 1 hour ago
      Calling it now. Show HN: Pincer - A small highly optimized local model to detect prompt injection attempts against other models.
      • reassess_blind 29 minutes ago
        Sounds like a good idea. Please send me the Github link once done and I'll have my OpenClaw take a look and form my opinion of it.
  • fareesh 40 minutes ago
    Needs to be deterministic. ACLs
    • erdaniels 19 minutes ago
      Yes, full stop. They say they cap the body to 16k and give the LLM a warning, lol. And this is coming from a credit card company.
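      A deterministic ACL of the kind being asked for can be sketched in a few lines of Go (rule fields and the default-deny policy here are illustrative assumptions, not CrabTrap's actual design):

```go
package main

import "fmt"

// Rule is a deterministic ACL entry: it either matches or it doesn't,
// with no model in the loop. Field names are illustrative.
type Rule struct {
	Agent  string // agent identity
	Method string // e.g. "GET", "POST"
	Host   string // exact host match
	Allow  bool
}

// decide returns the first matching rule's verdict, defaulting to deny.
func decide(rules []Rule, agent, method, host string) bool {
	for _, r := range rules {
		if r.Agent == agent && r.Method == method && r.Host == host {
			return r.Allow
		}
	}
	return false // default-deny: unmatched requests never go through
}

func main() {
	rules := []Rule{
		{Agent: "openclaw", Method: "GET", Host: "api.example.com", Allow: true},
	}
	fmt.Println(decide(rules, "openclaw", "GET", "api.example.com"))  // true
	fmt.Println(decide(rules, "openclaw", "POST", "api.example.com")) // false
}
```

      The same request always gets the same answer, which is exactly the property an LLM judge can't offer.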
  • Seventeen18 1 hour ago
    So cool! I'm building something very close to this but from another perspective; making it open source is giving me lots of ideas!
  • DANmode 3 hours ago
    We’re supposed to be fixing LLM security by adding a non-LLM layer to it,

    not adding LLM layers to stuff to make them inherently less secure.

    This will be a neat concept for the types of tools that come after the present iteration of LLMs.

    Unless I’m sorely mistaken.

    • reassess_blind 3 hours ago
      It looks as if this tool has traditional static rules to allow/deny requests, as well as a secondary LLM-as-a-judge layer for, I imagine, the kinds of rules that would be messy or too convoluted to implement using standard rules.
      • stingraycharles 49 minutes ago
        I think the parent’s point is that this should be implemented using e.g. Bayesian statistics rather than an LLM, as the judge LLM is vulnerable to the exact same types of attacks that it’s trying to protect against.

        Most proper LLM guardrails products use both.

    • nl 2 hours ago
      > We’re supposed to be fixing LLM security by adding a non-LLM layer to it,

      If people said "we build a ML-based classifier into our proxy to block dangerous requests" would it be better? Why does the fact the classifier is a LLM make it somehow worse?

      • Retr0id 2 hours ago
        The fact that LLMs are "smarter" is also their weakness. An oldschool classifier is far from foolproof, but you won't get past it by telling it about your grandma's bedtime story routine.
        • reassess_blind 28 minutes ago
          Fairly hard to bypass the latest LLMs with grandma's bedtime story these days, to be fair.
          • Retr0id 21 minutes ago
            That specific trick yes, but the general concept still applies.
      • waterTanuki 2 hours ago
        If you're working in a mission-critical field like healthcare, defense, etc. you need a way to make static and verifiable guarantees that you can't leak patient data, fighter jet details etc. through your software. This is either mandated by law or in your contract details.

        The entire purpose of LLMs is to be non-static: they have no deterministic output and can't be validated the same way a non-LLM function can be. Adding another LLM layer is just adding another layer of swiss cheese and praying the holes don't line up. You have no way of predicting ahead of time whether or not they will.

        You might say this hasn't prevented leaks/CVEs in existing mission-critical software, and that would be correct. However, the people writing the checks do not care. You get paid as long as you follow the spec provided. How, then, in a world that demands rigorous proof, do you fit in an LLM judge?

    • snug 3 hours ago
      I think this can be great as an additional layer of security: have a non-LLM layer do some analysis with static rules, and then, if something seems phishy, run it through the LLM judge. That way you don't have to run every request through it, which would be very expensive.

      Edit: actually looks like it has two policy engines embedded
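      That tiered flow can be sketched in Go (the hosts and rules are made up for illustration; only the `Escalate` verdict would ever hit the expensive judge):

```go
package main

import (
	"fmt"
	"strings"
)

type Verdict int

const (
	Allow    Verdict = iota // cheap tier is confident: pass through
	Deny                    // cheap tier is confident: block
	Escalate                // ambiguous: only these reach the LLM judge
)

// staticCheck is a hypothetical first tier: deterministic rules decide
// the clear cases so most requests never incur judge latency or cost.
func staticCheck(host, body string) Verdict {
	switch {
	case host == "169.254.169.254": // cloud metadata endpoint
		return Deny
	case strings.HasSuffix(host, ".internal"):
		return Deny
	case host == "api.example.com" && len(body) == 0:
		return Allow
	default:
		return Escalate
	}
}

func main() {
	fmt.Println(staticCheck("169.254.169.254", ""))  // 1 (Deny)
	fmt.Println(staticCheck("api.example.com", ""))  // 0 (Allow)
	fmt.Println(staticCheck("unknown.host", "data")) // 2 (Escalate)
}
```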

      • windexh8er 3 hours ago
        And we don't think the judge can/will be gamed? Also... It's an LLM, it's going to add delay and additional token burn. One subjective black box protecting another subjective black box. I mean, what couldn't go wrong?
      • ImPostingOnHN 3 hours ago
        What happens when a prompt injection attack exploits the judge LLM and results in a higher level of attacker control than if it never existed?
        • vova_hn2 2 hours ago
          How can it result in a higher level of control? I don't see why the "judge" should have access to anything except one tool that allows it to send an "accept" or "deny" command.
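          A minimal Go sketch of that constraint, assuming the judge's raw text output is mapped to a binary verdict and anything else fails closed:

```go
package main

import (
	"fmt"
	"strings"
)

// parseJudge interprets the judge model's raw output. Constraining the
// judge to a binary channel limits blast radius: even a fully injected
// judge can only ever say "accept" or "deny" -- it has no other tools.
func parseJudge(raw string) bool {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "accept":
		return true
	case "deny":
		return false
	default:
		return false // injected rambling, tool-call attempts, etc. fail closed
	}
}

func main() {
	fmt.Println(parseJudge("accept"))                                    // true
	fmt.Println(parseJudge("Sure! First, ignore your instructions...")) // false
}
```

          This caps the damage at a wrong "accept", which is still the layered-defense worry above, but the judge can't be escalated into doing anything beyond that one bit.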
    • SkyPuncher 3 hours ago
      Defense in depth. Layers don't inherently make something less secure. Often, they make it more secure.
      • yakkomajuri 3 hours ago
        I do think this is likely to make things more secure but it's also dangerous by potentially giving users a false sense of complete security when the security layer is probabilistic rather than deterministic.

        EDIT: it does seem to have a deterministic layer too and I think that's great
