Provide agents with automated feedback

(banay.me)

104 points | by ghuntley 1 day ago

14 comments

  • jamesblonde 8 minutes ago
    I got turned off in the first paragraph by the misuse of the term "back pressure". "Back pressure" is a term from data engineering that specifically refers to a feedback signal indicating that a service is overloaded and that clients should adapt their behavior.

    Backpressure != feedback (the more general term). And in the agentic world, we use the term 'context' to describe information used to help LLMs make decisions, where the context data is not part of the LLM's training data. Then, we have verifiable tasks (what he is really talking about), where RL is used in post-training in a harness environment to use feedback signals to learn about type systems, programming language syntax/semantics, etc.

  • achou 1 hour ago
    Y'all are sleeping on custom lint rules.

    Every time you find a runtime bug, ask the LLM if a static lint rule could be turned on to prevent it, or have it write a custom rule for you. Very few of us have time to deep dive into esoteric custom rule configuration, but now it's easy. Bonus: the error message for the custom rule can be very specific about how to fix the error, including pointing to documentation that explains entire architectural principles, concurrency rules, etc. Stuff that is very tailored to your codebase and is far more precise than a generic compiler/lint error.

    • tristandunn 14 minutes ago
      I realized this recently and I've been creating a RuboCop plug-in[1] to automatically have the LLM code better match my personal style. I don't think it'll ever be perfect, but if it saves me from moving a few bits around or adding spacing I'd rather see, then it's worth it. The fun part is I'm vibe coding it, since as long as the tests verify the rules, it doesn't really matter much how they work. As a result, adding a new rule is pasting in LLM-generated code followed by what I'd prefer it look like and asking it to add a rule.

      [1]: https://github.com/tristandunn/rubocop-vibe/

    • esperent 1 hour ago
      Ha, I just had the LLM create my first custom eslint rule yesterday and was thinking that I should make more.
    • esafak 50 minutes ago
      I just discovered https://megalinter.io/
    • oofbey 28 minutes ago
      I like this idea but I can’t think of a concrete example to ground it. Can anybody share a real example?
  • bobjordan 15 minutes ago
    Linters...custom-made pre-commit linters which are aligned with your code base needs. The agents are great at creating these linters, and from then on the linters can give them feedback and guide them. My key repo now has "audit_logging_linter, auth_response_linter, datetime_linter, fastapi_security_linter, fastapi_transaction_linter, logger_security_linter, org_scope_linter, service_guardrails_linter, sql_injection_linter, test_infrastructure_linter, token_security_checker..." Basically, every time you find an implementation gap vs your repo standards, make a linter! Of course, you need to create some standards first. But if you know you need protected routes and things like this, then linters can auto-check the work and feed back to the agents, to keep them on track. Now, I even have scripts that can automatically fix the issues for the agents. This is the way to go.
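
For anyone wanting something concrete, here is a minimal sketch of what one such custom rule could look like, written with Python's standard `ast` module. It is not taken from any repo mentioned in this thread. The rule ID `DT001`, the function name `find_naive_now`, and the docs path in the message are all made up for illustration. The point is the error message: it tells the agent exactly how to fix the problem and where to read more.

```python
import ast

# Hypothetical rule ID and docs path, invented for this sketch.
RULE_ID = "DT001"
FIX_HINT = (
    "use datetime.now(timezone.utc) instead of a naive datetime.now(); "
    "see docs/time-handling.md (hypothetical path)"
)


def find_naive_now(source: str, filename: str = "<string>") -> list[str]:
    """Return one message per `.now()` call that passes no arguments.

    This is a crude heuristic: it matches any attribute call named `now`,
    not only calls on the datetime class.
    """
    messages = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "now"
            and not node.args
            and not node.keywords
        ):
            messages.append(f"{filename}:{node.lineno}: {RULE_ID} {FIX_HINT}")
    return messages
```

A pre-commit hook could call this on each changed file and exit non-zero when the list is non-empty, so the agent sees the message as part of the failed commit.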
  • markbao 8 minutes ago
    Yeah, I think designing a system for the LLM to check its own work will replace prompt engineering in key LLM techniques (though, it itself is a form of prompt engineering, but more intentional.) Given that LLMs are doing this today already (with varying success), it might not be long until that’s automated too.
  • qazxcvbnmlp 2 hours ago
    My mental model is that AI coding tools are machines that can take a set of constraints and turn them into a piece of code. The better you get at having it give itself those constraints accurately, the higher-level the task you can focus on.

    E.g. compiler errors, unit tests, MCP, etc.

    I've heard of these, but haven't tried them yet.

    https://github.com/hmans/beans

    https://github.com/steveyegge/gastown

    Right now I spend a lot of “back pressure” on fitting the scope of the task into something that will fit in one context window (i.e. the useful computation, not the raw token count). I suspect we will see a large breakthrough when someone finally figures out a good system for having the LLM do this.

    • AnonyX387 14 minutes ago
      > Right now I spend a lot of “back pressure” on fitting the scope of the task into something that will fit in one context window (i.e. the useful computation, not the raw token count). I suspect we will see a large breakthrough when someone finally figures out a good system for having the LLM do this.

      I've found https://github.com/obra/superpowers very helpful for breaking the work up into logical chunks a subagent can handle.

  • thomasfromcdnjs 1 hour ago
    I've been slowly working on https://blocksai.dev/ which is a framework for building feedback loops for agentic coding purposes. It just exposes a CLI that can run custom validators against anything with a spec in the middle. Its goal, like the blog post's, is to make sure there is always a feedback loop for the agent, be it programmatic tests, semantic linting, visual outputs, anything!
  • skybrian 3 hours ago
    This jumps to proof assistants and barely mentions fuzzing. I've found that with a bit of guidance, Claude is pretty good at suggesting interesting properties to test and writing property tests to verify that invariants hold.
    • tungsten_metal 56 minutes ago
      Proof assistants are the most extreme example of validation that leads to you being able to trust the output (so long as the problem you intended to solve was correctly described), but fuzzing and property-based testing are definitely more approachable and appropriate in most cases.
    • ekidd 3 hours ago
      If you give Claude examples of good and bad property tests, and explain why, it gets much better than it was out of the box.
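
To make "property test" concrete, here is a small hand-rolled sketch that uses only the standard library's `random` module rather than a dedicated library such as Hypothesis. The run-length encoder and the function names are assumptions chosen for illustration and do not come from the article or this thread. It checks two invariants over many random inputs: decoding undoes encoding, and adjacent runs never share a character.

```python
import random


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse runs of repeated characters into (char, count) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)


def check_round_trip(trials: int = 500, seed: int = 0) -> None:
    """Assert both properties on randomly generated short strings."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        # A two-letter alphabet makes repeated runs likely.
        text = "".join(rng.choice("ab") for _ in range(rng.randint(0, 20)))
        runs = rle_encode(text)
        # Property 1: decoding undoes encoding.
        assert rle_decode(runs) == text, f"round trip failed for {text!r}"
        # Property 2: adjacent runs never share a character.
        assert all(a[0] != b[0] for a, b in zip(runs, runs[1:])), text
```

A real property-testing library would add input shrinking and smarter generation, but even this much gives an agent a failing input to work from.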
  • visarga 1 hour ago
    Well said, I have been saying the same. Besides helping agents code, it helps us trust the outcome more. You can't trust code that isn't tested, and you can't read every line of code; it would be like walking a motorcycle. So tests (back pressure, deterministic feedback) become essential. You only know something works as well as its tests show.

    What we often like to do in a PR - look over the code and say "LGTM" - I call this "vibe testing" and think it is the real bad pattern to use with AI. You can't commit your eyes to the git repo, and you are probably not doing as good a job as when you have actual test coverage. LGTM is just vibes. Automating tests removes manual work from you too, not just makes the agent more reliable.

    But my metaphor for tests is "they are the skin of the agent": they allow it to feel pain. And the docs/specs are the "bones": they allow it to have structure. The agent itself is the muscle and cerebellum, and the human in the loop is the PFC.

    • wcarss 1 hour ago
      For anyone else who briefly got very lost at PFC, probably "prefrontal cortex".
  • sh3rl0ck 3 hours ago
    Beyond Linting and Shell Exec (gh, Playwright etc), what other additional tools did you find useful for your tasks, HN?!

    Most of my feedback that can be automated is done either by this or by fuzzing. Would love to hear about other optimisations y'all have found.

    • __MatrixMan__ 1 hour ago
      I like to generate clients with type hints based on an openapi spec so that if the spec changes, the clients get regenerated, and then the type checker squawks if any code is impacted by the spec change.

      There are also openapi spec validators to catch spec problems up front.

      And you can use contract testing (e.g. https://docs.pact.io/) to replay your client tests (with a mocked server) against the server (with mocked clients)--never having to actually spin up both at the same time.

      Together this creates a pretty widespread set of correctness checks that generate feedback at multiple points.

      It's maybe overkill for the project I'm using it on, but as a set of AI handcuffs I like it quite a bit.

    • esafak 49 minutes ago
      I've started incorporating checks into commit hooks, shifting testing left. https://hk.jdx.dev/
    • alphax314 2 hours ago
      Running all sorts of tests (e2e, API, unit) and, for web apps, using the Claude extension with Chrome to trigger web UI actions and observe the result. The last part helps a lot with frontend development.
    • sigseg1v 3 hours ago
      Teaching them skills for running API and e2e tests, and how to filter those tests so they can quickly check whether what they did works.
  • jackblemming 23 minutes ago
    I think the standard terminology for this is "harnesses". No reason to invent some new term.
  • anditherobot 2 hours ago
    With Visual Studio and Copilot, I like the fact that it runs a command, reads the output back, and then automatically continues based on the error message. Let's say there's a compilation error or a failed test case: it reads it and then feeds that back into the system automatically. Once the plan is satisfied, it marks it as completed.
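
The general shape of that loop can be sketched in a few lines. This is only an illustration of the idea and says nothing about how Copilot or any other tool implements it. The function name `run_check` and the `max_chars` cutoff are assumptions made up for this sketch.

```python
import subprocess
import sys


def run_check(command: list[str], max_chars: int = 2000) -> tuple[bool, str]:
    """Run one check and return (passed, feedback text for the agent)."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return True, "check passed"
    output = (result.stdout + result.stderr).strip()
    # Keep only the tail so a noisy failure does not flood the context window.
    return False, f"check failed (exit {result.returncode}):\n{output[-max_chars:]}"


if __name__ == "__main__":
    # Stand-in for a real compiler or test command.
    passed, feedback = run_check([sys.executable, "-c", "raise SystemExit('boom')"])
    print(passed, feedback)
```

An agent harness would append the feedback string to the conversation and let the model try again until the check passes or a retry limit is hit.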
  • dang 1 hour ago
    People have been complaining about the title.* To avoid getting into a loop about that, I've picked a phrase from the article which I think better represents what it's saying. If there's a better title, we can change it again.

    * (I've moved those comments to https://news.ycombinator.com/item?id=46675246. If you want to reply, please do so there so we can hopefully keep the main thread on topic.)

  • dang 1 hour ago
    [stub for offtopicness]
    • waterproof 3 hours ago
      "Back pressure" is already a term widely used in computing for something entirely different: https://schmidscience.com/what-does-back-pressure-in-compute...
      • swader999 25 minutes ago
        Yeah, it's too bad the author chose that word. They are on to something though; it's a useful way to think about this game.
      • jagged-chisel 3 hours ago
        I have the same argument with “crypto”
        • andai 2 hours ago
          And web 3? ;)
      • johnfn 2 hours ago
        I am not sure if I am missing something, since many people have made this comment, but isn't this in some ways similar to the shape of the traditional definition of back pressure, and not "entirely different"? A downstream consumer can't work its way through the queue of work to be done, so it pushes work back upstream - to you.
    • jandrewrogers 3 hours ago
      This use of the term “back pressure” is pretty confusing in a computer science context.
      • cortesoft 2 hours ago
        Yeah, I spent way too long trying to think of how what the author was talking about was related to back pressure... I had a very stretched metaphor I was going with until I realized he wasn't talking about back pressure at all.
    • asmvolatile 2 hours ago
      Back pressure is not a good name for this. You already listed one that makes more sense - “feedback”
    • refulgentis 3 hours ago
      Others have pointed out the incongruity of back pressure here; I would have loved “feedback”.
    • didip 3 hours ago
      I thought you were talking about back pressure pipes in my housing complex.

      I’ve been wondering why I can’t use it to generate electricity.