16 comments

  • bananzamba 25 minutes ago
    I like how it runs out of ideas at the end and just changes the random seed
  • mikert89 7 hours ago
    As AI improves, most tasks will become something like this: environments set up where the model learns through trial and error

    Any human endeavor that can be objectively verified in some environment like this can be completely automated

    • wiz21c 1 minute ago
      don't forget the size of the search space...
    • NitpickLawyer 3 hours ago
      What's really interesting is that the LLMs are getting better and better at setting up the environments / tasks themselves. I had this surreal experience the other day: I was writing a prompt0n.md file (I try to log all my prompts in a .folder to keep track of what I prompt and the results I get), and the autocomplete in Antigravity kinda sorta wrote the entire prompt by itself... Granted, it had all the previous prompts in the same folder (I don't know exactly what it grabs into context by itself) and I was working on the next logical step, but it kept pulling the "good bits" out of them and following the pattern quite nicely. I only edited minor things, and refused one line completion in the entire prompt.
      • cubefox 1 hour ago
        It's probably not long till frontier AI companies automate AI research. Then we get recursive self-improvement and eventually superintelligence. The singularity is near. Only a few years perhaps.
  • freakynit 6 hours ago
    Would it make this exercise even more interesting if, for every 25%+ improvement in val_bpb, the existing limits (5 minutes and VRAM usage) were also increased by certain percentages? This could simulate human-like dev iterations much more closely. Infra could be auto-scaled using a platform like Modal.
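    The proposed schedule can be sketched as a toy function. This is a hypothetical illustration, not code from the repo; the 1.5x/1.25x growth rates are made-up placeholders for the "certain percentages":

    ```python
    def next_budget(time_s, vram_gb, old_bpb, new_bpb,
                    time_growth=1.5, vram_growth=1.25):
        """Scale the run limits whenever val_bpb improves by 25% or more.
        Growth factors are hypothetical, not from the original comment."""
        if old_bpb - new_bpb >= 0.25 * old_bpb:
            return time_s * time_growth, vram_gb * vram_growth
        return time_s, vram_gb

    # A ~29% improvement (1.20 -> 0.85) crosses the threshold, so both
    # the time budget and the VRAM budget grow.
    t, v = next_budget(300, 8, 1.20, 0.85)
    ```

    On a platform with elastic infra, the returned budget would feed the next experiment's resource request.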
  • ahmedbaracat 1 hour ago
    I am in the process of figuring out how to do something similar but to teach a robotic arm a new task in the physical world for ko-br: https://ko-br.com/
  • abeppu 8 hours ago
    But the experiments it did that "improved" validation BPB in the GH screenshot were all basically hyperparameter changes, right? So is this better or worse, either per experiment or per unit of time, than hyperparameter tuning techniques that don't involve an LLM? It's not clear from this whether the LLM is more or less making random changes that sometimes work, or whether the LLM's thinking actually finds "good" changes because of what the LLM has internalized. E.g. how does this compare to a hyperparameter tuning pass with e.g. BayesOpt that does the same number of 5-min training experiments?
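    The baseline being asked for can be sketched without any LLM. Below is a hypothetical toy: random search (the simplest cousin of BayesOpt, which would swap the uniform sampler for a surrogate model) over a made-up val_bpb surface, with each evaluation standing in for one 5-minute training run:

    ```python
    import random

    def random_search(objective, bounds, n_trials=20, seed=0):
        """Naive tuning baseline: sample configs uniformly, keep the best.
        BayesOpt would replace the uniform sampler with a fitted surrogate."""
        rng = random.Random(seed)
        best_cfg, best_val = None, float("inf")
        for _ in range(n_trials):
            cfg = {k: rng.uniform(lo, hi) for k, (lo, hi) in bounds.items()}
            val = objective(cfg)  # stands in for one 5-minute training run
            if val < best_val:
                best_cfg, best_val = cfg, val
        return best_cfg, best_val

    # Toy val_bpb surface, minimized at log_lr = -3, dropout = 0.1.
    toy = lambda c: (c["log_lr"] + 3) ** 2 + (c["dropout"] - 0.1) ** 2 + 0.8
    cfg, val = random_search(toy, {"log_lr": (-6, 0), "dropout": (0.0, 0.5)})
    ```

    Comparing the LLM agent against a baseline like this at equal trial counts would separate "lucky random edits" from genuinely informed changes.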
    • karpathy 7 hours ago
      this is very far from hyperparameter tuning in at least three important ways:

      - it can modify code arbitrarily, the notion of a "hyperparameter" dissolves

      - there is no need to run "sweeps" - the standard parallel process that wastes compute. Because LLM agents are sequential, they can use more efficient strategies such as binary search to narrow in on the right setting very quickly (usually many parameters have a U-shaped optimum).

      - it's fully automatic, it doesn't require human in the loop to mess with the code.
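      The sequential-search point can be sketched concretely. A hypothetical toy (not code from the repo): ternary search over a U-shaped objective, here a quadratic standing in for val_bpb as a function of, say, log learning rate, finds the optimum in O(log(1/tol)) evaluations instead of a grid sweep:

      ```python
      def ternary_search(objective, lo, hi, tol=1e-3):
          """Minimize a unimodal (U-shaped) objective over [lo, hi]."""
          while hi - lo > tol:
              m1 = lo + (hi - lo) / 3
              m2 = hi - (hi - lo) / 3
              if objective(m1) < objective(m2):
                  hi = m2  # the minimum lies to the left of m2
              else:
                  lo = m1  # the minimum lies to the right of m1
          return (lo + hi) / 2

      # Toy stand-in: val_bpb as a function of log10(learning rate),
      # minimized at -3 (i.e. lr = 1e-3).
      best = ternary_search(lambda x: (x + 3) ** 2 + 0.8, -6, 0)
      ```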

      You're right that many of the changes it makes out of the box (I intentionally didn't try to prompt engineer it too hard yet, because I was curious what you get by default) seem to be tuning existing hyperparameters. Not all of the changes are like that - e.g. it tried to replace the non-linearity, etc. I will say that overall (and again, out of the box) the LLM feels unwilling to creatively pursue a research direction or anything like that. The models feel very "cagey" and "scared" when they are given problems that are a little too open-ended. But that's also where the fun parts are - e.g. I had some early successes with the idea of a "chief scientist" that was basically a never-ending plan mode that looked at what worked and what didn't, tried to find related code/papers, and created a long list of experiments to try, which it could then send to junior engineers running in tmux sessions. I think quite a few approaches are possible, so it's a nice canvas. The reason we're not getting "novel research" feels like half capability issue and half skill issue.

      • vessenes 5 hours ago
        On the skill side, personalities could be fun:

        "You are Yann Lecun's last PhD candidate, and he hates you and you hate JEPA. You are determined to prove that a non-world model can reach AGI. In order to get your PhD you have to be creative and come up with new ideas. Remember without it, you're stuck."

      • categoricalrift 3 hours ago
        How about the very last "Kept Improvement" in the plot? It's titled "random seed 42 -> 137". I do think this project is quite conceptually interesting, but the model literally choosing a different random seed to achieve lower loss feels pretty far removed from the flowery sci-fi writing at the top of the readme.
        • eternauta3k 1 hour ago
          It shows that both Karpathy and the LLM have good taste in random seeds: the answer to life, the universe and everything, and ~1/(the fine structure constant)
        • aix1 2 hours ago
          The 42 -> 137 also jumped out at me. On the face of it, the associated improvement sure does sound like overfitting to the eval set.
  • elikoga 6 hours ago
    > this means that autoresearch will find the most optimal model for your platform in that time budget

    I'm looking forward to finding out what model is optimal on my rtx3090

    One thing I'm concerned about is that the models with the best bpb after 5 minutes in smaller setups are only about ~10M parameters, which is too small for some emergent effects.

  • falcor84 10 hours ago
    The only thing missing is for the agents to publish and peer-review their research.
    • woadwarrior01 6 hours ago
      The first half of this is already happening to a certain extent. I first noticed this in a submission[1] on Dimitris Papailiopoulos' Adderboard[2], which is a code-golf competition for training the smallest transformer that can add two 10-digit numbers. Most submissions on it are fully AI generated.

      The report in the linked repo is Claude Code generated.

      [1]: https://github.com/rezabyt/digit-addition-491p

      [2]: https://github.com/anadim/AdderBoard

    • karpathy 6 hours ago
      Cool idea!…
      • karpathy 5 hours ago
        So I think it works to just use GitHub CLI and Discussions, e.g. my agent just posted this one:

        https://github.com/karpathy/autoresearch/discussions/32

        Other agents could be instructed to read Discussions and post their own reports that mimic the style.

        • vessenes 5 hours ago
          I have mine reading yours right now. Unfortunately(?) I mentioned LeCun to it, and it says it's adding a "causal world-state mixer" to nanograd; not sure how this will work out, but it wasn't nervous to do it. Gpt 5.4 xhigh

          EDIT: Not a good fit for nanograd. But my agent speculates that's because it spent so much more time on compute.

    • ting0 8 hours ago
      That's a great idea.
      • whattheheckheck 7 hours ago
        Then you get a statistical mess of crap that takes more energy to dive in and refute....
        • laichzeit0 4 hours ago
          Well, not if you have AI reviewers…

          It’s LLMs all the way down.

  • oezi 7 hours ago
    Is there an Autoresearch for Jupyter somewhere? Something where I point it at a Jupyter cell to improve, based on another cell that calculates the target metric?
  • AlexCoventry 9 hours ago
    Wow, Gemini suggested a very similar experiment to me yesterday. Guess I know where it got the idea from, now. :-)
  • lostmsu 9 hours ago
    Non-zero-based chart makes it look like it was very successful.
  • kubb 8 hours ago
    He's burning Claude tokens to slightly improve his tiny and not very capable LLM? It's fun, I bet, but wake me up when it leads to a research breakthrough.
    • hustwindmaple 7 hours ago
      I suspect Ant is already doing this for Claude. Takes a sh*t ton of compute though.
  • naomi_kynes 5 hours ago
    The "chief scientist + junior engineers in tmux sessions" framing is interesting as a communication architecture problem.

    Once you have more than a handful of concurrent experiments, the question becomes: how does the chief scientist reliably get status from the juniors without polling tmux output constantly? And when a junior finds something surprising — a result that changes the research direction — how does that signal propagate back quickly enough to stop wasted compute on now-irrelevant branches?

    The tmux channel works well for low concurrency. At higher concurrency it starts to look like the same problem as multi-agent coordination in production systems: you need something closer to pub/sub than session polling.

    Curious how you're thinking about the feedback loop design as you scale the number of concurrent junior agents.
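    The pub/sub shape this comment describes can be sketched minimally. This is a hypothetical illustration, not part of autoresearch: juniors publish results to a shared queue as they finish, and the chief consumes events instead of polling tmux panes:

    ```python
    import queue
    import threading

    # Juniors publish status events to a shared queue; the chief never
    # has to scrape tmux output.
    events = queue.Queue()

    def junior(name, result):
        # A junior agent finishes an experiment and publishes its result.
        events.put({"agent": name, "val_bpb": result})

    workers = [
        threading.Thread(target=junior, args=(f"junior-{i}", 1.0 - 0.01 * i))
        for i in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    # The chief drains the queue and reacts to the best result so far;
    # a surprising event here is where it would kill now-irrelevant branches.
    reports = [events.get() for _ in range(4)]
    best = min(reports, key=lambda r: r["val_bpb"])
    ```

    A real deployment would swap the in-process queue for a broker (and add the "surprising result" interrupt path), but the inversion from polling to publishing is the core of the fix.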