I hate compilers

(xeiaso.net)

94 points | by xena 6 hours ago

18 comments

  • edude03 3 minutes ago
    Could have sworn the author was a nix(os) user already. I know it’s a meme, but all the problems they’re describing are literally solved by nix. The nix sandbox even catches calls for the time, for example, and replaces the result with 0 for determinism.
  • antirez 11 minutes ago
    So to keep those energy-hungry LLM companies from scraping your website, you force each browser to compute a lot of hashes in a necessarily energy-hungry loop, creating, at the same time, all kinds of accessibility problems?
  • inigyou 41 minutes ago
    Better title: Reproducible builds are hard
  • randusername 28 minutes ago
    I've seen posts by this author before and did not understand if the commentary characters were referential or a creation of the author. Turns out it's the latter. I dismissed the underlined names as just styling, not hyperlinks.

    https://xeiaso.net/characters/

  • jdw64 6 hours ago
    Reading this, I think low-level engineering is actually more dependent on specific environments. Hardware also has its own points of change. Usually, when you think at a high level, environmental changes are less significant than you might expect. But low-level thinking tends to be tied to specific environments, which is what makes it difficult. Low level is hard because, even if the code itself is short, the hidden assumptions inside it place a heavy cognitive load on the programmer. For example, even a short snippet in C like `int value = (int)buffer` requires a lot of implicit knowledge, such as whether the buffer is 4-byte aligned and whether int is exactly 32 bits. LLMs do not seem to be very good at knowing these things. They are strong at high-level wrapping, but at the low level they seem to struggle surprisingly and are somewhat useless. Hardware has CPU generation changes, and in the case of PLCs, where I mainly work, the protocol differences between vendors are far too severe. There does not seem to be any technology with a very long lifecycle.
    • jstimpfle 5 hours ago
      Depends on what you mean by low level I guess. Compared to web application framework churn rate, simple procedural programming without many dependencies is remarkably stable. You tend to program in a way that works for most platforms (all targeted platforms). How to best do that you learn over the years. To me personally it's very refreshing if the environment around you does not constantly change. That affords learning a bag of tricks and a list of gotchas to avoid.
      • embedding-shape 1 hour ago
        > You tend to program in a way that works for most platforms (all targeted platforms).

        Isn't that true for web frameworks too? Usually they'll only target unix, but if they target windows and macos, then they work on those platforms too? Or am I misunderstanding what you're trying to say here?

        • jstimpfle 1 hour ago
          This is how I mean it: In case of low level programming, the "platform" is the hardware/OS/compiler. In case of web programming, the "platform" is the web framework.

          If you update the OS, hardware, or compiler, you will see only a few changes. If you update the web framework, you may see breakages, API deprecations or whatever. You may want to move to a different web framework entirely. TBH I don't really know, I don't know web programming beyond basic HTML/JavaScript. That's what they say, though.

          • embedding-shape 1 hour ago
            Well I mean you're comparing two different solutions at different layers here.

            In the case of a desktop application, unless you build things against OS libraries, your "platform" is also typically a framework, like QT or AppKit or whatever you end up using. That's the equivalent of the "web framework" in the web world.

            Basically, it goes "Your app > GUI framework > other/OS libraries" for desktop apps, "Your app > web framework > other/OS libraries" for web applications.

            Then in both approaches you can of course skip the framework if you want, no one is forcing you to use those in either of the cases.

            Edit: I realize now we might be talking past each other, I was under the understanding that "web framework" is about backend web frameworks, but maybe you actually meant frontend frameworks running client-side. If so, replace "other/OS libraries" with "browser runtime" and my comment more or less still makes sense :)

            • jstimpfle 44 minutes ago
              > your "platform" is also typically a framework, like QT or AppKit or whatever you end up using

              That's not what I consider "low level programming". I don't use any of these.

              Yes, you can try to do plain JavaScript. Honestly, JavaScript is a much less pleasurable environment than a compiled, statically typed procedural language. The main advantage of the browser is that you get a viewport, you get font rendering etc. with almost no setup required at all.

          • AnimalMuppet 1 hour ago
            More: If you upgrade the hardware or the compiler, you upgrade them. If you're doing web programming, you have to worry about the user upgrading their browser.
            • jstimpfle 47 minutes ago
              It's not so much about the browser (I'm not aware of major incompatibilities introduced by new browsers or new W3C standards). It's more about the software ecosystems (like frameworks, or node.js) that web people rely on in order to create their web apps.
      • jdw64 4 hours ago
        I think you're right too. So I also think that maybe I'm viewing the changes as bigger than they actually are, based on my own standards
    • Dwedit 28 minutes ago
      Looks like the formatting ate your asterisks at *(int*)buffer. Use \* to get an asterisk.
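
      To make the hidden assumptions behind `*(int*)buffer` concrete, here is a minimal sketch in C. The cast silently assumes that `buffer` is suitably aligned, that `int` is 32 bits wide, and that the bytes are stored in the host's byte order. The helper names `read_i32_le` and `read_i32_host` are made up for illustration; they spell out the width and byte order and avoid the alignment requirement.

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <string.h>

      /* Hypothetical helper: read a 32-bit little-endian integer from a byte
       * buffer at any alignment. Width and byte order are explicit instead
       * of being implied by `int` and the host CPU. */
      int32_t read_i32_le(const unsigned char *buffer)
      {
          assert(buffer != NULL);
          uint32_t bits = (uint32_t)buffer[0]
                        | ((uint32_t)buffer[1] << 8)
                        | ((uint32_t)buffer[2] << 16)
                        | ((uint32_t)buffer[3] << 24);
          int32_t value;
          memcpy(&value, &bits, sizeof value);   /* reinterpret the bit pattern */
          return value;
      }

      /* Hypothetical helper: host byte order, but still no alignment
       * requirement, because memcpy copies byte by byte. */
      int32_t read_i32_host(const unsigned char *buffer)
      {
          assert(buffer != NULL);
          int32_t value;
          memcpy(&value, buffer, sizeof value);
          return value;
      }
      ```

      Calling `read_i32_le(packet + 1)` is fine even though `packet + 1` is almost certainly misaligned for a 4-byte load; the equivalent cast-and-dereference would be undefined behaviour in C.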
  • crvdgc 5 hours ago
    Nix also needs the build output to be deterministic to calculate the hash. It also has the problems of timestamps etc. The build environment tries to be hermetic by setting the time to the epoch, among other things.
    • mplanchard 47 minutes ago
      Yes, reading this I was thinking about how many of these problems go away with a nix environment. Certainly not all of them, but it’s a great way to get a reproducible build environment that includes direct specification of system dependencies.
    • lloeki 4 hours ago
      SOURCE_DATE_EPOCH is not a Nix thing

      https://reproducible-builds.org/docs/source-date-epoch/

      (although Nix sets it as a default)
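
      For readers who have not used it: SOURCE_DATE_EPOCH is an environment variable holding an integer number of seconds since the Unix epoch, and a build tool that honors it embeds that value instead of the current time. A minimal sketch in C follows; the function names `pick_build_time` and `build_time_from_env` are hypothetical, and falling back to the wall clock on a malformed value is an assumption of this sketch, not something the specification mandates.

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <stdlib.h>
      #include <time.h>

      /* Hypothetical helper: choose the timestamp to embed in an artifact.
       * `env_value` is the raw SOURCE_DATE_EPOCH string (NULL if unset);
       * `fallback` is used when the variable is missing or malformed. */
      time_t pick_build_time(const char *env_value, time_t fallback)
      {
          if (env_value == NULL || *env_value == '\0')
              return fallback;

          char *end = NULL;
          errno = 0;
          long long seconds = strtoll(env_value, &end, 10);
          if (errno != 0 || *end != '\0' || seconds < 0)
              return fallback;       /* not a clean non-negative integer */
          return (time_t)seconds;
      }

      /* Convenience wrapper that reads the real environment. */
      time_t build_time_from_env(void)
      {
          time_t now = time(NULL);
          assert(now != (time_t)-1); /* the fallback needs a readable clock */
          return pick_build_time(getenv("SOURCE_DATE_EPOCH"), now);
      }
      ```

      With the variable set, two builds run at different times embed the same timestamp, which removes one common source of differing output.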

    • stabbles 3 hours ago
      Nix hashes the build inputs, for which deterministic builds are not required, only desirable.
  • ComputerGuru 6 hours ago
    These seem very reasonable, the workarounds used are natural, and overall the article is not at all congruous with the conclusion in the (clickbait?) title?

    Compilers literally made your project possible!

    • zeratax 2 hours ago
      > Clang relies on address layout for ordering things

      I would consider that a bug tbh

  • biglost 5 hours ago
    Time, date, env variables and random addresses... are also input data, maybe not as a flag, but still.
    • RyanSquared 5 hours ago
      Time and date are... tolerable. There's SOURCE_DATE_EPOCH which should always be set to whack it into submission when used. ASLR of the _compiler being invoked_ resulting in a difference in the _program being compiled_ is nuts and would break any self-hosting compiler with consistency checks.
    • yjftsjthsd-h 5 hours ago
      Explicit is Better than Implicit.
  • swiftcoder 5 hours ago
    The Birth and Death of Javascript really had the gift of prophecy, eh
    • neocron 3 hours ago
      Was that the thing where Gary predicted js in the kernel?
      • swiftcoder 2 hours ago
        yes, although more directly relevant here, chrome-compiled-to-wasm-nested-in-chrome
  • evmar 1 hour ago
    A better solution might be to use https://github.com/evanw/polywasm to run the original wasm in place.
  • pertymcpert 6 hours ago
    If Clang generated non-deterministic output due to pointer addresses then that's a bug (happens regularly) that should be fixed. The most common way this happens is that some code path iterates over a DenseMap, whose iteration order is non-deterministic. Sometimes that's fine and sometimes it's not, depending on how that map is used. The common way to fix that is to switch to a MapVector, which pays some additional runtime/memory cost to guarantee deterministic iteration order.
    • xena 5 hours ago
      I'll try and make a minimal reproduction case and file a bug. Do you know of any tooling that can take a binary and fuzz it down to a minimal reproduction set?
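
      The failure mode described above can be sketched in plain C. This is a toy analogue, not LLVM code: `struct symbol`, `cmp_by_address`, `cmp_by_seq` and `emit_names` are invented for illustration. Ordering output by pointer value depends on where the allocator (and ASLR) placed each object, while ordering by a recorded insertion sequence gives the same output every run, which is roughly the guarantee an insertion-ordered container provides.

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>

      /* Toy stand-in for a compiler symbol; `seq` records insertion order. */
      struct symbol {
          const char *name;
          unsigned seq;
      };

      /* Fragile: the order depends on where each symbol lives in memory. */
      int cmp_by_address(const void *a, const void *b)
      {
          uintptr_t pa = (uintptr_t)*(struct symbol *const *)a;
          uintptr_t pb = (uintptr_t)*(struct symbol *const *)b;
          return (pa > pb) - (pa < pb);
      }

      /* Stable: the order depends only on insertion sequence. */
      int cmp_by_seq(const void *a, const void *b)
      {
          const struct symbol *sa = *(struct symbol *const *)a;
          const struct symbol *sb = *(struct symbol *const *)b;
          return (sa->seq > sb->seq) - (sa->seq < sb->seq);
      }

      /* Sort `syms` with `cmp`, then join the names with commas into `out`. */
      void emit_names(struct symbol **syms, size_t n,
                      int (*cmp)(const void *, const void *),
                      char *out, size_t out_size)
      {
          assert(out != NULL && out_size > 0);
          qsort(syms, n, sizeof *syms, cmp);
          out[0] = '\0';
          for (size_t i = 0; i < n; i++) {
              if (i > 0)
                  strncat(out, ",", out_size - strlen(out) - 1);
              strncat(out, syms[i]->name, out_size - strlen(out) - 1);
          }
      }
      ```

      Passing `cmp_by_address` instead of `cmp_by_seq` still produces a valid list, just one whose order can change between runs, which is exactly the kind of difference that breaks a byte-for-byte comparison.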
  • Animats 4 hours ago
    I hate proof of work code running on my machine for the benefit of someone else. It's like planting a crypto miner.
    • ctrlmeta 37 minutes ago
      Do proof-of-work pages actually stop AI bots? Big AI companies have enough compute to solve these challenges at scale. And if their bots are already doing much heavier work to fetch, read and process each page, then solving a small challenge first seems unlikely to be a serious barrier. Who are these proof-of-work challenges actually helping?
    • tengwar2 50 minutes ago
      Yup, and I suspect that even if OP is honest in this respect, if proof-of-work gets established as a normal practice for web pages, it's going to be used this way.

      But just taking this as-is, what is the environmental impact likely to be when multiplied up by the number of users? Proof of work is a bad idea.

    • account42 1 hour ago
      Yes, all these kinds of bot checks are essentially malware.
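
      For context on what such a challenge asks the browser to do, here is a minimal sketch in C of a hash-based proof of work: the client brute-forces a nonce until the hash of the challenge plus nonce starts with a given number of zero bits, and the server checks it with a single hash. Everything here is an assumption for illustration: the names `pow_valid` and `pow_solve`, the "challenge:nonce" message format, and above all the hash. FNV-1a is NOT cryptographic and is used only so the sketch needs no external library; a real deployment would use a cryptographic hash such as SHA-256.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      /* FNV-1a, 64-bit. Stand-in only: not a cryptographic hash. */
      uint64_t fnv1a64(const char *data, size_t len)
      {
          uint64_t h = 0xcbf29ce484222325ULL;    /* FNV offset basis */
          for (size_t i = 0; i < len; i++) {
              h ^= (unsigned char)data[i];
              h *= 0x100000001b3ULL;             /* FNV prime */
          }
          return h;
      }

      /* Server side: one hash. A nonce is valid when the hash of
       * "challenge:nonce" has at least `bits` leading zero bits. */
      bool pow_valid(const char *challenge, uint64_t nonce, unsigned bits)
      {
          assert(bits < 64);
          char msg[256];
          int len = snprintf(msg, sizeof msg, "%s:%llu", challenge,
                             (unsigned long long)nonce);
          if (len < 0 || (size_t)len >= sizeof msg)
              return false;                      /* too long for this sketch */
          uint64_t h = fnv1a64(msg, (size_t)len);
          return bits == 0 || (h >> (64 - bits)) == 0;
      }

      /* Client side: brute force. The expected number of hashes doubles
       * with every extra bit of difficulty. */
      uint64_t pow_solve(const char *challenge, unsigned bits)
      {
          assert(strlen(challenge) < 200);       /* keeps the message in bounds */
          uint64_t nonce = 0;
          while (!pow_valid(challenge, nonce, bits))
              nonce++;
          return nonce;
      }
      ```

      The asymmetry is the whole design: verification costs one hash, solving costs about 2^bits hashes on average, and that solving cost is the energy the comments above are objecting to.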
  • znpy 4 hours ago
    > What do you do when the client has WebAssembly disabled?

    > I decided to take inspiration from the legendary talk The Birth and Death of JavaScript and just recompile the WebAssembly to JavaScript.

    So what do you do when the client has JavaScript disabled?

  • sylware 5 hours ago
    To avoid all those grotesque and absurd compilers and runtimes, especially those for computer languages with an ultra-complex syntax (C++ and similar), I now design "binary specifications", which I "design" and "validate" by coding in RISC-V assembly.

    Here, since any whatwg cartel web engine is an issue, the author should not bother.

  • charcircuit 6 hours ago
    As long as the program is equivalent there isn't an actual problem here. Requiring the output to always be the same is an arbitrary restriction.

    If you want to have users trust that someone else hasn't modified it, then sign it with your identity.

    • yjftsjthsd-h 5 hours ago
      We'd like to verify, not trust.
      • charcircuit 4 hours ago
        The whole point of a signature is that you are able to verify that what was signed was in fact a message signed by the signer.
        • robinsonb5 4 hours ago
          Sure, but a signature doesn't prove that a particular binary came from a particular codebase - merely that a particular human (or other trusted entity, for varying degrees of "trusted") has vouched for it.

          Being able to reproduce the binary from the source code and being able to verify that it's the same as the original is quite important in some contexts.

          • skydhash 12 minutes ago
            > Being able to reproduce the binary from the source code and being able to verify that it's the same as the original is quite important in some contexts

            Why not build your own binaries and be done with that? If you don’t trust the compiler or the machine doing the build, just build the code yourself.

          • charcircuit 4 hours ago
            >Being able to reproduce the binary from the source code and being able to verify that it's the same as the original is quite important in some contexts.

            I disagree. The contexts that people come up with are purely theoretical, and are not practically important. Please do try and convince me otherwise by sharing such a context. From my view the juice of trying to accomplish this is nowhere near worth the squeeze.

            • harrouet 28 minutes ago
              You disagree but you're wrong.

              Military context: a government would want to review the code and compile themselves. Provide a hash of the target binary to ensure they've compiled it correctly.

              SDLC: provide auditors with _proof_ that the tested binary is indeed coming from the audited code
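
              The check both scenarios rely on is simple once builds are reproducible: rebuild from the reviewed source and compare the result with the shipped binary. In practice one compares cryptographic digests; the sketch below compares the bytes directly because the C standard library has no hash function. The name `first_difference` is hypothetical.

              ```c
              #include <assert.h>
              #include <stddef.h>

              /* Returns -1 when the two images are byte-for-byte
               * identical, otherwise the offset of the first differing
               * byte. A length mismatch counts as a difference at the
               * end of the shorter image. */
              ptrdiff_t first_difference(const unsigned char *a, size_t a_len,
                                         const unsigned char *b, size_t b_len)
              {
                  assert(a != NULL && b != NULL);
                  size_t common = a_len < b_len ? a_len : b_len;
                  for (size_t i = 0; i < common; i++) {
                      if (a[i] != b[i])
                          return (ptrdiff_t)i;
                  }
                  return a_len == b_len ? -1 : (ptrdiff_t)common;
              }
              ```

              A result of -1 is the proof the auditor wants; any other result gives an offset from which to start looking for an embedded timestamp, path, or ordering difference.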

  • mathisfun123 6 hours ago
    [flagged]
    • xena 6 hours ago
      • heavensteeth 6 hours ago
        I'm surprised by the amount of heckling this post received almost immediately! And a lack of constructive input.

        I for one enjoyed the article and understand what you're getting at.

    • yjftsjthsd-h 5 hours ago
      > This is the goofiest I've seen written unironically in quite a long - the C preprocessor is not part of the compiler. The pre in preprocessor should probably give it away.

      This is true but doesn't seem relevant; does replacing the word "compiler" with "build chain" change anything? Because that seems like the clear meaning.

    • LPisGood 5 hours ago
      Re: source code producing different binaries: things like ASLR, stack canaries, optimization levels, linking, etc all lead to different binaries.
  • dyauspitr 6 hours ago
    LLMs should be trained on and directly output binary.
    • klodolph 6 hours ago
      On the off chance that you’re serious, that would result in disastrously bad output. The difference between “jmp $+15” and “jmp $+16” is inscrutable and the LLM would not be able to pick the right one without tooling.

      That tooling is a compiler. The higher level, the better chance the LLM can be steered to good output. Machine code is hopeless, don’t bother.

      • pjmlp 6 hours ago
        That compiler does wonders with languages that have UB in their specs, especially with optimization passes that rely on heuristics.

        Also, there are dynamic compilers where the shape of the machine code changes as the code executes, and every single execution will almost certainly generate different sequences, depending on what the program does and where it is running.

        Deterministic JIT compiler code generation, at least on optimising ones, is not a solved problem.

      • faangguyindia 6 hours ago
        What about AOT optimization, which brings AOT closer to JIT performance? Isn't that something an LLM + harness can easily do?
        • klodolph 5 hours ago
          I think the idea that AOT is inherently faster than JIT, or vice versa, is a thoroughly debunked idea.

          You can have LLMs help you optimize code but I don’t think you can do this unattended for non-trivial code.

      • jenadine 6 hours ago
        > The difference between “jmp $+15” and “jmp $+16” is inscrutable

        I don't see why that's the case. An LLM trained on binary would totally see it, no?

        Also the tool can also be running the test and a debugger.

        • klodolph 5 hours ago
          > I don't see why that's the case. An LLM trained on binary would totally see it, no?

          It would not. You find the correct version by counting the number of bytes to the destination. LLMs are famously bad at this kind of problem (counting).

          > Also the tool can also be running the test and a debugger.

          The test needs to provide a good amount of signal. That’s too hard if you are throwing machine code at the wall.

          In order for debuggers to work, you need some kind of model that describes what the code should do and what state the computer should be in after each instruction. That model is high-level code.

          I can understand the intuitive appeal of training LLMs with machine code, but all of my experience with LLMs suggests that they are incredibly ill-suited to the task, and we just don’t have the capacity to train them to make useful machine code.

          • zx8080 5 hours ago
            Can "LLMs are bad at counting" be generalized to "LLMs are better at complex stuff but make more mistakes in simple stuff"?
            • fluoridation 5 hours ago
              I would phrase it as "LLMs are good at big picture stuff and bad at fine detail", or to put it another way, they're accurate, but imprecise and with low reproducibility.
              • bregma 1 hour ago
                It is my experience that it's the opposite. LLMs are very very precise but wildly inaccurate. They might give you 17 significant digits but be off by 10 orders of magnitude, to use a metaphor.
              • benj111 1 hour ago
                But where does that leave us when programmers treat themselves as architects with the AI doing the drudge work? As seems to be the fashion.

                It then means you have 2 parties focussing on the big picture and no one focussing on the details.

            • ozlikethewizard 5 hours ago
              It's more that LLMs are better at vague problems with multiple non-perfect solutions, and struggle with problems that require precision.
            • klodolph 5 hours ago
              No, I don’t think so. LLMs are good at a lot of simple tasks, but bad at certain simple tasks. Moravec’s paradox in a new iteration.

              It applies to humans too. Calculus is “simple” but it takes something like sixteen years to train a human to do it, if all goes well. Meanwhile, most humans think that inverse kinematics is, like, the easiest thing in the world (it’s a super complicated task).

              • fluoridation 5 hours ago
                Calculus is definitely the harder task, considering it took a species developing the cognitive capacity for symbolic reasoning for it to show up, whereas any animal can figure out how to position its limbs. Yeah, we figured out how to make CAS programs before inverse kinematics software, but that's because computers were made to solve numerical problems, not to replace the cerebella of chordates.
        • dezgeg 3 hours ago
          Even if it could, it would be ridiculously token-inefficient to update a huge number of addresses whenever some small change is made in the middle of a binary.
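
          The byte-counting point can be made concrete with a minimal sketch in C. It encodes an x86 short jump (opcode 0xEB followed by one signed displacement byte, measured from the end of the instruction) and assumes that `$` in "jmp $+15" means the start of the jump instruction, as in NASM syntax. The helper name `encode_short_jmp` is made up for illustration.

          ```c
          #include <assert.h>
          #include <stdbool.h>
          #include <stdint.h>

          /* Encode a short jump located at byte offset `from` that should
           * land on byte offset `to`. The displacement is counted from the
           * end of the 2-byte instruction. Returns false when the target
           * does not fit in a signed 8-bit displacement. */
          bool encode_short_jmp(int64_t from, int64_t to, uint8_t out[2])
          {
              assert(out != NULL);
              int64_t disp = to - (from + 2);
              if (disp < INT8_MIN || disp > INT8_MAX)
                  return false;
              out[0] = 0xEB;
              out[1] = (uint8_t)(int8_t)disp;
              return true;
          }
          ```

          Inserting a single byte between the jump and its target turns "jmp $+15" into "jmp $+16", so the displacement byte has to change from 13 to 14, and the same bookkeeping applies to every other jump that crosses the insertion point. An assembler or linker does this counting mechanically.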
    • xiaoyu2006 6 hours ago
      It should not. Abstraction in software engineering brings intelligence. (compression correlates to intelligence)
      • shshshjaja 6 hours ago
        runApp()

        Done! Excellent abstraction. High intelligence.

      • frwrfwrfeefwf 5 hours ago
        people don't get this
      • dyauspitr 6 hours ago
        Why? I mean this is all emergent, right? And it’s not like humans ever work at this level. It would be very interesting to see what sort of outputs and abstractions an LLM comes up with.
    • bandrami 5 hours ago
      Generative algorithms have been studied for decades now and while they have led to some interesting results they're a bad fit for LLMs because there's no such thing as a "plausible" binary: a small perturbation yields an unusable result.
    • fulafel 5 hours ago
      Technically they are, just a subset. But still a practical one, they're frequently used to produce executable files.
    • rvz 6 hours ago
      I think you forgot the "/s"