Making AMD GPUs competitive for LLM inference (2023)

(blog.mlc.ai)

241 points | by plasticchris 18 hours ago

19 comments

  • pavelstoev 14 hours ago
    The problem is that performance achievements on AMD consumer-grade GPUs (RX 7900 XTX) are not representative of, or transferable to, datacenter-grade GPUs (MI300X). Consumer GPUs are based on the RDNA architecture, while datacenter GPUs are based on the CDNA architecture, and AMD is not expected to release the unifying UDNA architecture until sometime around 2026 [1]. At CentML we are currently working on integrating AMD CDNA and HIP support into our Hidet deep learning compiler [2], which will also power inference workloads for all Nvidia GPUs, AMD GPUs, Google TPUs and AWS Inf2 chips on our platform [3]

    [1] https://www.jonpeddie.com/news/amd-to-integrate-cdna-and-rdn.... [2] https://centml.ai/hidet/ [3] https://centml.ai/platform/

    • llm_trw 11 hours ago
      The problem is that the specs of AMD consumer-grade GPUs do not translate to compute performance when you try and chain more than one together.

      I have 7 NVidia 4090s under my desk happily chugging along on week long training runs. I once managed to get a Radeon VII to run for six hours without shitting itself.

      • mpreda 5 hours ago
        > I have 7 NVidia 4090s under my desk

        I have 6 Radeon Pro VII under my desk (in a single system BTW), and they run hard for weeks until I choose to reboot e.g. for Linux kernel updates.

        I bought them "new old stock" for $300 apiece. So that's $1800 for all six.

        • highwaylights 4 hours ago
          How does the compute performance compare to 4090’s for these workloads?

          (I realise it will be significantly lower, just trying to get as much of a comparison as possible).

          • crest 2 hours ago
            The Radeon VII is special compared to most older (and current) affordable GPUs in that it used HBM, giving it memory bandwidth comparable to modern cards (~1 TB/s), and it has reasonable FP64 throughput (1:4 instead of 1:64). So this card can still be pretty interesting for running memory-bandwidth-intensive FP64 workloads. Anything affordable afterward from either AMD or Nvidia crippled FP64 throughput to below what an AVX-512 many-core CPU can do.
            • nine_k 54 minutes ago
              If we speak about FP64, are your loads more like fluid dynamics than ML training?
          • cainxinth 4 hours ago
            The 4090 offers 82.58 teraflops of single-precision performance compared to the Radeon Pro VII's 13.06 teraflops.
            • adrian_b 2 hours ago
              On the other hand, for double precision a Radeon Pro VII is many times faster than a RTX 4090 (due to 1:2 vs. 1:64 FP64:FP32 ratio).

              Moreover, for workloads limited by memory bandwidth, a Radeon Pro VII and an RTX 4090 will have about the same speed, regardless of what kind of computations are performed. Being limited by memory bandwidth is said to happen frequently in ML/AI inferencing.

      • tspng 10 hours ago
        Wow, are these 7 RTX 4090s in a single setup? Care to share more about how you built it (case, cooling, power, ...)?
        • ghxst 7 hours ago
          You might find the journey of Tinycorp's Tinybox interesting. It's a machine with 6 to 8 4090 GPUs, and you should be able to track down a lot of their hardware choices, including pictures, on their Twitter, plus other info in George Hotz's livestreams.
        • llm_trw 6 hours ago
          Basically this but with an extra card on the x8 slot for connecting my monitors: https://www.youtube.com/watch?v=C548PLVwjHA

          There's a bunch of similar setups and there are a couple of dozen people that have done something similar on /r/localllama.

        • osmarks 5 hours ago
          Most of these are just an EPYC server platform, some cursed risers and multiple PSUs (though cryptominer server PSU adapters are probably better). See https://nonint.com/2022/05/30/my-deep-learning-rig/ and https://www.mov-axbx.com/wopr/wopr_concept.html.
          • Keyframe 5 hours ago
            Looks like a fire hazard :)
        • adakbar 8 hours ago
          I'd like to know too
    • zozbot234 9 hours ago
      It looks like AMD's CDNA GPUs are supported by Mesa, which ought to suffice for Vulkan Compute and SYCL support. So there should be ways to run ML workloads on the hardware without going through HIP/ROCm.
  • shihab 17 hours ago
    I have come across quite a few startups who are trying a similar idea: break the Nvidia monopoly by utilizing AMD GPUs (for inference at least): Felafax, Lamini, TensorWave (partially), SlashML. I even saw optimistic claims from some of them that the CUDA moat is only 18 months deep [1]. Let's see.

    [1] https://www.linkedin.com/feed/update/urn:li:activity:7275885...

    • pinsiang 14 hours ago
      AMD GPUs are becoming a serious contender for LLM inference. vLLM is already showing impressive performance on AMD [1], even with consumer-grade Radeon cards (it even supports GGUF) [2]. This could be a game-changer for folks who want to run LLMs without shelling out for expensive NVIDIA hardware.

      [1] https://blog.vllm.ai/2024/10/23/vllm-serving-amd.html [2] https://embeddedllm.com/blog/vllm-now-supports-running-gguf-...

    • ryukoposting 17 hours ago
      Peculiar business model, at a glance. It seems like they're doing work that AMD ought to be doing, and is probably doing behind the scenes. Who is the customer for a third-party GPU driver shim?
      • dpkirchner 17 hours ago
        Could be trying to make themselves a target for a big acquihire.
        • to11mtm 15 hours ago
          Cynical take: Try to get acquired by Intel for Arc.
          • dogma1138 13 hours ago
            Intel is in vastly better shape than AMD; they have the software pretty much nailed down.
            • lhl 11 hours ago
              I've recently been poking around with Intel oneAPI and IPEX-LLM. While there are things that I find refreshing (like their ability to actually respond to bug reports in a timely manner, or at all), on the whole the support/maturity actually doesn't match the current state of ROCm.

              PyTorch requires its own support kit separate from the oneAPI Toolkit (and runs slightly different versions of everything), the vLLM xpu support doesn't work - both the source and the docker failed to build/run for me. The IPEX-LLM whisper support is completely borked, etc, etc.

              • moffkalast 6 hours ago
                I've recently been trying to get IPEX working as well, apparently picking Ubuntu 24.04 was a mistake, because while things compile, everything fails at runtime. I've tried native, docker, different oneAPI versions, threw away a solid week of afternoons for nothing.

                SYCL with llama.cpp is great though, at least at FP16 (since it supports nothing else), and even Arc iGPUs easily give 2-4x the performance of CPU inference.

                Intel should've just contributed to SYCL instead of trying to make their own thing and then forgetting to keep maintaining it halfway through.

                • lhl 5 hours ago
                  My testing has been w/ a Lunar Lake Core 258V chip (Xe2 - Arc 140V) on Arch Linux. It sounds like you've tried a lot of things already, but in case it helps, my notes for installing llama.cpp and PyTorch: https://llm-tracker.info/howto/Intel-GPUs

                  I have some benchmarks as well, and the IPEX-LLM backend performed a fair bit better than the SYCL llama.cpp backend for me (almost +50% pp512 and almost 2X tg128) so worth getting it working if you plan on using llama.cpp much on an Intel system. SYCL still performs significantly better than Vulkan and CPU backends, though.

                  As an end-user, I agree that it'd be way better if they could just contribute upstream somehow (whether to the SYCL backend, or if not possible, to a dependency-minimized IPEX backend). The IPEX backend is one of the more maintained parts of IPEX-LLM, btw. I found a lot of stuff in that repo that depends on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."

                  • moffkalast 4 hours ago
                    Well that's funny, I think we already spoke on Reddit. I'm the guy who was testing the 125H recently. I guess there's like 5 of us who have Intel hardware in total and we keep running into each other :P

                    Honestly I think there's just something seriously broken with the way IPEX expects the GPU driver to be on 24.04 and there's nothing I can really do about it except wait for them to fix it if I want to keep using this OS.

                    I am vaguely considering adding another drive and installing 22.04 or 20.04 with the exact kernel they want, to see if that might finally work in the meantime, but honestly I'm fairly satisfied with the speed I get from SYCL already. The problem is more that it's annoying to integrate it directly through the server endpoint; every project expects a damn ollama API or llama-cpp-python these days, and I'm a fan of neither since it's just another layer of headaches to get those compiled with SYCL.

                    > I found a lot of stuff in that repo that depend on versions of oneKit that aren't even downloadable on Intel's site. I couldn't help but smirk when I heard someone say "Intel has their software nailed down."

                    Yeah well the fact that oneAPI 2025 got released, broke IPEX, and they still haven't figured out a way to patch it for months makes me think it's total chaos internally, where teams work against each other instead of talking and coordinating.

            • indolering 11 hours ago
              Tell that to the board.
            • bboygravity 11 hours ago
              Someone never used intel killer wifi software.
          • dangero 13 hours ago
            More cynical take: Trying to get acquired by nvidia
            • dizhn 11 hours ago
              Person below says they (the whole team) already joined Nvidia.
          • shiroiushi 15 hours ago
            More cynical take: this would be a bad strategy, because Intel hasn't shown much competence in its leadership for a long time, especially in regards to GPUs.
            • rockskon 14 hours ago
              They've actually been making positive moves with GPUs lately along with a success story for the B580.
              • kimixa 10 hours ago
                The B580 being a "success" is purely a business decision, a loss leader to get their name into the market. A larger die on a newer node than either Nvidia or AMD means their per-unit costs are higher, yet they are selling it at a lower price.

                That's not a long-term success strategy. Maybe good for getting your name in the conversation, but not sustainable.

                • bitmasher9 3 hours ago
                  It’s a long term strategy to release a hardware platform with minimal margins in the beginning to attract software support needed for long term viability.

                  One of the benefits of being Intel.

                • 7speter 31 minutes ago
                  I don’t know if this matters but while the B580 has a die comparable in size to a 4070 (~280mm^2), it has about half the transistors (~17-18 billion), iirc.
                • jvanderbot 6 hours ago
                  I was reading this whole thread as about technical accomplishment and non-nvidia GPU capabilities, not business. So I think you're talking about different definitions of "Success". Definitely counts, but not what I was reading.
              • schmidtleonard 13 hours ago
                Yeah but MLID says they are losing money on every one and have been winding down the internal development resources. That doesn't bode well for the future.

                I want to believe he's wrong, but on the parts of his show where I am in a position to verify, he generally checks out. Whatever the opposite of Gell-Mann Amnesia is, he's got it going for him.

                • sodality2 13 hours ago
                  MLID on Intel is starting to become the same as UserBenchmark on AMD (except for the generally reputable sources)... he's beginning to sound like he simply wants Intel to fail, to my insider-info-lacking ears. For competition's sake I really hope that MLID has it wrong (at least the opining about the imminent failure of Intel's GPU division), and that the B series will encourage Intel to push farther to spark more competition in the GPU space.
                • oofabz 13 hours ago
                  The die size of the B580 is 272 mm2, which is a lot of silicon for $249. The performance of the GPU is good for its price but bad for its die size. Manufacturing cost is closely tied to die size.

                  272 mm2 puts the B580 in the same league as the Radeon 7700XT, a $449 card, and the GeForce 4070 Super, which is $599. The idea that Intel is selling these cards at a loss sounds reasonable to me.

                  • KeplerBoy 5 hours ago
                    At a loss seems a bit overly dramatic. I'd guess Nvidia sells SKUs for three times their marginal cost. Intel is probably operating at cost without any hopes of recouping R&D with the current SKUs, but that's reasonable for an aspiring competitor.
                    • 7speter 27 minutes ago
                      It kinda seems they are covering the cost of throwing massive amounts of resources trying to get Arc’s drivers in shape.
                  • tjoff 9 hours ago
                    Though you assume the prices of the competition are reasonable. There are plenty of reasons for them not to be: availability issues, lack of competition, other more lucrative avenues, etc.

                    Intel has none of those, or at least not to the same extent.

                • derektank 13 hours ago
                  Wait, are they losing money on every one in the sense that they haven't broken even on research and development yet? Or in the sense that they cost more to manufacture than they're sold at? Because one is much worse than the other.
                  • rockskon 11 hours ago
                    They're trying to unseat Radeon as the budget card. That means making a more enticing offer than AMD for a temporary period of time.
        • dboreham 15 hours ago
          > Could be trying to make themselves a target for a big acquihire.

          Is this something anyone sets out to do?

      • tesch1 17 hours ago
        AMD. Just one more dot to connect ;)
      • dylan604 16 hours ago
        It would be interesting to find out AMD is funding these other companies to ensure the shim happens while they focus on not doing it.
        • bushbaba 15 hours ago
          AMD is kind of doing that funding by pricing its GPUs low and/or giving them away at cost to these startups
      • shmerl 16 hours ago
        Is this effort benefiting everyone? I.e. where is it going / is it open source?
    • llama-mini 13 hours ago
      From Lamini: we have a private AMD GPU cluster, ready to serve anyone who wants to try MI300x or MI250 for inference and tuning.

      We just onboarded a customer moving from the OpenAI API to an on-prem solution, and are currently evaluating MI300x for inference.

      Email me at my profile email.

    • 3abiton 5 hours ago
      My understanding is that once JAX takes off, the CUDA advantage is gone for Nvidia. That's a big if/when though.
    • jsheard 17 hours ago
      Tinygrad was another one, but they ended up getting frustrated with AMD and semi-pivoted to Nvidia.
      • nomel 15 hours ago
        This is discussed in the Lex Fridman episode. AMD's own demo would kernel panic when run in a loop [1].

        [1] https://youtube.com/watch?v=dNrTrx42DGQ&t=3218

        • kranke155 9 hours ago
          Interesting. I wonder whether focusing on both GPUs and CPUs is something that requires two companies instead of one, or whether the concentration of resources just leads to one arm of the company being much better than the other.
      • noch 13 hours ago
        > Tinygrad was another one, but they ended up getting frustrated with AMD and semi-pivoted to Nvidia.

        From their announcement on 20241219[^0]:

        "We are the only company to get AMD on MLPerf, and we have a completely custom driver that's 50x simpler than the stock one. A bit shocked by how little AMD cared, but we'll take the trillions instead of them."

        From 20241211[^1]:

        "We gave up and soon tinygrad will depend on 0 AMD code except what's required by code signing.

        We did this for the 7900XTX (tinybox red). If AMD was thinking strategically, they'd be begging us to take some free MI300s to add support for it."

        ---

        [^0]: https://x.com/__tinygrad__/status/1869620002015572023

        [^1]: https://x.com/__tinygrad__/status/1866889544299319606

  • jroesch 17 hours ago
    Note: this is old work; much of the team working on TVM and MLC was from OctoAI, and we have all recently joined NVIDIA.
    • sebmellen 17 hours ago
      Is there no hope for AMD anymore? After George Hotz/Tinygrad gave up on AMD I feel there’s no realistic chance of using their chips to break the CUDA dominance.
      • comex 15 hours ago
        Maybe from Modular (the company Chris Lattner is working for). In this recent announcement they said they had achieved competitive ML performance… on NVIDIA GPUs, but with their own custom stack completely replacing CUDA. And they’re targeting AMD next.

        https://www.modular.com/blog/introducing-max-24-6-a-gpu-nati...

        • behnamoh 15 hours ago
          Ah yes, the programming language (Mojo) that requires an account before I can use it...
          • melodyogonna 12 hours ago
            Mojo no longer requires an account to install.

            But that is irrelevant to the conversation because this is not about Mojo but something they call MAX. [1]

            1. https://www.modular.com/max

      • steeve 1 hour ago
        We (ZML) have AMD MI300X working just fine; in fact, faster than H100.
      • latchkey 16 hours ago
        • krackers 16 hours ago
          That's almost word for word what geohot said last year?
          • refulgentis 16 hours ago
            What part?

            I assume the part where she said there's "gaps in the software stack", because that's the only part that's attributed to her.

            But I must be wrong because that hasn't been in dispute or in the news in a decade, it's not a geohot discovery from last year.

            Hell, I remember a subargument of a subargument re: this being an issue a decade ago in macOS dev (TL;DR: whether to invest in OpenCL)

          • bn-l 16 hours ago
            I went through the thread. There's an argument to be made for firing Su for being so spaced out as to miss an opportunity to get their own CUDA for free.
            • hedgehog 15 hours ago
              Not remotely, how did you get to that idea?
              • refulgentis 13 hours ago
                Kids these days (shakes fist)

                tl;dr: there's a not-insubstantial number of people who learn a lot from geohot. I'd say about 3% of people here would be confused if you thought of him as anything less than a top technical expert across many comp sci fields.

                And he did the geohot thing recently, way tl;dr: acted like there was a scandal being covered up by AMD around drivers that was causing them to "lose" to nVidia.

                He then framed AMD not engaging with him on this topic as further covering-up and choosing to lose.

                So if you're of a certain set of experiences, you see an anodyne quote from the CEO that would have been utterly unsurprising dating back to when ATI was still a company, and you'd read it as the CEO breezily admitting in public that geohot was right about how there was malfeasance, followed by a cover up, implying extreme dereliction of duty, because she either helped or didn't realize till now.

                I'd argue this is partially due to stonk-ification of discussions, there was a vague, yet often communicated, sense there was something illegal happening. Idea was it was financial dereliction of duty to shareholders.

      • dismalaf 16 hours ago
        IMO the hope shouldn't be that AMD specifically wins, rather it's best for consumers that hardware becomes commoditized and prices come down.

        And that's what's happening, slowly anyway. Google, Apple and Amazon all have their own AI chips, Intel has Gaudi, AMD had their thing, and the software is at least working on more than just Nvidia. Which is a win. Even if it's not perfect. I'm personally hoping that everyone piles in on a standard like SYCL.

      • quotemstr 15 hours ago
        The world is bigger than AMD and Nvidia. Plenty of interesting new AI-tuned non-GPU accelerators coming online.
        • grigio 10 hours ago
          I hope so. Name an NPU that can run a 70B model...
      • llm_trw 17 hours ago
        Not really.

        AMD is constitutionally incapable of shipping anything but mid range hardware that requires no innovation.

        The only reason why they are doing so well in CPUs right now is that Intel has basically destroyed itself without any outside help.

        • adrian_b 41 minutes ago
          In CPUs, AMD has made many innovations that were copied by Intel only after many years, and this delay contributed significantly to Intel's downfall.

          The most important has been that AMD correctly predicted that big monolithic CPUs would no longer be feasible in future CMOS fabrication technologies, so they designed the Zen family from the beginning with a chiplet-based architecture. Intel attempted to ridicule them, but after losing many billions they were forced to copy this strategy.

          Also in the microarchitecture of their CPUs AMD made the right choices from the beginning and then improved it constantly with each generation. The result is that the latest Intel big core, Lion Cove, now has a microarchitecture that is much more similar to AMD Zen 5 than to any of the previous Intel cores, because they had to do this to get a competitive core.

          In the more distant past, AMD also introduced a lot of innovations long before they were copied by Intel. True, those had not been invented by AMD but were themselves borrowed from more expensive CPUs, like DEC Alpha, Cray, or IBM POWER; still, Intel copied them only after being forced to by the competition with AMD.

        • ksec 15 hours ago
          Everything is comparative. AMD isn't perfect. As an ex-shareholder I have argued they did well partly because of Intel's downfall. In terms of execution they are far from perfect.

          But Nvidia is a different beast. It is a bit like Apple in the late 00s: business, forecasting, marketing, operations, software, hardware, sales, etc. Take any part of it and they are all industry leading. And having industry-leading capability is only part of the game; having it all work together is completely another thing. And unlike Apple, which lost direction once Steve Jobs passed away and wasn't sure how to deploy capital, Jensen is still here, and they have more resources now, making Nvidia even more competitive.

          Most people underestimate the magnitude of the task required (I like to tell the story of an Intel GPU engineer in 2016 arguing they could take dGPU market share by 2020, and we are now in 2025), overestimate the capability of their organisation, and underestimate the rival's speed of innovation and execution. These three things combined are why estimates are often off by an order of magnitude.

          • llm_trw 15 hours ago
            Yeah, no.

            We are in the middle of a monopoly squeeze by NVidia on the most innovative part of the economy right now. I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.

            By comparison, if AMD could write a driver that didn't shit itself when it had to multiply more than two matrices in a row, they'd be selling cards faster than they can make them. You don't need to sell the best shovels in a gold rush to make mountains of money, but you can't sell teaspoons as premium shovels and expect people to come back.

            • ksec 13 hours ago
              >We are in the middle of a monopoly squeeze by NVidia on the most innovative part of the economy right now.

              I am not sure which part of Nvidia is a monopoly. That is like suggesting TSMC has a monopoly.

              • vitus 5 hours ago
                > That is like suggesting TSMC has a monopoly.

                They... do have a monopoly on foundry capacity, especially if you're looking at the most advanced nodes? Nobody's going to Intel or Samsung to build 3nm processors. Hell, there have been whispers over the past month that even Samsung might start outsourcing Exynos to TSMC; Intel already did that with Lunar Lake.

                Having a monopoly doesn't mean that you are engaging in anticompetitive behavior, just that you are the only real option in town.

            • Vecr 7 hours ago
              Will they? Given the structure of global controls on GPUs, Nvidia is a de facto self-funding US government company.

              Maybe the US will do something if GPU price becomes the limit instead of the supply of chips and power.

            • kadoban 14 hours ago
              What effect did the DOJ have on MS in the 90s? Didn't all of that get rolled back before they had to pay a dime, and all it amounted to was that browser choice screen that was around for a while? Hardly a crippling blow. If anything that showed the weakness of regulators in fights against big tech, just outlast them and you're fine.
            • shiroiushi 14 hours ago
              >I expect the DOJ to hit them harder than they did MS in the 90s given the bullshit they are pulling and the drag on the economy they are causing.

              It sounds like you're expecting extreme competence from the DOJ. Given their history with regulating big tech companies, and even worse, the incoming administration, I think this is a very unrealistic expectation.

        • perching_aix 16 hours ago
          And I'm supposed to believe that HN is this amazing platform for technology and science discussions, totally unlike its peers...
          • zamadatix 16 hours ago
            The above take is worded a bit cynically, but it does reflect their general approach to GPUs lately across the board, e.g. https://www.techpowerup.com/326415/amd-confirms-retreat-from...

            Also I'd take HN as being an amazing platform for the overall consistency and quality of moderation. Anything beyond that depends more on who you're talking to than where you're at.

          • petesergeant 16 hours ago
            Maybe be the change you want to see and tell us what the real story is?
            • perching_aix 9 hours ago
              We seem to disagree on what the change in the world I'd like to see is like, which is a real shocker I'm sure.

              Personally, I think that's when somebody who has no real information to contribute doesn't try to pretend that they do.

              So thanks for the offer, but I think I'm already delivering on that front.

          • llm_trw 15 hours ago
            I don't really care what you believe.

            Everyone who's dug deep into what AMD is doing has left in disgust if they were lucky, and in bankruptcy if they were not.

            If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.

            • AnthonyMouse 13 hours ago
              > If I can save someone else from wasting $100,000 on hardware and six months of their life then my post has done more good than the AMD marketing department ever will.

              This seems like unuseful advice if you've already given up on them.

              You tried it and at some point in the past it wasn't ready. But by not being ready they're losing money, so they have a direct incentive to fix it. Which would take a certain amount of time, but once you've given up you no longer know if they've done it yet or not, at which point your advice would be stale.

              Meanwhile the people who attempt it apparently seem to get acquired by Nvidia, for some strange reason. Which implies it should be a worthwhile thing to do. If they've fixed it by now which you wouldn't know if you've stopped looking, or they fix it in the near future, you have a competitive advantage because you have access to lower cost GPUs than your rivals. If not, but you've demonstrated a serious attempt to fix it for everyone yourself, Nvidia comes to you with a sack full of money to make sure you don't finish, and then you get a sack full of money. That's win/win, so rather than nobody doing it, it seems like everybody should be doing it.

              • llm_trw 11 hours ago
                I've tried it three times.

                I've seen people try it every six months for two decades now.

                At some point you just have to accept that AMD is not a serious company, but is a second rate copycat and there is no way to change that without firing everyone from middle management up.

                I'm deeply worried about stagnation in the CPU space now that they are top dog and Intel is dead in the water.

                Here's hoping China and RISC-V save us.

                >Meanwhile the people who attempt it apparently seem to get acquired by Nvidia

                Everyone I've seen BASE jumping has gotten a sponsorship from Red Bull, ergo everyone should BASE jump.

                Ignore the red smears around the parking lot.

                • AnthonyMouse 21 minutes ago
                  > At some point you just have to accept that AMD is not a serious company, but is a second rate copycat and there is no way to change that without firing everyone from middle management up.

                  AMD has always punched above their weight. Historically their problem was that they were the much smaller company and under heavy resource constraints.

                  Around the turn of the century the Athlon was faster than the Pentium III and then they made x86 64-bit when Intel was trying to screw everyone with Itanic. But the Pentium 4 was a marketing-optimized design that maximized clock speed at the expense of heat and performance per clock. Intel was outselling them even though the Athlon 64 was at least as good if not better. The Pentium 4 was rubbish for laptops because of the heat problems, so Intel eventually had to design a separate chip for that, but they also had the resources to do it.

                  That was the point that AMD made their biggest mistake. When they set out to design their next chip the competition was the Pentium 4, so they made a power-hungry monster designed to hit high clock speeds at the expense of performance per clock. But the reason more people didn't buy the Athlon 64 wasn't that they couldn't figure out that a 2.4GHz CPU could be faster than a 2.8GHz CPU, it was all the anti-competitive shenanigans Intel was doing behind closed doors to e.g. keep PC OEMs from featuring systems with AMD CPUs. Meanwhile by then Intel had figured out that the Pentium 4 was, in fact, a bad design, when their own Pentium M laptops started outperforming the Pentium 4 desktops. So the Pentium 4 line got canceled and Bulldozer had to go up against the Pentium M-based Core, which nearly bankrupted AMD and compromised their ability to fund the R&D needed to sustain state of the art fabs.

                  Since then they've been climbing back out of the hole, but it wasn't until Ryzen in 2017 that you could safely conclude they weren't on the verge of bankruptcy, and even then they were saddled with a lot of debt and contracts requiring them to use the uncompetitive GlobalFoundries fabs for several years. It wasn't until Zen 4 in 2022 that they finally got to switch the whole package to TSMC.

                  So until quite recently the answer to the question "why didn't they do X?" was obvious. They didn't have the money. But now they do.

                • Const-me 6 hours ago
                  > I've tried it three times

                  Have you tried compute shaders instead of that weird HPC-only stuff?

                  Compute shaders are widely used by millions of gamers every day. GPU vendors have a huge incentive to make them reliable and efficient: modern game engines use them for lots of things, e.g. UE5 can even render triangle meshes with GPU compute instead of the graphics pipeline (the tech is called Nanite virtualized geometry). In practice they work fine on all GPUs, ML included: https://github.com/Const-me/Cgml

            • perching_aix 9 hours ago
              I'd be very concerned if somebody makes a $100K decision based on a comment where the author couldn't even differentiate between the words "constitutionally" and "institutionally", while providing as much substance as any other random techbro on any random forum and being overwhelmingly oblivious to it.
        • lofaszvanitt 13 hours ago
          It had to destroy itself. These companies do not act on their own...
  • zamalek 16 hours ago
    I have been playing around with Phi-4 Q6 on my 7950x and 7900XT (with HSA_OVERRIDE_GFX_VERSION). It's bloody fast, even with CPU alone - in practical terms it beats hosted models due to the roundtrip time. Obviously perf is more important if you're hosting this stuff, but we've definitely reached AMD usability at home.
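
    If anyone wants to sanity-check a similar setup, here's a minimal sketch of what I mean (assumes a ROCm build of PyTorch; the "11.0.0" override value targets gfx1100 and may need adjusting for your card):

        # Set the override before the ROCm runtime initializes, i.e. before importing torch.
        import os
        os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")

        import torch  # ROCm builds of PyTorch expose HIP devices through the torch.cuda API

        if torch.cuda.is_available():
            print(torch.cuda.get_device_name(0))  # should report the Radeon card
            x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
            y = x @ x
            torch.cuda.synchronize()
            print(y.float().abs().mean().item())  # quick matmul smoke test
        else:
            print("No HIP device visible; check the ROCm install / override value.")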
  • throwaway314155 18 hours ago
    > Aug 9, 2023

    Ignoring the very old (in ML time) date of the article...

    What's the catch? People are still struggling with this a year later so I have to assume it doesn't work as well as claimed.

    I'm guessing this is buggy in practice and only works for the HF models they chose to test with?

    • Const-me 17 hours ago
      It’s not terribly hard to port ML inference to alternative GPU APIs. I did it for D3D11 and the performance is pretty good too: https://github.com/Const-me/Cgml

      The only catch is, for some reason developers of ML libraries like PyTorch aren't interested in open GPU APIs like D3D or Vulkan. Instead, they focus on proprietary ones, i.e. CUDA and, to a lesser extent, ROCm. I don't know why that is.

      D3D-based videogames have been heavily using GPU compute for more than a decade now. Since Valve shipped the Steam Deck, the same applies to Vulkan on Linux. By now, both technologies are stable, reliable and performant.

      • jsheard 17 hours ago
        Isn't part of it because the first-party libraries like cuDNN are only available through CUDA? Nvidia has poured a ton of effort into tuning those libraries so it's hard to justify not using them.
        • Const-me 17 hours ago
          Unlike training, ML inference is almost always bound by memory bandwidth as opposed to computations. For this reason, tensor cores, cuDNN, and other advanced shenanigans make very little sense for the use case.

          OTOH, general-purpose compute, instead of the fixed-function blocks used by cuDNN, enables custom compression algorithms for these weights, which does help by saving memory bandwidth. For example, I did a custom 5 bits/weight quantization which works on all GPUs, no hardware support necessary, just simple HLSL code: https://github.com/Const-me/Cgml?tab=readme-ov-file#bcml1-co...
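
          For anyone curious, the general trick (not the BCML1 codec itself, just a toy block-quantization sketch with a made-up block size) looks something like this; the point is that you stream roughly 3.5x fewer bytes per generated token:

              import numpy as np

              BLOCK = 32  # weights per block sharing one scale (illustrative choice)

              def quantize_blocks(w, bits=4):
                  """Symmetric per-block quantization: one fp16 scale + low-bit integer codes."""
                  w = w.reshape(-1, BLOCK).astype(np.float32)
                  qmax = 2 ** (bits - 1) - 1
                  scale = np.abs(w).max(axis=1, keepdims=True) / qmax
                  scale[scale == 0] = 1.0
                  codes = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
                  return codes, scale.astype(np.float16)  # a real codec packs two 4-bit codes per byte

              def dequantize_blocks(codes, scale):
                  return (codes.astype(np.float32) * scale).reshape(-1)

              w = np.random.randn(4096 * 4096).astype(np.float32)
              codes, scale = quantize_blocks(w)
              print("rms error:", np.sqrt(np.mean((w - dequantize_blocks(codes, scale)) ** 2)))
              # fp16 is 2 bytes/weight; 4-bit codes plus one fp16 scale per 32 weights is
              # ~0.56 bytes/weight, i.e. ~3.5x less memory traffic per generated token.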

          • boroboro4 15 hours ago
            Only local (read: batch size 1) ML inference is memory bound; production loads are pretty much compute bound. The prefill phase is very compute bound, and with continuous batching the generation phase gets mixed with prefill, which makes the whole process compute bound too. So no, tensor cores and all the other shenanigans are absolutely critical for performant inference infrastructure.
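
            Rough roofline numbers (all values assumed, round figures) showing why batch size flips the bottleneck:

                # Back-of-envelope roofline for a dense transformer; every number is an assumption.
                params = 8e9                   # 8B model
                flops_per_token = 2 * params   # ~2 FLOPs per weight per token
                weight_bytes = params * 2      # fp16 weights

                gpu_flops = 100e12             # ~100 TFLOPS fp16, assumed
                gpu_bw = 1e12                  # ~1 TB/s, assumed

                for batch in (1, 8, 64, 512):
                    compute_ms = batch * flops_per_token / gpu_flops * 1e3
                    memory_ms = weight_bytes / gpu_bw * 1e3  # weights read once, reused across the batch
                    bound = "compute" if compute_ms > memory_ms else "memory"
                    print(f"batch={batch:3d}  math={compute_ms:6.2f} ms  weights={memory_ms:5.1f} ms  -> {bound} bound")
                # Batch-1 decode: ~0.16 ms of math vs ~16 ms of weight reads -> memory bound.
                # Prefill / large continuous batches amortize the weight reads -> compute bound.
                # (Ignores KV-cache traffic and attention FLOPs, which shift but don't flip the picture.)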
            • Const-me 15 hours ago
              PyTorch is a project of the Linux Foundation. The about page with the mission of the foundation contains phrases like "empowering generations of open source innovators", "democratize code", and "removing barriers to adoption".

              I would argue running local inference with batch size=1 is more useful for empowering innovators compared to running production loads on shared servers owned by companies. Local inference increases count of potential innovators by orders of magnitude.

              BTW, in the long run it may also benefit these companies because, in theory, an easy migration path from CUDA puts downward pressure on Nvidia's prices.

              • idonotknowwhy 14 hours ago
                Most people running local inference do so through quants with llama.cpp (which runs on everything) or AWQ/EXL2/MLX with vLLM/TabbyAPI/LM Studio, which are much faster than using PyTorch directly.
    • lhl 5 hours ago
      It depends on what you mean by "this." MLC's catch is that you need to define/compile models for it with TVM. Here is the list of supported model architectures: https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/m...

      llama.cpp has a much bigger supported model list, as does vLLM and of course PyTorch/HF transformers covers everything else, all of which work w/ ROCm on RDNA3 w/o too much fuss these days.

      For inference, the biggest caveat is that Flash Attention is only an aotriton implementation, which besides sometimes being less performant, also doesn't support SWA. For CDNA there is a better CK-based version of FA, but CK does not have RDNA support. There are a couple of people at AMD apparently working on native FlexAttention, so I guess we'll see how that turns out.

      (Note the recent SemiAccurate piece was on training, which I'd agree is in a much worse state; I have personal experience with it often being broken for even the simplest distributed training runs. Funnily enough, if you're running simple fine tunes on a single RDNA3 card, you'll probably have a better time. OOTB, a 7900 XTX will train at about the same speed as an RTX 3090. 4090s blow both of those away, but then you'll probably want more cards and VRAM, or to just move to H100s.)

  • mattfrommars 14 hours ago
    Great. I have yet to understand why the ML community doesn't really push to move away from CUDA. To me, it feels like a dinosaur move to build on top of CUDA, which is screaming proprietary; nothing about it is open source or cross platform.

    The reason I say it's a dinosaur move is: imagine we as a dev community had continued to build on top of Flash or Microsoft Silverlight...

    LLMs and ML have been around for quite a while; with the pace of AI/LLM advancement, the transition to cross platform should have been much quicker. But it hasn't happened yet, and I'm not sure when it will.

    Building a translation layer on top CUDA is not the answer either to this problem.

    • idonotknowwhy 14 hours ago
      For me personally, hacking together projects as a hobbyist, 2 reasons:

      1. It just works. When I tried to build things on Intel Arcs, I spent way more hours bikeshedding IPEX and driver issues than developing.

      2. LLMs seem to have more CUDA code in their training data. I can leverage Claude and 4o to help me build things with CUDA, but trying to get them to help me do the same things on IPEX just doesn't work.

      I'd very much love a translation layer for CUDA, like a dxvk or Wine equivalent.

      Would save a lot of money since Arc gpus are in the bargain bin and nvidia cloud servers are double the price of AMD.

      As it stands now, my dual Intel Arc rig is now just a llama.cpp inference server for the family to use.

      • FloatArtifact 10 hours ago
        What kind of models do you run, and what token output do you get on Intel GPUs?
    • dwood_dev 14 hours ago
      Except I never hear complaints about CUDA from a quality perspective. The complaints are always about lock in to the best GPUs on the market. The desire to shift away is to make cheaper hardware with inferior software quality more usable. Flash was an abomination, CUDA is not.
      • AnthonyMouse 12 hours ago
        Flash was popular because it was an attractive platform for the developer. Back then there was no HTML5 and browsers didn't otherwise support a lot of the things Flash did. Flash Player was an abomination, it was crashy and full of security vulnerabilities, but that was a problem for the user rather than the developer and it was the developer choosing what to use to make the site.

        This is pretty much exactly what happens with CUDA. Developers like it but then the users have to use expensive hardware with proprietary drivers/firmware, which is the relevant abomination. But users have some ability to influence developers, so as soon as we get the GPU equivalent of HTML5, what happens?

        • wqaatwt 10 hours ago
          > users have to use expensive hardware with proprietary drivers/firmware

          What do you mean by that? People trying to run their own models are not “the users” they are a tiny insignificant niche segment.

          • AnthonyMouse 1 hour ago
            There are far more people running llama.cpp, various image generators, etc. than there are people developing that code. Even when the "users" are corporate entities, they're not necessarily doing any development in excess of integrating the existing code with their other systems.

            We're also likely to see a stronger swing away from "do inference in the cloud" because of the aligned incentives of "companies don't want to pay for all that hardware and electricity" and "users have privacy concerns" such that companies doing inference on the local device will have both lower costs and a feature they can advertise over the competition.

            What this is waiting for is hardware in the hands of the users that can actually do this for a mass market price, but there is no shortage of companies wanting a piece of that. In particular, Apple is going to be pushing that hard and despite the price they do a lot of volume, and then you're going to start seeing more PCs with high-VRAM GPUs or iGPUs with dedicated GDDR/HBM on the package as their competitors want feature parity for the thing everybody is talking about, the cost of which isn't actually that high, e.g. 40GB of GDDR6 is less than $100.

      • xedrac 14 hours ago
        Maybe the situation has gotten better in recent years, but my experience with Nvidia toolchains was a complete nightmare back in 2018.
        • claytonjy 14 hours ago
          The CUDA situation is definitely better. The Nvidia struggles are now with the higher-level software they're pushing (Triton, TensorRT-LLM, Riva, etc), tools that are the most performant option when they work, but a garbage developer experience when you step outside the golden path.
          • cameron_b 1 hour ago
            I want to double down on this statement, and call attention to the competitive nature of it. Specifically, I have recently tried to set up Triton on ARM hardware. One might presume Nvidia would give attention to an architecture they develop for, but the way forward is not easy. On some versions of Ubuntu you might have the correct version of Python (usually older than what's packaged), but the current LTS is out of luck for guidance or packages.

            https://github.com/triton-lang/triton/issues/4978

  • latchkey 16 hours ago
    Previously:

    Making AMD GPUs competitive for LLM inference https://news.ycombinator.com/item?id=37066522 (August 9, 2023 — 354 points, 132 comments)

  • melodyogonna 2 hours ago
    Modular claims that it achieves 93% GPU utilization on AMD GPUs [1], with an official preview release coming early next year; we'll see. I must say I'm bullish because of the feedback I've seen people give about the performance on Nvidia GPUs.

    [1] https://www.modular.com/max

  • lasermike026 17 hours ago
    I believe these efforts are very important. If we want this stuff to be practical we are going to have to work on efficiency. Price efficiency is good. Power and compute efficiency would be better.

    I have been playing with llama.cpp to run inference on conventional CPUs. No conclusions but it's interesting. I need to look at llamafile next.

  • lxe 16 hours ago
    A used 3090 is $600-900, performs better than the 7900 XTX, and is much more versatile because of CUDA.
    • Uehreka 15 hours ago
      Reality check for anyone considering this: I just got a used 3090 for $900 last month. It works great.

      I would not recommend buying one for $600, it probably either won’t arrive or will be broken. Someone will reply saying they got one for $600 and it works, that doesn’t mean it will happen if you do it.

      I’d say the market is realistically $900-1100, maybe $800 if you know the person or can watch the card running first.

      All that said, this advice will expire in a month or two when the 5090 comes out.

      • idonotknowwhy 14 hours ago
        I've bought 5 used and they're all perfect. But that's what buyer protection on ebay is for. Had to send back an Epyc mobo with bent pins and ebay handled it fine.
  • Sparkyte 6 hours ago
    The more players in the market the better. AI shouldn't be owned by one business.
  • mrcsharp 11 hours ago
    I will only consider AMD GPUs for LLM when I can easily make my AMD GPU available within WSL and Docker on Windows.

    For now, it is as if AMD does not exist in this field for me.

  • lhl 11 hours ago
    Just an FYI, this is a writeup from August 2023 and a lot has changed (for the better!) for RDNA3 AI/ML support.

    That being said, I did some very recent inference testing on a W7900 (using the same testing methodology used by Embedded LLM's recent post comparing to vLLM's recently added Radeon GGUF support [1]) and MLC continues to perform quite well. On Llama 3.1 8B, MLC's q4f16_1 (4.21 GB weights) performed +35% faster than llama.cpp w/ Q4_K_M w/ their ROCm/HIP backend (4.30 GB weights, a 2% size difference).

    That makes MLC still the generally fastest standalone inference engine for RDNA3 by a country mile. However, you have much less flexibility with quants and by and large have to compile your own for every model, so llama.cpp is probably still more flexible for general use. llama.cpp's speculative decoding (recently added to llama-server) can also give some pretty sizable performance gains: using a 70B Q4_K_M + 1B Q8_0 draft model improves output token throughput by 59% on the same ShareGPT testing. I've also been running tests with Qwen2.5-Coder, and using a 0.5-3B draft model for speculative decoding gives even bigger gains on average (depends highly on acceptance rate).
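
    For anyone unfamiliar with where the speculative decoding gain comes from: the draft model proposes a few tokens and the target model verifies them all in a single pass. A toy greedy-only sketch with stub models (not llama.cpp's actual API):

        import random
        random.seed(0)

        def draft_next(ctx):   # stub for the small draft model (greedy)
            return (sum(ctx) * 31 + 7) % 1000

        def target_next(ctx):  # stub for the big target model (greedy); mostly agrees with the draft
            return draft_next(ctx) if random.random() < 0.8 else random.randrange(1000)

        def speculative_decode(prompt, n_tokens, k=4):
            out = list(prompt)
            while len(out) - len(prompt) < n_tokens:
                draft = []
                for _ in range(k):                        # 1) draft k tokens cheaply
                    draft.append(draft_next(out + draft))
                for i in range(k):                        # 2) one big-model pass verifies all k positions
                    t = target_next(out + draft[:i])
                    out.append(t)                         # the target's token is always kept
                    if t != draft[i]:                     # first mismatch: discard the rest of the draft
                        break
            return out[len(prompt):]

        print(speculative_decode([1, 2, 3], 24))
        # Accepted draft tokens cost only cheap draft passes plus one big-model verification
        # pass, which is how a small draft model can buy a throughput bump like the 59% above.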

    Note, I think for local use, vLLM GGUF is still not suitable at all. When testing w/ a 70B Q4_K_M model (only 40GB), loading, engine warmup, and graph compilation took on avg 40 minutes. llama.cpp takes 7-8s to load the same model.

    At this point for RDNA3, basically everything I need works/runs for my use cases (primarily LLM development and local inferencing), but almost always slower than an RTX 3090/A6000 Ampere (a new 24GB 7900 XTX is $850 atm, used or refurbished 24GB RTX 3090s are in the same ballpark, about $800 atm; a new 48GB W7900 goes for $3600 while a 48GB A6000 (Ampere) goes for $4600). The efficiency gains can be sizable. Eg, on my standard llama-bench test w/ llama2-7b-q4_0, the RTX 3090 gets a tg128 of 168 t/s while the 7900 XTX only gets 118 t/s even though both have similar memory bandwidth (936.2 GB/s vs 960 GB/s). It's also worth noting that since the beginning of the year, the llama.cpp CUDA implementation has gotten almost 25% faster, while the ROCm version's performance has stayed static.

    There is an actively maintained (solo dev) fork of llama.cpp that sticks close to HEAD but basically applies a rocWMMA patch that can improve performance if you use llama.cpp's FA (which still performs worse than with FA disabled) and in certain long-context inference generations (on llama-bench and with this ShareGPT serving test you won't see much difference): https://github.com/hjc4869/llama.cpp - The fact that no one from AMD has shown any interest in helping improve llama.cpp performance (despite often citing llama.cpp-based apps in marketing/blog posts, etc.) is disappointing ... but sadly on brand for AMD GPUs.

    Anyway, for those interested in more information and testing for AI/ML setup for RDNA3 (and AMD ROCm in general), I keep a doc with lots of details here: https://llm-tracker.info/howto/AMD-GPUs

    [1] https://embeddedllm.com/blog/vllm-now-supports-running-gguf-...

  • dragontamer 18 hours ago
    Intriguing. I thought AMD GPUs didn't have tensor cores (or matrix multiplication units) like NVidia; I believe they only have dot product / fused multiply-accumulate instructions.

    Are these LLMs just absurdly memory bound so it doesn't matter?

    • boroboro4 15 hours ago
      They absolutely do have cores similar to tensor cores; they're called matrix cores, and there are specific instructions to utilize them (MFMA). Note I'm talking about DC compute chips, like the MI300.

      LLMs aren't memory bound under production loads; they are pretty much compute bound, at least in the prefill phase, but in practice in general too.

    • ryao 18 hours ago
      They don’t, but GPUs were designed for doing matrix multiplications even without the special hardware instructions for doing matrix multiplication tiles. Also, the forward pass for transformers is memory bound, and that is what does token generation.
      • dragontamer 17 hours ago
        Well sure, but in other GPU tasks, like Raytracing, the difference between these GPUs is far more pronounced.

        And AMD has passable raytracing units (NVidia's are better, but the difference is bigger than in these LLM results).

        If RAM is the main bottleneck then CPUs should be on the table.

        • IX-103 17 hours ago
          > If RAM is the main bottleneck then CPUs should be on the table

          That's certainly not the case. The graphics memory model is very different from the CPU memory model. Graphics memory is explicitly designed for multiple simultaneous reads (spread across several different buses) at the cost of generality (only portions of memory may be available on each bus) and speed (the extra complexity means reads are slower). This makes them fast at doing simple operations on a large amount of data.

          CPU memory only has one bus, so only a single read can happen at a time (a cache line read), but can happen relatively quickly. So CPUs are better for workloads with high memory locality and frequent reuse of memory locations (as is common in procedural programs).

          • dragontamer 12 hours ago
            > CPU memory only has one bus

            If people are paying $15,000 or more per GPU, then I can choose $15,000 CPUs like EPYC that have 12-channels or dual-socket 24-channel RAM.

            Even desktop CPUs are dual-channel at a minimum, and arguably DDR5 is closer to 2 or 4 buses per channel.

            Now yes, GPU RAM can be faster, but guess what?

            https://www.tomshardware.com/pc-components/cpus/amd-crafts-c...

            GPUs are about extremely parallel performance, above and beyond what traditional single-threaded (or limited-SIMD) CPUs can do.

            But if you're waiting on RAM anyway? Then the compute method doesn't matter. It's all about RAM.
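
            Rough peak-bandwidth numbers (assumed DDR5-4800 configs and vendor spec sheets, not measurements) to put that in perspective:

                # Peak theoretical bandwidth: channels * transfer rate * 8 bytes per transfer.
                def ddr5_gb_s(channels, mt_s=4800):
                    return channels * mt_s * 1e6 * 8 / 1e9

                print("desktop, 2ch DDR5-4800: ", ddr5_gb_s(2), "GB/s")   # ~77 GB/s
                print("EPYC, 12ch DDR5-4800:   ", ddr5_gb_s(12), "GB/s")  # ~461 GB/s
                print("dual-socket, 24ch:      ", ddr5_gb_s(24), "GB/s")  # ~922 GB/s
                print("RTX 4090 GDDR6X (spec): ", 1008, "GB/s")
                print("H100 SXM HBM3 (spec):   ", 3350, "GB/s")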

        • webmaven 17 hours ago
          RAM is (often) the bottleneck for highly parallel GPUs, but not for CPUs.

          Though the distinction between the two categories is blurring.

        • schmidtleonard 17 hours ago
          CPUs have pitiful RAM bandwidth compared to GPUs. The speeds aren't so different but GPU RAM busses are wiiiiiiiide.
          • teleforce 16 hours ago
            Compute Express Link (CXL) should mostly solve the limited RAM bandwidth of CPUs:

            1) Compute Express Link (CXL):

            https://en.wikipedia.org/wiki/Compute_Express_Link

            PCIe vs. CXL for Memory and Storage:

            https://news.ycombinator.com/item?id=38125885

            • schmidtleonard 16 hours ago
              Gigabytes per second? What is this, bandwidth for ants?

              My years old pleb tier non-HBM GPU has more than 4 times the bandwidth you would get from a PCIe Gen 7 x16 link, which doesn't even officially exist yet.

              • teleforce 14 hours ago
                Yes, CXL will soon benefit from PCIe Gen 7 x16 with an expected 64GB/s in 2025, and non-HBM I/O bandwidth alternatives are increasing rapidly by the day. For most near-real-time LLM inference it will be feasible. For the majority of SME companies and other DIY users (humans or ants) running their localized LLMs it should not be an issue [1],[2]. In addition, new techniques for more efficient LLMs are being discovered that reduce memory consumption [3].

                [1] Forget ChatGPT: why researchers now run small AIs on their laptops:

                https://news.ycombinator.com/item?id=41609393

                [2] Welcome to LLMflation – LLM inference cost is going down fast:

                https://a16z.com/llmflation-llm-inference-cost/

                [3] New LLM optimization technique slashes memory costs up to 75%:

                https://news.ycombinator.com/item?id=42411409

                • schmidtleonard 5 hours ago
                  No. Memory bandwidth is the important factor for LLM inference. 64GB/s is 4x less than the hypothetical I granted you (Gen7x16 = 256GB/s), which is 4x less than the memory bandwidth on my 2 year old pleb GPU (1TB/s), which is 10x less than a state of the art professional GPU (10TB/s), which is what the cloud services will be using.

                  That's 160x worse than cloud and 16x worse than what I'm using for local LLM. I am keenly aware of the options for compression. I use them every day. The sacrifices I make to run local LLM cut deep compared to the cloud models, and squeezing it down by another factor of 16 will cut deep on top of cutting deep.

                  Nothing says it can't be useful. My most-used model is running in a microcontroller. Just keep those expectations tempered.

                  (EDIT: changed the numbers to reflect red team victory over green team on cloud inference.)

              • Dylan16807 13 hours ago
                > 4 times the bandwidth you would get from a PCIe Gen 7 x16 link

                So you have a full terabyte per second of bandwidth? What GPU is that?

                (The 64GB/s number is an x4 link. If you meant you have over four times that, then it sounds like CXL would be pretty competitive.)

                • schmidtleonard 5 hours ago
                  https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889

                      Memory Size: 24 GB
                      Memory Type: GDDR6X
                      Memory Bus: 384 bit
                      Bandwidth: 1.01 TB/s
                  
                  Bandwidth between where the LLM is stored and where your matrix*vector multiplies are done is the important figure for inference. You want to measure this in terabytes per second, not gigabytes per second.
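
                  Napkin math for what that means for batch-1 decode (model sizes and bandwidths below are assumptions, and these are ceilings, not measurements):

                      # Upper bound on batch-1 decode speed: each generated token streams the
                      # whole (quantized) model through the GPU once.
                      cards = {"RTX 4090": 1008, "RTX 3090": 936, "7900 XTX": 960, "PCIe/CXL-ish": 256}  # GB/s
                      models = {"8B @ 4-bit": 4.5, "70B @ 4-bit": 40.0}                                  # GB

                      for m, gb in models.items():
                          for c, bw in cards.items():
                              print(f"{m} on {c}: <= {bw / gb:.0f} tok/s")
                      # Real throughput lands below these ceilings (KV cache, attention, overhead),
                      # but the ordering tracks memory bandwidth, not FLOPS.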

                  A 7900XTX also has 1TB/s on paper, but you'll need awkward workarounds every time you want to do something (see: article) and half of your workloads will stop dead with driver crashes and you need to decide if that's worth $500 to you.

                  Stacking 3090s is the move if you want to pinch pennies. They have 24GB of memory and 936GB/s of bandwidth each, so almost as good as the 4090, but they're as cheap as the 7900XTX with none of the problems. They aren't as good for gaming or training workloads, but for local inference 3090 is king.

                  It's not a coincidence that the article lists the same 3 cards. These are the 3 cards you should decide between for local LLM, and these are the 3 cards a true competitor should aim to exceed.

                  • Dylan16807 24 minutes ago
                    A 4090 is not "years old pleb tier". Same for 3090 and 7900XTX.

                    There's a serious gap between CXL and RAM, but it's not nearly as big as it used to be.

                • adrian_b 2 hours ago
                  An ancient Radeon VII from 5 years ago already had 1 terabyte per second of memory bandwidth.

                  Later consumer GPUs have regressed, and only the RTX 4090 offers the same memory bandwidth in the current NVIDIA generation.

                  • Dylan16807 22 minutes ago
                    Radeon VII had HBM.

                    So I can understand a call for returning to HBM, but it's an expensive choice and doesn't fit the description.

                • fc417fc802 9 hours ago
                  [dead]
    • throwaway314155 18 hours ago
      > Are these LLMs just absurdly memory bound so it doesn't matter?

      During inference? Definitely. Training is another story.

  • sroussey 18 hours ago
    [2023]

    Btw, this is from MLC-LLM which makes WebLLM and other good stuff.

  • aussieguy1234 12 hours ago
    I got a "gaming" PC for LLM inference with an RTX 3060. I could have gotten more VRAM for my buck with AMD, but didn't, because at the time a lot of inference needed CUDA.

    As soon as AMD is as good as Nvidia for inference, I'll switch over.

    But I've read on here that their hardware engineers aren't even given enough hardware to test with...

  • starlite-5008 5 hours ago
    [dead]
  • leonewton253 15 hours ago
    This benchmark doesn't look right. Is it using the tensor cores in the Nvidia GPU? AMD doesn't have AI cores, so it should run noticeably slower.
    • nomel 15 hours ago
      AMD has WMMA.