Hacker News | junrushao1994's comments

This is great! Have you guys considered integrating with one of the existing systems?


Thanks for the question. Currently Punica is built on the PyTorch and HuggingFace Transformers ecosystem, so PyTorch users can start using Punica right away.

Looking forward to collaborating with TVM and MLC to reach more users :)


Yeah thanks for sharing! This is definitely super valuable data and insights :)

Regarding exllama-V2, MLC/TVM does benchmark against it:

- Single GPU: https://github.com/mlc-ai/llm-perf-bench#int4-quantized-sing...

- Multi GPU: Figure 2 in the blog: http://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infere...

> vLLM focuses more on batching performance

Exactly. vLLM doesn't optimize for latency-first scenarios, as it focuses on throughput, i.e. batching. This particular blog post instead focuses on latency, i.e. the fastest you could possibly get with that many GPUs.

Regarding batching, it is coming pretty soon, and we will have another blog post on this.
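To make the latency-vs-throughput distinction concrete, here's a toy model (assumed, simplified numbers; this is not MLC code): in single-batch decoding every step must stream the full weights once, so batching B requests amortizes that same memory traffic over B tokens.

```python
# Simplified model of memory-bound decoding (illustrative only):
# each decoding step streams all model weights once, regardless of
# batch size, so throughput scales with the batch while per-request
# latency stays roughly flat -- until compute/activations dominate.
def decode_rates(bandwidth_gb_s, model_gb, batch):
    step_time_s = model_gb / bandwidth_gb_s      # time to stream weights once
    per_request_tok_s = 1 / step_time_s          # latency-side speed
    aggregate_tok_s = batch / step_time_s        # throughput-side speed
    return per_request_tok_s, aggregate_tok_s

lat, thr = decode_rates(bandwidth_gb_s=1000, model_gb=35, batch=8)
```

Under this (idealized) model, throughput grows linearly with batch size while per-request latency is unchanged, which is why a throughput-first system like vLLM and a latency-first benchmark measure different things.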


Machine Learning Compilation (MLC) now supports compiling LLMs to multiple GPUs.

For Llama2-70B, it runs 4-bit quantized Llama2-70B at:

- 34.5 tok/sec on two NVIDIA RTX 4090 at $3k

- 29.9 tok/sec on two AMD Radeon 7900XTX at $2k

- It also scales well to 8 A10G/A100 GPUs in our experiments.
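For a rough sense of the cost-efficiency implied by the numbers above (prices are the approximate figures quoted, not exact market prices):

```python
# Tokens/sec per $1k of GPU hardware, from the quoted figures.
def tok_s_per_kusd(tok_s: float, price_usd: float) -> float:
    return tok_s / (price_usd / 1000)

nvidia = tok_s_per_kusd(34.5, 3000)  # 2x NVIDIA RTX 4090
amd = tok_s_per_kusd(29.9, 2000)     # 2x AMD Radeon 7900 XTX
print(f"NVIDIA: {nvidia:.1f} tok/s per $1k; AMD: {amd:.2f} tok/s per $1k")
```

By this rough measure the AMD setup delivers more tokens per dollar, even though the NVIDIA setup is faster in absolute terms.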

Details:

- Blog post: https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...

- Project: https://github.com/mlc-ai/mlc-llm


Ah please help us by submitting a PR! I noticed the rust build failed last night but didn’t get a chance to look into it


This is a particularly unique course, offering an introduction to ML compilation and deployment :)


I really liked the style of the instructor (Kolter), and the reason I like this course very much is that each lecture is followed by an implementation video along with the notebook file.

In most Deep Learning courses, the implementation is left to TAs and neither recorded nor made available. This course is an exception. Another bright exception is NYU Deep Learning course [0] by Yann LeCun and Alfredo Canziani. In that course, too, all recitations ("Practica") are recorded and made available. And Canziani is a great teacher.

[0]: https://atcold.github.io/pytorch-Deep-Learning


I also really like the instructor for this course!

Seems like he really cares. I looked him up and I guess he was a student of Andrew Ng (the legendary ML lecturer!!) so it makes sense.


Thanks, this is a wonderful recommendation


As of today, performance in WebGPU isn't as competitive yet, but there is really quite a lot of low-hanging fruit for WebGPU to pick up.


LLM decoding is dominated by memory bandwidth, and the 3090 Ti and 4090 happen to have identical theoretical memory bandwidth.
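A back-of-envelope sketch of why that matters (assumed figures: ~1008 GB/s is the spec bandwidth of both cards): single-batch decoding must stream every weight once per generated token, so bandwidth divided by model size bounds tokens/sec.

```python
# Rough decode-speed ceiling for memory-bandwidth-bound inference:
# each generated token reads the whole (quantized) model once.
def decode_ceiling_tok_s(bandwidth_gb_s, params_billion, bits_per_param):
    model_gb = params_billion * bits_per_param / 8
    return bandwidth_gb_s / model_gb

# 4-bit Llama2-70B ~= 35 GB of weights; 3090 Ti / 4090 ~= 1008 GB/s each
ceiling = decode_ceiling_tok_s(1008, 70, 4)  # ~28.8 tok/s upper bound
```

Since the two cards share the same bandwidth, this ceiling is the same for both, regardless of their very different compute throughput.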


TBH I'm not sure what AMD's plan is for ROCm support on consumer devices, but I don't really think AMD is being fraudulent or anything.

Both ROCm and Vulkan are supported in MLC LLM, as mentioned in our blog post. We are aware that ROCm is not sufficient to cover consumer hardware, and in that case Vulkan is a nice backup!


If you click the "Radeon" tab here[1], dated 27 Jul, AMD claims ROCm support on a wide range of consumer cards, with HIP SDK support on the RX 6800 and up, under Windows. The Linux situation seems less clear.

1: https://rocm.docs.amd.com/en/latest/release/windows_support....


Given AMD's track record, the 6900 will be dropped next year or in early 2025.


How does the performance with Vulkan compare to the ROCm performance on the same hardware?


We haven't done any comparison between them yet, but generally we believe Vulkan, as a more generic cross-vendor API, should be slower than ROCm. The same goes for Vulkan vs CUDA.


One of the authors here. Glad it’s on HackerNews!

There are two points I personally wanted to make through this project:

1) With a sufficiently optimized software stack, AMD GPUs can be sufficiently cost-efficient for use in LLM serving; 2) ML compilation (MLC) techniques, through the underlying TVM Unity software stack, are the best fit for performance optimizations that generalize across hardware, quickly delivering time-to-market value, etc.

So far, to the best of our knowledge, MLC LLM delivers the best performance across NVIDIA and AMD GPUs in single-batch inference on quantized models, and batched/distributed inference is on the horizon too.


The numbers look amazing.

Can you comment on how difficult it was to achieve this, and what the relative advantages between the cards are? AFAIR, AMD cards were not deemed competitive with Nvidia in the DL space largely because of the amazing job Nvidia pulled off with cuDNN and its conv. kernels.

LLMs etc., OTOH, don't really depend on convolutions (at least the pure transformer bits), and instead depend a lot more on plain old GEMM + low-bit float/int compute.


> Can you comment on how difficult it was to achieve this, and what the relative advantages between the cards are?

Thanks for asking! I personally believe TVM Unity is a proper software stack for ML compilation (MLC), and its existing optimizations (e.g. TensorCore offloading) can be transparently transferred to AMD/Intel/Apple/mobile GPUs without too much engineering effort.

Of course my claim is limited to ML workloads. Not an expert outside the ML world, so I couldn't say for general HPC.


Congrats Junru! I'm not on AMD but love seeing progress in this project. Excited for batched inference -- I didn't think it'd be useful for me but I've realized batched inference is also useful for a single user / edge device workload.

Btw - I got biased sampling working in ad-llama! Catching up to guidance slowly but surely :)


This is amazing to hear, Steven! (Sorry, I locked myself out of Discord a couple of days ago...) I'm sure there are a bunch of features missing, like the biased sampling you mentioned, and I'm more than happy to merge PRs if you'd love to :)


Thank you for this work. I will be staying on nvidia for now, but applaud any progress towards much needed credible competition in the consumer/enthusiast AI hardware space.

One question: given your experience, when would you predict near parity in software stack support between the different platforms, so that the choice of GPU becomes mostly one of price/performance? It does not need to be like the AMD/Intel situation in the CPU market, where a consumer will have no doubts about software compatibility, but let's say like the gaming GPU market, where a game having problems on a GPU architecture is a newsworthy exception that is quickly corrected.


Honestly at a loss why this got downvoted.


Did the ROCm 5.6 toolchain work for you out of the box? If not, what sort of hacking / hand holding did it need?

I don't know whether there's a LLM inference benchmark in the CI suite, if not perhaps something like this should be included in it.


ROCm has improved a lot over the past few months, and now ROCm 5.6 seems to work out of the box by just following this tutorial: https://rocm.docs.amd.com/en/latest/deploy/linux/installer/i.... TVM Unity, the underlying compiler MLC LLM uses, seems to work out of the box on ROCm 5.6 too, according to Bohan Hou, who set up the environment.


Awesome. I'm going to paste that into the rocm dev channel. Actual positive feedback on HN, novel and delightful. Thank you for the blog post too!


https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... and https://community.amd.com/t5/rocm/new-rocm-5-6-release-bring... suggest that Linux support is really limited at this point. Is this information inaccurate?


Depends what support means to you really. The docs use support to mean things AMD tested and expect to work, modulo errata.

If you're building the stack from source or found it in a Linux repo, there are decent odds it'll work for you. It's more likely to work on gfx9 or gfx10 than on the older cards. I think that's roughly the last five years.

If you use the official distribution, some parts are compiled to gpu-specific machine code and if your gpu isn't one of those, you can't use that library. I think there's a reluctance to compile the libs for GPUs that aren't in the internal CI in case they don't work.

As an anecdote, I do most development on unsupported hardware, unsupported distro and unsupported kernel, with the upstream driver, using whatever was on llvm main that morning. That mostly works despite positioning myself as most likely to run into bugs.


I'm still on rocm 5.4, been working great on my 6750XT for the past few months (Arch).


Are there any docker images containing this? I'd like to avoid getting into dependency hell with other software on my system, as happens all too often with new technologies.


There are, thankfully, quite a few. I've mostly used rocm/rocm-terminal and rocm/rocm-dev.

https://hub.docker.com/u/rocm


Yes, it works out of the box, and the blog post contains a prebuilt Python package that you can try out.


Have you tested Vulkan API on the 7900 XTX? Was it faster or slower than ROCm?


Generally speaking, I expect Vulkan to be slower than ROCm, given it's designed for generic gaming across GPU vendors. So the takeaway is: whenever ROCm is available and usable, we should use ROCm. The same goes for CUDA vs Vulkan.


What slows it down? Shouldn't Vulkan expose compute queues of the GPUs as well?


I don't have any expectations, but there are reasons for Vulkan to be faster.

It's a mature technology used by millions of people every day.

Unlike GPGPU compute, for video games performance directly affects usability.

For these reasons, the software on all levels of the stack might be more optimized.


Can I use two at the same time? Two 7900 XTXs would be the price of one 4090, but with much higher performance (260 tok/sec).


This is coming! I and others at OctoML and in the TVM community are actively working on multi-GPU support in the compiler and runtime. Here are some of the merged and active PRs on the multi-GPU (multi-device) roadmap:

- Support in TVM’s graph IR (Relax): https://github.com/apache/tvm/pull/15447

- Support in TVM’s loop IR (TensorIR): https://github.com/apache/tvm/pull/14862

- Distributed dialect of TVM’s graph IR for multi-node (GSPMD-type): https://github.com/apache/tvm/pull/15289

The first target will be LLM's on multiple NVIDIA GPUs but as with all of MLC-LLM effort, the approach will generalize to other hardware including AMD's wonderful hardware.
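To illustrate the core idea behind multi-GPU LLM serving (a toy sketch in plain Python, not TVM's actual API): tensor parallelism shards a weight matrix across devices, each device computes a partial matmul independently, and the results are gathered back.

```python
def matmul(a, b):
    # a: m x k, b: k x n, as plain nested lists
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_cols(b, parts):
    # Column-shard a weight matrix: one shard per "device".
    step = len(b[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in b] for p in range(parts)]

def tensor_parallel_matmul(a, b, devices=2):
    shards = split_cols(b, devices)             # each device holds one shard
    partials = [matmul(a, s) for s in shards]   # computed independently
    # "all-gather": stitch the column shards back into the full result
    return [sum((p[i] for p in partials), []) for i in range(len(a))]
```

On real hardware the partial matmuls run concurrently on separate GPUs, which is where the per-token speedup comes from; the gather step is the communication cost.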


This is exciting, but it is still very apparent that more time is needed.


<3


When you say best performance on Nvidia, do you mean against any other method of running this model on an Nvidia card?


I can confirm this, mlc is shockingly fast on my RTX 2060.

The catch is:

- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet)

- There is no CPU offloading (or splitting onto an iGPU) like llama.cpp has yet (unless it's new and I missed it).


True, and there are some other issues to be addressed. Those two particular issues are on our roadmap.

Regarding quantization, we want to develop a code path that absorbs any quantization format, for example those from GGML or GPTQ, so that they can all be used. ML compilation (MLC) is agnostic to quantization formats; we just haven't exposed such abstractions yet.
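As a hypothetical sketch of what such an abstraction could look like (names are illustrative, not MLC LLM's actual API): each format implements the same quantize/dequantize interface, and everything downstream sees only the interface.

```python
from dataclasses import dataclass

@dataclass
class SymmetricInt4:
    """Groupwise symmetric 4-bit: w ~= q * scale, with q in [-8, 7]."""
    group_size: int = 32

    def quantize(self, w: float, scale: float) -> int:
        # Round to the nearest int4 value and clamp to the representable range.
        return max(-8, min(7, round(w / scale)))

    def dequantize(self, q: int, scale: float) -> float:
        return q * scale

# Any format exposing the same two methods (e.g. a GGML- or GPTQ-style
# scheme with zero points) could be swapped in behind this interface.
```

The compiler would then generate dequantize-then-compute kernels against the interface, without caring which format produced the packed weights.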

On CPU offloading, imagine you are writing PyTorch: it should be as simple as a one-liner `some_tensor.cpu()` to bring something down to host memory, and `some_tensor.cuda()` to get it back to CUDA. It seems like low-hanging fruit, but it's not implemented yet in MLC LLM :( Lots of stuff to do, and we should make this happen soon.
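As a toy illustration of the idea (a pure-Python stand-in; in PyTorch the moves would literally be `tensor.cpu()` and `tensor.cuda()`): keep a budget of layers resident on the GPU and page the rest in on demand.

```python
class LayerOffloadCache:
    """Toy LRU policy deciding which model layers live on the 'GPU'."""

    def __init__(self, num_layers: int, gpu_budget: int):
        self.gpu_budget = gpu_budget
        self.lru = []                                  # most recently used last
        self.location = {i: "cpu" for i in range(num_layers)}

    def fetch(self, layer: int) -> str:
        # Ensure `layer` is resident on the GPU before running it.
        if self.location[layer] == "cpu":
            if len(self.lru) >= self.gpu_budget:
                evicted = self.lru.pop(0)              # would be tensor.cpu()
                self.location[evicted] = "cpu"
            self.location[layer] = "gpu"               # would be tensor.cuda()
        else:
            self.lru.remove(layer)
        self.lru.append(layer)
        return self.location[layer]
```

A real implementation would also overlap the host-device copies with compute, but the residency bookkeeping is essentially this simple.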


Yeah, we tried out popular solutions like exllama and llama.cpp, among others that support inference of 4-bit quantized models.


Thanks! Just curious why there is no "team" or "about us" page? It's nice sharing credit, but it is also a little unsettling when blog posts do not name contributors.

Good work though. And you have an active community on GitHub, congratulations.


Well, I'm very much into true open source, and my belief is that any contributor is automatically part of the team :)


I know plenty of open-source projects that list and thank every individual contributor. The website could do that too!


That's a great idea! We should dig around and see if there's any plugin to use


Is this similar to the MosaicML AMD MI250 vs Nvidia A100 results, but with consumer-grade hardware? https://www.mosaicml.com/blog/amd-mi250

Might be interesting to team up.


Does it work with WSL2?


It really depends on how good ROCm support for WSL2 is. Our team doesn't have a Windows machine, so we couldn't verify ourselves, but if you get ROCm set up properly on WSL2, MLC LLM should work out of the box.


You can also try out the Vulkan backend, which we know should work on Windows, although speed might be slower than ROCm.


FWIW I did get the CUDA backend running via WSL2


I don't think TVM advertised its full capabilities much, for example high-performance codegen for dynamic shapes without auto-tuning, or auto-tuning-based codegen, at least in the past few years, and that might be one of the reasons it hasn't gotten a lot of visibility.


I think this is true of AI compilation in general. Torch MLIR, AITemplate and really everything here fly under the radar:

https://github.com/merrymercy/awesome-tensor-compilers#open-...

