Exactly. vLLM doesn’t optimize for latency-first scenarios as it focuses on throughput, i.e. batching. This particular blog post instead focuses on latency, i.e. the fastest you could possibly get with that many GPUs.
Regarding batching, it is coming pretty soon, and we will have another blog post on this.
I really liked the style of the instructor (Kolter), and the reason I like this course very much is because each lecture is followed by an implementation video along with the notebook file.
In most Deep Learning courses, the implementation is left to TAs and neither recorded nor made available. This course is an exception. Another bright exception is NYU Deep Learning course [0] by Yann LeCun and Alfredo Canziani. In that course, too, all recitations ("Practica") are recorded and made available. And Canziani is a great teacher.
tbh I'm not sure what AMD's plan is on ROCm support for consumer devices, but I don't really think AMD is being fraudulent or anything.
Both ROCm and Vulkan are supported in MLC LLM as mentioned in our blog post. We are aware that ROCm is not sufficient to cover consumer hardware, and in that case Vulkan is a nice backup!
If you click the "Radeon" tab here[1], dated 27 Jul, AMD claim ROCm support on a wide range of consumer cards, with HIP SDK support on RX 6800 and up, under Windows. The Linux situation seems less clear.
We haven't done any comparison between them yet, but generally we believe Vulkan, as a more generic cross-vendor API, should be slower than ROCm. Same for CUDA vs Vulkan.
There are two points I personally wanted to make through this project:
1) With a sufficiently optimized software stack, AMD GPUs can be cost-efficient enough to use in LLM serving;
2) ML compilation (MLC) techniques, through the underlying TVM Unity software stack, are the best fit for performance optimizations that generalize across hardware, quickly delivering time-to-market value, etc.
So far, to the best of our knowledge, MLC LLM delivers the best performance across NVIDIA and AMD GPUs in single-batch inference on quantized models, and batched/distributed inference is on the horizon too.
Can you comment on how difficult it was to achieve this, and what the relative advantages b/w cards are? AFAIR, AMD cards were not deemed competitive with Nvidia in the DL space largely because of the amazing job Nvidia pulled off with cuDNN and its conv kernels.
LLMs etc., OTOH, don't really depend on convolutions (at least the pure transformer bits), and instead depend a lot more on plain old GEMM + low-bit float/int compute.
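To illustrate that point, here's a toy NumPy sketch (not MLC LLM code) of a single self-attention step: every heavy operation is a plain matrix multiply, with no convolutions in sight.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Every compute-heavy step below is a GEMM.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return (scores @ v) @ Wo

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))           # 4 tokens, hidden size 8
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) for _ in range(4))
out = attention(x, Wq, Wk, Wv, Wo)
print(out.shape)  # (4, 8)
```

So the kernel a vendor needs to get right here is mostly GEMM, which is why the cuDNN conv-kernel advantage matters less for transformers.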
> Can you comment on how difficult it was to achieve this, and what the relative advantages b/w cards?
Thanks for asking! I personally believe TVM Unity is a proper software stack for ML compilation (MLC), and its existing optimizations (e.g. TensorCore offloading) can be transparently transferred to AMD/Intel/Apple/mobile GPUs without too much engineering effort.
Of course my claim is limited to ML workloads. Not an expert outside the ML world, so I couldn't say for general HPC.
Congrats Junru! I'm not on AMD but love seeing progress in this project. Excited for batched inference -- I didn't think it'd be useful for me but I've realized batched inference is also useful for a single user / edge device workload.
Btw - I got biased sampling working in ad-llama! Catching up to guidance slowly but surely :)
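For readers curious what biased sampling refers to, here's a hypothetical NumPy sketch (not the ad-llama implementation): add a per-token offset to the logits before sampling, so chosen tokens become more or less likely.

```python
import numpy as np

def sample_with_bias(logits, bias, rng):
    # bias maps token id -> logit offset; -inf effectively bans a token
    biased = logits.copy()
    for tok, b in bias.items():
        biased[tok] += b
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = np.array([1.0, 2.0, 0.5])
tok = sample_with_bias(logits, {1: -np.inf}, rng)  # token 1 is banned
```

Guided-generation libraries like guidance build on this kind of logit manipulation to constrain model output.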
This is amazing to hear, Steven! (Sorry I locked myself out of Discord a couple of days ago...) I'm sure there's a bunch of features missing like the biased sampling you mentioned, and I'm more than happy to merge PRs if you'd love to :)
Thank you for this work. I will be staying on nvidia for now, but applaud any progress towards much needed credible competition in the consumer/enthusiast AI hardware space.
One question: given your experience, when would you predict near parity in software stack support between the different platforms, so that the choice of GPU becomes mostly one of price/performance? It doesn't need to be like AMD/Intel in the CPU market, where a consumer has no doubts about software compatibility, but let's say like the gaming GPU market, where a game having problems on a GPU architecture is a newsworthy exception that is quickly corrected.
ROCm has improved a lot over the past few months, and now ROCm 5.6 seems to work out of the box by just following this tutorial: https://rocm.docs.amd.com/en/latest/deploy/linux/installer/i.... TVM Unity, the underlying compiler MLC LLM uses, seems to work out of the box on ROCm 5.6 too - from Bohan Hou, who set up the environment.
Depends what support means to you really. The docs use support to mean things AMD tested and expect to work, modulo errata.
If you're building the stack from source or found it in a Linux repo, decent odds it'll work for you. More likely to work on gfx9 or gfx10 than the older cards. I think that's roughly the last five years.
If you use the official distribution, some parts are compiled to gpu-specific machine code and if your gpu isn't one of those, you can't use that library. I think there's a reluctance to compile the libs for GPUs that aren't in the internal CI in case they don't work.
As an anecdote, I do most development on unsupported hardware, unsupported distro and unsupported kernel, with the upstream driver, using whatever was on llvm main that morning. That mostly works despite positioning myself as most likely to run into bugs.
Are there any docker images containing this? I'd like to avoid getting into dependency hell with other software on my system, as happens all too often with new technologies.
Generally speaking I expect Vulkan to be slower than ROCm, given it's designed as a generic graphics API across GPU vendors, so the takeaway is: whenever ROCm is available and usable, we should use ROCm. And it's the same for CUDA vs Vulkan.
This is coming! Myself and others at OctoML and in the TVM community are actively working on multi-gpu support in the compiler and runtime. Here are some of the merged and active PRs on the multi-GPU (multi-device) roadmap:
The first target will be LLMs on multiple NVIDIA GPUs, but as with all of the MLC-LLM effort, the approach will generalize to other hardware, including AMD's wonderful hardware.
True, and there are some other issues to be addressed. Those two particular issues are on our roadmap.
Regarding quantization, we want to develop a code path that absorbs any quantization format, for example those from GGML or GPTQ, so that they can all be used. ML compilation (MLC) is agnostic to quantization formats, but we just haven't exposed such abstractions yet.
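As a rough illustration of the kind of scheme such a format-agnostic path would have to absorb, here's a hypothetical NumPy sketch of group-wise 4-bit quantization, loosely in the spirit of GGML/GPTQ formats (not their actual encodings):

```python
import numpy as np

def quantize_q4(w, group_size=32):
    # Group-wise asymmetric 4-bit quantization: one scale and
    # zero-point per group of `group_size` weights.
    w = w.reshape(-1, group_size)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                      # 4 bits -> 16 levels
    q = np.clip(np.round((w - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_q4(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
w = rng.standard_normal(128).astype(np.float32)
q, scale, lo = quantize_q4(w)
err = np.abs(dequantize_q4(q, scale, lo).reshape(-1) - w).max()
```

Real formats differ in packing, symmetry, and group size, but a compiler-level abstraction mostly needs to see "packed ints + per-group metadata + a dequantize formula", which is what makes format-agnostic support plausible.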
On CPU offloading: imagine you are writing PyTorch - it should be as simple as a one-liner `some_tensor.cpu()` to bring something down to host memory, and `some_tensor.cuda()` to get it back to CUDA. It seems like a low-hanging fruit, but it's not implemented yet in MLC LLM :( Lots of stuff to do, and we should make this happen soon.
Thanks! Just curious why there is no "team" or "about us" page? It's nice sharing credit, but it also is a little unsettling when blog posts do not name contributors.
Good work though. And you have an active community on GitHub, congratulations.
Really depends on how good ROCm support for WSL2 is. Our team doesn't have a Windows machine so we couldn't verify ourselves, but if you get ROCm set up properly on WSL2, MLC LLM should work out of the box.
I don't think TVM has advertised its full capabilities much - for example, high-perf codegen for dynamic shapes without auto-tuning, or auto-tuning-based codegen - at least in the past few years, and that might be one of the reasons it hasn't gotten a lot of visibility.