Sunday morning, and I find myself wondering how the engineering tinkerer is supposed to best self-host these models? I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly. I'm curious what the current recommendation on that path looks like.
Constraints are the fun part here. I know this isn't the 8x Blackwell Lamborghini, that's the point. :)
If you just want to get something running locally as fast as possible to play with (the 2080 Ti typically had 11GB of VRAM, which will be one of the main limiting factors), the Ollama app will run most of these models locally with minimal user effort.
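For reference, the quickstart really is just a couple of commands. The model tag below is an example, not a recommendation; check the Ollama library for what's actually published, and pick something that fits in (or gracefully spills out of) 11GB of VRAM:

```
# Install Ollama (Linux one-liner from ollama.com), then pull and chat.
# qwen3:30b-a3b is an example tag; anything too big for VRAM will
# partially run on CPU/system RAM -- slower, but it still works.
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3:30b-a3b
```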
I use a Macbook Pro with 128GB RAM "unified memory" that's available to both CPU and GPU.
It's slower than a rented Nvidia GPU, but usable for all the models I've tried (even gpt-oss-120b), and works well in a coffee shop on battery and with no internet connection.
I use Ollama to run the models, so can't run the latest until they are ported to the Ollama library. But I don't have much time for tinkering anyway, so I don't mind the publishing delay.
I’d strongly advise ditching Ollama for LM Studio, and using MLX versions of the models. They run quite a bit faster on Apple Silicon. Also, LM Studio is much more polished and feature rich than Ollama.
How's the battery holding up during vibe coding sessions or occasional LLM usage? I've been thinking about getting a MacBook or a laptop with a similar Ryzen chip specifically for that reason.
Currently I don't use vibe coding or even code assistants, so I can't speak to how the battery fares when doing that sort of thing. I don't know how much or how intensively they need to run the underlying LLMs.
For chatting with LLMs via ollama, I've seen total power usage go to about 50W (on an M3 Max) while the LLM is active, which is about 3x-4x power usage compared to just idling with browsers and editors open.
So I'd estimate about 2-3 hours of continuous LLM use on battery. Because I have enough RAM spare, at least there's no need to keep shutting down and reloading models.
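For a rough sanity check on that estimate (the battery capacity and idle draw below are assumptions, not measured specs):

```python
# Back-of-the-envelope battery estimate. All numbers are illustrative:
# ~100 Wh is typical for a 16" MacBook Pro battery; 50 W is the observed
# total draw while the LLM is generating; idle is guessed from "3x-4x idle".
battery_wh = 100
llm_draw_w = 50
idle_draw_w = 14

print(f"continuous LLM use: {battery_wh / llm_draw_w:.1f} h")   # 2.0 h
print(f"idle with apps open: {battery_wh / idle_draw_w:.1f} h")
```

So ~2 hours flat-out, stretching toward 3 with gaps between generations.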
I haven't really pushed it to find out how long they run on battery, as I haven't used LLMs all that much.
I'm more interested in the underlying operations of how they work, investigating novel model architectures and techniques, and optimising performance, than actually using them as an end user :-) Similar to how I enjoyed writing game engines more than playing games :-) Maybe I'll get into using them more in future.
I've recently put together a setup that seemed reasonable for my limited budget. Mind you, most of the components were second-hand, open box deals, or deep discount of the moment.
This comfortably fits FP8-quantized 30B models, which seem to be "top of the line for hobbyists" grade across the board.
Does it offer more performance than a Macbook Pro that could be had for a comparable sum? Your build can be had for under $3k; a used MBP M3 with 64 GB RAM can be had for approximately $3.5k.
I'm not sure; I didn't run any benchmarks. As a ballpark figure: with both cards throttled down to 250W, running a Qwen-30B FP8 model (variant depending on the task), I get upwards of 60 tok/sec. It feels on par with the premium models, tbh.
Of course this is in a single-user environment, with vLLM keeping the model warm.
No NVLink; it took me a long time to compose the exact hardware specs, because I wanted to optimize performance. Both cards are on x8 PCIe direct CPU channels, close to their max throughput anyway. It runs hot with the CPU engaged, but it runs fast.
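As a sketch of that kind of setup (command and flag names are real, but the model tag and limits are placeholders; check your cards' supported power range with `nvidia-smi -q -d POWER` first):

```
# Cap both cards at 250W (needs root; resets on reboot unless persisted).
sudo nvidia-smi -i 0 -pl 250
sudo nvidia-smi -i 1 -pl 250

# Serve an FP8 Qwen-30B variant split across both cards.
# vLLM keeps the model resident ("warm"), so single-user latency stays low.
vllm serve Qwen/Qwen3-30B-A3B-FP8 --tensor-parallel-size 2
```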
I just use my laptop. A modern MacBook Pro will run ~30B models very well. I normally stick to "Max" CPUs (initially for more performance cores, recently also for the GPU power) with 64GB of RAM. My next update will probably be to 128GB of RAM, because 64GB doesn't quite cut it if you want to run large Docker containers and LLMs.
> I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly.
I think you mean RAM, not VRAM. AFAIK this is a 30B MoE model with 3B active parameters, comparable to the Qwen3 MoE model. As long as you don't expect 60 tps, such models should run sufficiently fast.
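The reason a 3B-active MoE stays usable even on modest hardware: decoding is mostly memory-bandwidth bound, and per token you only stream the active experts, not all 30B weights. A back-of-the-envelope sketch (all numbers are illustrative guesses, not measurements):

```python
# Rough MoE decode-speed intuition. Decode streams roughly the *active*
# weights once per token, so bandwidth / active-bytes gives a ceiling.
active_params = 3e9       # Qwen3-30B-A3B activates ~3B params per token
bytes_per_param = 0.5     # 4-bit quantization ~= 0.5 bytes per param
mem_bw = 40e9             # ~40 GB/s: a guess for dual-channel DDR4

bytes_per_token = active_params * bytes_per_param
print(f"decode ceiling: {mem_bw / bytes_per_token:.0f} tok/s")  # 27 tok/s
```

Real throughput lands well below the ceiling (attention, KV cache, overhead), which is consistent with the ~12 tps I see below.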
I run the Qwen3 MoE model (https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/...) in 4-bit quantization on an 11-year-old i5-6600 (32GB) with a Radeon 6600 (8GB). I get ~12 tps with a 16k context on llama.cpp, and according to a quick search your card is faster than that, so that's OK for playing around.
My Radeon (ROCm) specific batch file to start this:
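(The actual file didn't make it into the comment; the sketch below is my guess at what such a batch file typically looks like. The flags are real llama.cpp options, and the GFX override is the usual workaround for the RX 6600, which ROCm doesn't officially support.)

```
REM Hypothetical llama-server launch for an RX 6600 (gfx1032).
REM HSA_OVERRIDE_GFX_VERSION makes ROCm use the supported gfx1030 kernels.
set HSA_OVERRIDE_GFX_VERSION=10.3.0
llama-server ^
  -m Qwen3-30B-A3B-Q4_K_M.gguf ^
  -c 16384 ^
  -ngl 20
REM -ngl: offload only as many layers as fit in 8GB of VRAM; tune to taste.
```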
FWIW Ollama at its defaults with qwen3:30b-a3b has 256k context size and does ~27 tokens/sec on pure CPU on a $450 mini PC with AMD Ryzen 9 8945HS. Unless you need a room heater, that GPU isn't pulling its weight.
How do people deal with all the different quantisations? Generally if I see an Unsloth quant I'm happy to try it locally; with random other people's... how do I know what I'm getting?
(If nothing else Tongyi are currently winning AI with cutest logo)
If you really need a lot of VRAM on the cheap, ROCm still supports the AMD MI50, and you can get 32GB versions of the MI50 on Alibaba/AliExpress for around $150-$250 each. A few people on r/localllama have shown setups with multiple MI50s adding up to 128GB of VRAM and doing a decent job with large models. Obviously it won't run as fast as any brand-new GPU because of memory bandwidth and a few other things, but it's more than fast enough to be usable.
This can end up getting you 128gb of VRAM for under $1000.
As many pointed out, Macs are decent enough to run them (with maxed-out RAM). You also have more alternatives, like the DGX Spark (if you appreciate the ease of CUDA, albeit with a tad slower token generation) or the Strix Halo (good luck with ROCm though, AMD is still peddling hype). There is no straightforward "cheap" answer. You either go big (GPU server) or compromise. Either way, use vLLM, SGLang, or llama.cpp; Ollama is just inferior in every way to llama.cpp.