Sunday morning, and I find myself wondering how the engineering tinkerer is supposed to best self-host these models? I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly. I'm curious what the current recommendation on that path looks like.
Constraints are the fun part here. I know this isn't the 8x Blackwell Lamborghini, that's the point. :)
If you just want to get something running locally as fast as possible to play with (the 2080 Ti typically had 11GB of VRAM, which will be one of the main limiting factors), the Ollama app will run most of these models locally with minimal user effort.
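For reference, the quickstart really is just a couple of commands. The model tag below is an example, not a recommendation; check the Ollama library for what's actually published, and pick something that fits in (or gracefully spills out of) 11GB of VRAM:

```
# Install Ollama (Linux one-liner from ollama.com), then pull and chat.
# qwen3:30b-a3b is an example tag; anything too big for VRAM will
# partially run on CPU/system RAM -- slower, but it still works.
curl -fsSL https://ollama.com/install.sh | sh
ollama run qwen3:30b-a3b
```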
I use a Macbook Pro with 128GB RAM "unified memory" that's available to both CPU and GPU.
It's slower than a rented Nvidia GPU, but usable for all the models I've tried (even gpt-oss-120b), and works well in a coffee shop on battery and with no internet connection.
I use Ollama to run the models, so can't run the latest until they are ported to the Ollama library. But I don't have much time for tinkering anyway, so I don't mind the publishing delay.
I’d strongly advise ditching Ollama for LM Studio, and using MLX versions of the models. They run quite a bit faster on Apple Silicon. Also, LM Studio is much more polished and feature rich than Ollama.
How's the battery holding up during vibe coding sessions or occasional LLM usage? I've been thinking about getting a MacBook or a laptop with a similar Ryzen chip specifically for that reason.
Currently I don't use vibe coding or even code assistants, so I can't speak to how the battery fares when doing that sort of thing. I don't know how much or how intensively they need to run the underlying LLMs.
For chatting with LLMs via ollama, I've seen total power usage go to about 50W (on an M3 Max) while the LLM is active, which is about 3x-4x power usage compared to just idling with browsers and editors open.
So I'd estimate about 2-3 hours of continuous LLM use on battery. Because I have enough RAM spare, at least there's no need to keep shutting down and reloading models.
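For a rough sanity check on that estimate (the battery capacity and idle draw below are assumptions, not measured specs):

```python
# Back-of-the-envelope battery estimate. All numbers are illustrative:
# ~100 Wh is typical for a 16" MacBook Pro battery; 50 W is the observed
# total draw while the LLM is generating; idle is guessed from "3x-4x idle".
battery_wh = 100
llm_draw_w = 50
idle_draw_w = 14

print(f"continuous LLM use: {battery_wh / llm_draw_w:.1f} h")   # 2.0 h
print(f"idle with apps open: {battery_wh / idle_draw_w:.1f} h")
```

So ~2 hours flat-out, stretching toward 3 with gaps between generations.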
I haven't really pushed it to find out how long they run on battery, as I haven't used LLMs all that much.
I'm more interested in the underlying operations of how they work, investigating novel model architectures and techniques, and optimising performance, than actually using them as an end user :-) Similar to how I enjoyed writing game engines more than playing games :-) Maybe I'll get into using them more in future.
I've recently put together a setup that seemed reasonable for my limited budget. Mind you, most of the components were second-hand, open box deals, or deep discount of the moment.
This comfortably fits FP8-quantized 30B models, which seem to be "top of the line for hobbyists" grade across the board.
Does it offer more performance than a Macbook Pro that could be had for a comparable sum? Your build can be had for under $3k; a used MBP M3 with 64 GB RAM can be had for approximately $3.5k.
I'm not sure; I didn't run any benchmarks. As a ballpark figure: with both cards throttled down to 250W, running a Qwen-30B FP8 model (variant depending on the task), I get upwards of 60 tok/sec. It feels on par with the premium models, tbh.
Of course this is in a single-user environment, with vLLM keeping the model warm.
No NVLink; it took me a long time to compose the exact hardware specs, because I wanted to optimize performance. Both cards are on x8 PCIe direct CPU channels, close to their max throughput anyway. It runs hot with the CPU engaged, but it runs fast.
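As a sketch of that kind of setup (command and flag names are real, but the model tag and limits are placeholders; check your cards' supported power range with `nvidia-smi -q -d POWER` first):

```
# Cap both cards at 250W (needs root; resets on reboot unless persisted).
sudo nvidia-smi -i 0 -pl 250
sudo nvidia-smi -i 1 -pl 250

# Serve an FP8 Qwen-30B variant split across both cards.
# vLLM keeps the model resident ("warm"), so single-user latency stays low.
vllm serve Qwen/Qwen3-30B-A3B-FP8 --tensor-parallel-size 2
```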
I just use my laptop. A modern MacBook Pro will run ~30B models very well. I normally stick to "Max" CPUs (initially for more performance cores, recently also for the GPU power) with 64GB of RAM. My next update will probably be to 128GB of RAM, because 64GB doesn't quite cut it if you want to run large Docker containers and LLMs.
> I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly.
I think you mean RAM, not VRAM. AFAIK this is a 30B MoE model with 3B active parameters, comparable to the Qwen3 MoE model. As long as you don't expect 60 tps, such models should run sufficiently fast.
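The reason a 3B-active MoE stays usable even on modest hardware: decoding is mostly memory-bandwidth bound, and per token you only stream the active experts, not all 30B weights. A back-of-the-envelope sketch (all numbers are illustrative guesses, not measurements):

```python
# Rough MoE decode-speed intuition. Decode streams roughly the *active*
# weights once per token, so bandwidth / active-bytes gives a ceiling.
active_params = 3e9       # Qwen3-30B-A3B activates ~3B params per token
bytes_per_param = 0.5     # 4-bit quantization ~= 0.5 bytes per param
mem_bw = 40e9             # ~40 GB/s: a guess for dual-channel DDR4

bytes_per_token = active_params * bytes_per_param
print(f"decode ceiling: {mem_bw / bytes_per_token:.0f} tok/s")  # 27 tok/s
```

Real throughput lands well below the ceiling (attention, KV cache, overhead), which is consistent with the ~12 tps I see below.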
I run the Qwen3 MoE model (https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/...) in 4-bit quantization on an 11-year-old i5-6600 (32GB) with a Radeon 6600 (8GB). I get ~12 tps with a 16k context on llama.cpp, and according to a quick search your card is faster than that, so that's OK for playing around.
My Radeon (ROCm) specific batch file to start this:
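(The actual file didn't make it into the comment; the sketch below is my guess at what such a batch file typically looks like. The flags are real llama.cpp options, and the GFX override is the usual workaround for the RX 6600, which ROCm doesn't officially support.)

```
REM Hypothetical llama-server launch for an RX 6600 (gfx1032).
REM HSA_OVERRIDE_GFX_VERSION makes ROCm use the supported gfx1030 kernels.
set HSA_OVERRIDE_GFX_VERSION=10.3.0
llama-server ^
  -m Qwen3-30B-A3B-Q4_K_M.gguf ^
  -c 16384 ^
  -ngl 20
REM -ngl: offload only as many layers as fit in 8GB of VRAM; tune to taste.
```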
FWIW Ollama at its defaults with qwen3:30b-a3b has 256k context size and does ~27 tokens/sec on pure CPU on a $450 mini PC with AMD Ryzen 9 8945HS. Unless you need a room heater, that GPU isn't pulling its weight.
How do people deal with all the different quantisations? Generally if I see an Unsloth quant I'm happy to try it locally; with random other people's... how do I know what I'm getting?
(If nothing else Tongyi are currently winning AI with cutest logo)
If you really need a lot of VRAM on the cheap, ROCm still supports the AMD MI50, and you can get 32GB versions of the MI50 on Alibaba/AliExpress for around $150-$250 each. A few people on r/localllama have shown setups with multiple MI50s adding up to 128GB of VRAM and doing a decent job with large models. Obviously it won't run as fast as any brand-new GPU because of memory bandwidth and a few other things, but it's more than fast enough to be usable.
This can end up getting you 128gb of VRAM for under $1000.
As many pointed out, Macs are decent enough to run them (with maxed-out RAM). You also have more alternatives, like the DGX Spark (if you appreciate the ease of CUDA, albeit with a tad slower token generation) or the Strix Halo (good luck with ROCm though, AMD is still peddling hype). There is no straightforward "cheap" answer. You either go big (GPU server) or compromise. Either way, use vLLM, SGLang, or llama.cpp; Ollama is just inferior in every way to llama.cpp.