
The `big' AI models are trillion parameter models.

The medium-sized models like GPT-3 and Grok are 175b and 314b respectively.

There is no way for _anyone_ to run these on a sub-$50k machine in 2024, and even if you could, token generation speed on CPU would be under 0.1 tokens per second.



It's just semantic gymnastics. I'm sure most people would consider LLaMA 70B a big model. Of course, if you define big = trillion, then sure, big = trillion[1].

[1]: https://en.wikipedia.org/wiki/No_true_Scotsman


Yes, you are engaging in the no true Scotsman fallacy, please stop.


You can get registered DDR4 for ~$1/GB. A trillion-parameter model in FP16 would need ~2TB. Servers that support that much are actually cheap (~$200); the main cost would be the ~$2000 in memory itself. That is going to be dog slow, but you can certainly do it if you want to, and it doesn't cost $50,000.
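As a rough sketch of that arithmetic (2 bytes per parameter for FP16 and the ~$1/GB used-DDR4 price above are the assumptions):

```python
def model_memory_cost(params, bytes_per_param=2, usd_per_gb=1.0):
    """Estimate RAM needed (GB) and cost (USD) to hold a model's weights.

    bytes_per_param=2 corresponds to FP16; usd_per_gb is the assumed
    price of used registered DDR4.
    """
    gb = params * bytes_per_param / 1e9
    return gb, gb * usd_per_gb

gb, cost = model_memory_cost(1e12)  # 1T parameters in FP16
print(f"{gb:.0f} GB, ~${cost:.0f}")  # -> 2000 GB, ~$2000
```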


Even looking on Amazon, DDR4 still seems to run $2/GB or more:

2 x 32GB: $142

2 x 64GB: $318

8GB: $16

2 x 16GB: $64

2TB of 128GB DDR4 ECC: $9,600 (https://www.amazon.com/NEMIX-RAM-Registered-Compatible-Mothe...)

> Servers that support that much are actually cheap (~$200)

What does this mean? What motherboards support 2TB of RAM at $200? Most of them are pushing $1,000. With no CPU.

It may not hit $50K, but it's definitely not going to be $2K.


Here's a server that supports 3TB of memory for $130; you get 3TB by filling all 24 memory slots with 128GB LRDIMMs, or 2TB with 16:

https://www.ebay.com/itm/176298520843

Here are 128GB LRDIMMs for $98:

https://www.ebay.com/itm/196305803969

For 2TB plus the server you're at $1698. You can get a drive bracket for a few bucks and a 2TB SSD for $100, and have almost $200 left over to put faster CPUs in it if you want to.
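A quick tally of that build as a sketch (the $130 server, sixteen $98 128GB LRDIMMs, and the ~$100 SSD are the figures quoted above):

```python
server = 130           # eBay chassis quoted above
dimms = 16 * 98        # sixteen 128 GB LRDIMMs for 2 TB
ssd = 100              # optional 2 TB SSD

print(server + dimms)        # -> 1698, the $1698 figure
print(server + dimms + ssd)  # -> 1798, ~$200 short of $2k for CPU upgrades
```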

That's stinking Optane; it would work if you're desperate. Normal 128GB LRDIMMs cost more than other DDR4 DIMMs. You can, however, get DDR4 RDIMMs for ~$1/GB:

https://www.ebay.com/itm/186345903230

With 32GB RDIMMs that machine would max out at 768GB, which could still run a 1T model at q4 or grok at FP16. And then it would cost less than $1000.
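To sanity-check that capacity claim (q4 ≈ 4 bits per parameter and Grok-1 at 314B are the assumptions):

```python
def weights_gb(params, bits_per_param):
    # bytes = params * bits / 8; GB = bytes / 1e9
    return params * bits_per_param / 8 / 1e9

print(weights_gb(1e12, 4))    # 1T model at q4   -> 500.0 GB
print(weights_gb(314e9, 16))  # Grok-1 at FP16   -> 628.0 GB
# Both fit in 768 GB (24 slots x 32 GB RDIMMs), with room for KV cache.
```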

Or find a quad-socket system with 48 memory slots and then use 64GB LRDIMMs ($1.12/GB):

https://www.ebay.com/itm/176299295509

The quad-socket systems aren't $200, but you can find them for $550 or so:

https://www.newegg.com/hp-proliant-rack-mount/p/2NS-0006-3E5...

Maybe less if you shop around (they're not as common).


How slow? Depending on the task I fear it could be too slow to be useful.

I believe there is some research on how to distribute large models across multiple GPUs, which could make the cost less lumpy.


You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth. That's assuming it's well-optimized and memory rather than compute bound, but those are often both true or pretty close.

And "depending on the task" is the point. There are systems that would be uselessly slow for real-time interaction but if your concern is to have it process confidential data you don't want to upload to a third party you can just let it run and come back whenever it finishes. And releasing the model allows people to do the latter even if machines necessary to do the former are still prohibitively expensive.

Also, hardware gets cheaper over time and it's useful to have the model out there so it's well-optimized and stable by the time fast hardware becomes affordable instead of waiting for the hardware and only then getting to work on the code.


Why would increasing memory bandwidth reduce performance? You said "You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth"


Yeah, the sentence is backwards: you divide the system's memory bandwidth by the size of the model.
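With the formula the right way up, a minimal sketch (the 40 GB model size and 200 GB/s bandwidth are illustrative numbers, not from the thread):

```python
def est_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    """Rough tokens/s for a memory-bound dense model: generating each
    token streams all the weights through memory once."""
    return bandwidth_gb_s / model_size_gb

# e.g. a ~70B model quantized to ~40 GB on a machine with ~200 GB/s:
print(est_tokens_per_sec(200, 40))  # -> 5.0
```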


Mixtral 8x7b is better than both of those and runs on a top spec M3 Max wonderfully.


Indeed! Also, Mixtral 8x7b runs just as well on older M1 Max and M2 Max Macs, since LLM inference is memory bandwidth bound and memory bandwidth hasn't significantly changed between M1 and M3.


It didn't increase at all; rather, it was reduced in certain configurations.


I will make a 2 trillion parameter model just so your comment becomes outdated and wrong.


I approve this comment.


ChatGPT is 20B according to Microsoft researchers. Also, the claim that the big AI models are trillion-parameter models is mostly speculation; for GPT-4, it was spread by geohot.


To be precise, the claim that ChatGPT 3.5 Turbo is 20B was officially a mistake by a Microsoft researcher, who cited a wrong source published before the release of ChatGPT 3.5 Turbo. Up to you to believe it or not, but I wouldn't claim it's 20B "according to Microsoft researchers".

The withdrawn paper: https://arxiv.org/abs/2310.17680

The wrong source: https://www.forbes.com/sites/forbestechcouncil/2023/02/17/is...

The discussion: https://www.reddit.com/r/LocalLLaMA/comments/17jrj82/new_mic...


It's interesting how the paper was completely retracted instead of just being corrected.


Yep. It feels like a 20B parameter model.


GPT-3 was 175B, so it'd be a bit odd if GPT-4 weren't at least 5x larger (~1T), especially since it's apparently a mixture of experts.


I think it became apparent when Mixtral came out. I've also noticed during training that my model overwrites useful information, so it makes sense for these types of models to have emerged.



