The `big' AI models are trillion parameter models. The medium sized models like ...

raincole · on March 23, 2024

It's just semantic gymnastics. I'm sure most people will consider LLaMa 70B a big model. Of course if you define big = trillion then sure big = trillion[1].

[1]: https://en.wikipedia.org/wiki/No_true_Scotsman

llm_trw · on March 24, 2024

Yes, you are engaging in the no true scotsman fallacy, please stop.

AnthonyMouse · on March 23, 2024

You can get registered DDR4 for ~$1/GB. A trillion parameter model in FP16 would need ~2TB. Servers that support that much are actually cheap (~$200), the main cost would be the ~$2000 in memory itself. That is going to be dog slow but you can certainly do it if you want to and it doesn't cost $50,000.

FireBeyond · on March 23, 2024

Even looking on Amazon, DDR4 seems still a decent bit above $2/GB:

2 x 32GB: $142

2 x 64GB: $318

8GB: $16

2 x 16GB: $64

2TB of 128GB DDR4 ECC: $9,600 (https://www.amazon.com/NEMIX-RAM-Registered-Compatible-Mothe...)

> Servers that support that much are actually cheap (~$200)

What does this mean? What motherboards support 2TB of RAM at $200? Most of them are pushing $1,000. With no CPU.

It may not hit $50K, but it's definitely not going to be $2K.

AnthonyMouse · on March 24, 2024

Here's a server that supports 3TB of memory for $130, you get 3TB by filling all 24 memory slots with 128GB LRDIMMs, 2TB with 16:

https://www.ebay.com/itm/176298520843

Here are 128GB LRDIMMs for $98:

https://www.ebay.com/itm/196305803969

For 2TB and the server you're at $1698. You can get a drive bracket for a few bucks and a 2TB SSD for $100 and have almost $200 left over to put faster CPUs in it if you want to.

That's stinking Optane, would work if you're desperate. Normal 128GB LRDIMMs cost more than other DDR4 DIMMs. You can, however, get DDR4 RDIMMs for ~$1/GB:

https://www.ebay.com/itm/186345903230

With 32GB RDIMMs that machine would max out at 768GB, which could still run a 1T model at q4 or grok at FP16. And then it would cost less than $1000.

Or find a quad-socket system with 48 memory slots and then use 64GB LRDIMMs ($1.12/GB):

https://www.ebay.com/itm/176299295509

The quad socket systems aren't $200, but you can find them for $550 or so:

https://www.newegg.com/hp-proliant-rack-mount/p/2NS-0006-3E5...

Maybe less if you shop around (they're not as common).

fauigerzigerk · on March 23, 2024

How slow? Depending on the task I fear it could be too slow to be useful.

I believe there is some research on how to distribute large models across multiple GPUs, which could make the cost less lumpy.

AnthonyMouse · on March 23, 2024

You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth. That's assuming it's well-optimized and memory rather than compute bound, but those are often both true or pretty close.

And "depending on the task" is the point. There are systems that would be uselessly slow for real-time interaction but if your concern is to have it process confidential data you don't want to upload to a third party you can just let it run and come back whenever it finishes. And releasing the model allows people to do the latter even if machines necessary to do the former are still prohibitively expensive.

Also, hardware gets cheaper over time and it's useful to have the model out there so it's well-optimized and stable by the time fast hardware becomes affordable instead of waiting for the hardware and only then getting to work on the code.

bawana · on March 23, 2024

Why would increasing memory bandwidth reduce performance? You said "You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth"

AnthonyMouse · on March 24, 2024

Yeah the sentence is backwards, you divide the system's memory bandwidth by the size of the model.

renewiltord · on March 23, 2024

Mixtral 8x7b is better than both of those and runs on a top spec M3 Max wonderfully.

woadwarrior01 · on March 23, 2024

Indeed! Also, Mixtral 8x7b runs just as well on older M1 Max and M2 Max Macs, since LLM inference is memory bandwidth bound and memory bandwidth hasn't significantly changed between M1 and M3.

karolist · on March 23, 2024

It didn't change at all, rather was reduced in certain configurations.

Takennickname · on March 23, 2024

I will make a 2 trillion parameter model just so your comment becomes outdated and wrong.

razodactyl · on March 25, 2024

I approve this comment.

GaggiX · on March 23, 2024

ChatGPT is 20B according to Microsoft researchers, also the fact that big AI models are trillion parameter models is mostly speculation, about GPT-4 it was spread by geohot.

speedgoose · on March 23, 2024

To be precise, ChatGPT 3.5 turbo being 20B is officially a mistake from a Microsoft Researcher, quoting a wrong source published before the release of chatgpt3.5 turbo. Up to you to believe it or not. But I wouldn’t claim it’s a 20B according to Microsoft Researchers.

The withdrawn paper: https://arxiv.org/abs/2310.17680

The wrong source: https://www.forbes.com/sites/forbestechcouncil/2023/02/17/is...

The discussion: https://www.reddit.com/r/LocalLLaMA/comments/17jrj82/new_mic...

GaggiX · on March 23, 2024

It's interesting how the paper was completely retracted instead of just being corrected.

razodactyl · on March 25, 2024

Yep. It feels like a 20B parameter model.

HarHarVeryFunny · on March 23, 2024

GPT-3 was 175B, so it'd be a bit odd if GPT-4 wasn't at least 5x larger (1T), especially since it's apparently a mixture of experts.

razodactyl · on March 25, 2024

I think it became apparent when mixtral came out. I've noticed too during training that my model overwrites useful information so it makes sense for these types of models to have emerged.