Linux graphics you mean? The compute ones are fine.
It's not so much an issue of supporting old versions; it's that new versions of Linux break the drivers, since the kernel developers don't know what's in them.
But such servers have no need to, for example, suspend and resume correctly. Or handle hot-plugging of displays correctly. Or install updates in a completely reliable manner. Or include the 32 bit support needed for Steam to work.
Pardon me for the stupid question, but wouldn’t you get much better bang for the buck with high-end consumer GPUs from Nvidia? I think you’d be limited to 16GB of memory per card, but you could have a full 8-card system for less than the price of one H100. Is there really no way to partition the workload to run with 16GB memory per card?
> but wouldn’t you get much better bang for the buck with high end consumer GPUs from Nvidia
It depends on the workload, but generally for large neural network model tasks, no, while for most other accelerated tasks, yes.
> Is there really no way to partition the workload to run with 16gb memory per card?
With regards to the neural networks in the news, you can partition the workloads; the 8xH100 machines are partitioned in the same way. Nonetheless, performance is very poor on consumer cards, because they lack the very high speed interconnect of the H100 that makes partitioning performant. Additionally, the amount of RAM on the H100 is matched to the capabilities of the chip itself (it sits at the peak of an upside-down-U performance curve), which is why it is not possible to "just" add more RAM to these chips.
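To get a feel for why the interconnect dominates, here's a back-of-envelope estimate of how long one gradient all-reduce takes over NVLink vs plain PCIe. The bandwidth figures are ballpark assumptions, not measured specs:

```python
# Rough, illustrative estimate of per-step all-reduce time in data-parallel
# training. Bandwidth numbers below are ballpark assumptions.
def allreduce_seconds(param_count, bytes_per_param, bus_gb_s):
    # A ring all-reduce moves roughly 2x the gradient volume over the bus.
    volume_bytes = 2 * param_count * bytes_per_param
    return volume_bytes / (bus_gb_s * 1e9)

params = 13e9  # a 13B-parameter model, fp16 gradients (2 bytes each)
nvlink = allreduce_seconds(params, 2, 900)  # ~900 GB/s NVLink-class (assumed)
pcie   = allreduce_seconds(params, 2, 32)   # ~32 GB/s PCIe 4.0 x16 (assumed)
print(f"NVLink: {nvlink:.3f}s  PCIe: {pcie:.3f}s per sync")
```

With these assumed numbers the PCIe sync is nearly 30x slower per step, which is the kind of gap that swamps any per-card compute advantage.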
Since you seem to know and I'm having a rough time finding good info, I'd love to put together a mid range 3090 box to mess around with LLMs. Do you have any pointers to a good build list?
There's a EULA clause that prohibits using consumer-grade GPUs in servers or renting them out (can't remember the exact rule, but it's definitely not allowed for Amazon or others to use them).
Haha so as long as you append some light proof of work to every task, you get them on a technicality?
Honestly though, who the fuck are they to tell you what to do with the hardware that you bought? Like telling people they are not allowed to drive their car on gravel roads or not being allowed to drink wine from a beer glass without paying for a special overpriced version or something.
Usually it involves warranty issues. Although in this case, it seems especially odd since this clause was added to effectively force data centers to buy their more expensive hardware. It goes back a while (~2018), so you'd think they would have revised this policy:
It mostly just means you shouldn't expect them to provide any support for that use case. It isn't all that uncommon for smaller studios to just cluster together a couple of computers with 3-4 high end gaming GPUs each as a fairly powerful render farm, for instance.
Why the #-@-#&# do they allow blockchains??? I guess they probably figured out it's good for the bottom line and all, but it's pretty obnoxious to allow blockchains, which seem to have been the cause of the shortages, while banning everything else.
After ETH switched from PoW to PoS, it had the effect of decimating the entire GPU mining market. Very few GPUs are being used for crypto any more.
If there are shortages, these days it is because of ML/AI/Gaming.
In fact, there are shortages... did you know it is nearly impossible to get data center space of any size for any decent price? AI has eaten it all up.
On top of it, AI doesn't care about the environment. There is so much money being dumped into it right now, that people are pulling power from wherever they can get it and at any cost.
On the bright side, at least crypto tried to go the cheapest routes... which happened to be mostly green based power.
Realistically you can also build a server with 8 RTX 3090s and put it in colo in a datacenter. That would technically violate the license, but nobody is going to go after you for that. What Nvidia wants to prevent is hosting providers/IaaS renting out servers with consumer GPUs.
GPU cores are powerful, but you need to feed them data at very high speeds, and large models require passing around a lot of data. This is why Nvidia datacenter products have NVLink and InfiniBand, so Big Corp can run their massive models.
In a recent paper I was reading they were bragging they "only" needed 8 x A100 to train their model.
Renting a 4090 24GB costs around $0.40/hr; renting an A100 80GB costs around $3.40/hr. That's roughly 8x the price for about 3x the memory and 2x the bandwidth.
To fit a 13B model in a 4090 you need to quantize it. Imagine a 130B or 200B model. And never mind training, which needs a lot more memory.
The divide is crazy. If you are not FAANG you are "GPU-poor".
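To put numbers on the quantization point, here's a rough weights-only memory estimate (activations, KV cache, and framework overhead are ignored, so real requirements are higher):

```python
# Back-of-envelope check: can a 13B-parameter model's weights fit in 24 GB?
def weight_gb(params, bits_per_param):
    return params * bits_per_param / 8 / 1e9

p = 13e9
fp16 = weight_gb(p, 16)  # 26 GB: already over a 24 GB 4090, weights alone
int4 = weight_gb(p, 4)   # 6.5 GB: fits, with room left for activations
print(f"fp16: {fp16} GB, int4: {int4} GB")
```

And that's before training, where optimizer state and gradients typically multiply the weight memory several times over.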
> Is there really no way to partition the workload to run with 16gb memory per card?
It depends on the model architecture you are using. Once you cannot fit a single instance of your model on a single GPU, or at minimum on a single node, things start becoming very complicated. If you are lucky and you have a generic transformer model, you can just use DeepSpeed with their transformer kernel. But another architecture will likely not be compatible with DeepSpeed or FairScale or any of the other scaling frameworks, and you will end up having to write your own CUDA kernels.
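For the lucky generic-transformer case, the DeepSpeed path is mostly configuration. A minimal sketch of a ZeRO config as a Python dict (field names follow DeepSpeed's JSON config format; the values here are placeholders, not recommendations):

```python
# Minimal DeepSpeed-style config sketch. ZeRO stage 3 shards parameters,
# gradients, AND optimizer state across GPUs; stages 1/2 shard less.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,     # placeholder value
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                          # 1: opt state, 2: +grads, 3: +params
        "offload_param": {"device": "none"}, # "cpu"/"nvme" to trade speed for RAM
    },
}
# Typically this dict is passed to deepspeed.initialize(model=model,
# config=ds_config, ...) along with your model and optimizer.
```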
It is not only about computation, but also communication bandwidth and memory, especially when training large models. The top consumer GPUs (e.g. 3090/4090) still can't match the H100 here, even when a lot of techniques are applied.
> Is there really no way to partition the workload to run with 16gb memory per card?
It really depends and this can get really complicated really fast. I'll give a tldr and then a longer explanation.
TLDR:
Yes, you can easily split networks up. If your main bottleneck is batch size (i.e. training), then there aren't huge differences in spreading across multiple GPUs, assuming you have good interconnects (GPUDirect is supported). If you're running inference and the model fits on the card, you're probably fine too, unless you need to do things like fancy inference batching (i.e. you have LOTS of users).
Longer version:
You can always split things up. If we think about networks, we recognize some nice properties about how they operate as mathematical groups. Non-residual networks are compositional, meaning each layer can be treated as a sub-network (every residual block can be treated this way too). Additionally, we may have associative and distributive properties depending on the architecture (some even have commutative!). So we can use these same rules to break networks apart in many different ways. There are often performance hits for doing this, though, since in practice it requires touching the disk more often; but in some rarer cases (at least to me, let me know if you know more) splitting can actually help.
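The compositionality point fits in a few lines: a toy non-residual MLP (numpy, made-up weights) split at a layer boundary produces exactly the same output as the whole network:

```python
import numpy as np

# A non-residual network is a composition of layers, so splitting it at any
# layer boundary yields two smaller networks whose composition is the original.
rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((8, 8)) for _ in range(3))

def relu(x):
    return np.maximum(x, 0)

def full_net(x):
    return relu(relu(x @ W1) @ W2) @ W3

def first_half(x):   # could live on GPU 0
    return relu(x @ W1)

def second_half(h):  # could live on GPU 1, fed by GPU 0's output
    return relu(h @ W2) @ W3

x = rng.standard_normal((4, 8))
assert np.allclose(full_net(x), second_half(first_half(x)))
```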
I mentioned batching above, and this can get kind of complicated. There are actual performance differences between batching across GPUs and batching on a single accelerator, and this difference isn't talked about a lot. It comes down to how much your algorithm depends on batching and which operations are used, such as batch norm. Batch norm statistics are calculated over each GPU's batch, not the distributed batch (unless you introduce blocking), so your gradients AND inference are computed differently. In DDP your whole network is cloned across cards: you run the forward/backward on each card, all-reduce the gradients, and apply the same update so the weights stay identical everywhere.

There is an even bigger difference when you use lazy regularization (don't compute gradients for n minibatches). GANs are notorious for using this, and personally I've seen large benefits from distributed training for them. GANs usually have small batch sizes and aren't getting anywhere near the memory limit of the card anyway (GANs are typically unstable, so large batch sizes can harm them). Also pay attention to this when evaluating papers, along with how much hyper-parameter tuning has been done. This is always tricky when comparing works, especially between academia and big labs; you can easily be fooled about which is the better model. Evaluating models is way tougher than people give credit for, especially in the modern era of LLMs (I could rant a lot about just this alone).

Basically, in short, we can think of this as an ensembling method, except our models are actually identical (you could also do the reduce lazily, which creates some periodic divergence between your models, but that's not important for the concept, just worth noting).
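The batch norm point is easy to demonstrate: per-device batch statistics differ from global-batch statistics (toy numpy sketch with two simulated "GPUs"):

```python
import numpy as np

# Without sync batch norm, each device normalizes with statistics from its
# own slice of the batch, so DDP is not numerically identical to single-GPU
# training on the full batch.
rng = np.random.default_rng(1)
batch = rng.standard_normal((8, 3))   # global batch of 8, 3 features
gpu0, gpu1 = batch[:4], batch[4:]     # equal split across two "GPUs"

global_mean = batch.mean(axis=0)
per_gpu_means = [gpu0.mean(axis=0), gpu1.mean(axis=0)]

# With an equal split, the average of per-GPU means matches the global mean...
assert np.allclose(np.mean(per_gpu_means, axis=0), global_mean)
# ...but each device normalizes with its OWN statistics, which differ:
assert not np.allclose(per_gpu_means[0], global_mean)
```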
There are also techniques to split a single model up, called model sharding and checkpointing. Model sharding is where you split a single model across multiple GPUs. You're taking advantage of the compositional property of networks: as long as there isn't a residual connection spanning your split location, you can treat one network as a series of smaller networks. This has obvious drawbacks, since you need to feed one into the next and the operations have to be synchronous, but sometimes that isn't too bad. Checkpointing is very similar, but you're doing the same thing on a single GPU. Your hit here is in I/O, which may or may not be too bad with GPUDirect, and depends heavily on your model size (were you splitting because of batch size or model size?).
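A tiny sketch of the checkpointing idea (toy two-layer linear net in numpy): discard the intermediate activation after the forward pass and recompute it during backward, trading an extra matmul for memory:

```python
import numpy as np

# Gradient checkpointing in miniature: instead of keeping the intermediate
# activation h for the backward pass, keep only the input x and recompute h
# when the gradient is needed.
rng = np.random.default_rng(2)
W1 = rng.standard_normal((4, 4))
W2 = rng.standard_normal((4, 4))
x = rng.standard_normal((1, 4))

# Plain forward: h is stored for reuse in backward.
h = x @ W1
y = h @ W2

# Checkpointed backward: h was discarded, so recompute it from x first.
h_recomputed = x @ W1               # the extra compute we pay for
grad_y = np.ones_like(y)            # stand-in upstream gradient
dW2 = h_recomputed.T @ grad_y       # identical gradient, less peak memory
assert np.allclose(h, h_recomputed)
```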
This is all still pretty high level, but if you want to dig into it more, Meta developed a toolkit called fairseq that will do a lot of this for you, and they've optimized it.