Linux graphics you mean? The compute ones are fine.
It's not so much an issue of supporting old versions; it's that new versions of Linux break the drivers, since the kernel developers don't know what's in them.
But such servers have no need to, for example, suspend and resume correctly. Or handle hot-plugging of displays correctly. Or install updates in a completely reliable manner. Or include the 32 bit support needed for Steam to work.
Pardon me for the stupid question, but wouldn’t you get much better bang for the buck with high-end consumer GPUs from Nvidia? I think you’d be limited to 16GB of memory per card, but you could have a full 8-card system for less than the price of one H100. Is there really no way to partition the workload to run with 16GB memory per card?
> but wouldn’t you get much better bang for the buck with high end consumer GPUs from Nvidia
It depends on the workload, but generally for large neural network model tasks, no, while for most other accelerated tasks, yes.
> Is there really no way to partition the workload to run with 16gb memory per card?
With regards to the neural networks in the news, you can partition the workloads; the 8xH100 machines are partitioned in the same way. Nonetheless, performance is very poor on consumer cards, because they lack the very high speed interconnect of the H100 that makes partitioning performant. Additionally, the amount of RAM on the H100 is matched to the capabilities of the chip itself (it sits at the peak of an upside-down-U performance curve), which is why it is not possible to "just" add more RAM to these chips.
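To get a feel for why the interconnect dominates, here's a back-of-envelope estimate of how long one gradient all-reduce takes over NVLink vs plain PCIe. The bandwidth figures are ballpark assumptions, not measured specs:

```python
# Rough, illustrative estimate of per-step all-reduce time in data-parallel
# training. Bandwidth numbers below are ballpark assumptions.
def allreduce_seconds(param_count, bytes_per_param, bus_gb_s):
    # A ring all-reduce moves roughly 2x the gradient volume over the bus.
    volume_bytes = 2 * param_count * bytes_per_param
    return volume_bytes / (bus_gb_s * 1e9)

params = 13e9  # a 13B-parameter model, fp16 gradients (2 bytes each)
nvlink = allreduce_seconds(params, 2, 900)  # ~900 GB/s NVLink-class (assumed)
pcie   = allreduce_seconds(params, 2, 32)   # ~32 GB/s PCIe 4.0 x16 (assumed)
print(f"NVLink: {nvlink:.3f}s  PCIe: {pcie:.3f}s per sync")
```

With these assumed numbers the PCIe sync is nearly 30x slower per step, which is the kind of gap that swamps any per-card compute advantage.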
Since you seem to know and I'm having a rough time finding good info, I'd love to put together a mid range 3090 box to mess around with LLMs. Do you have any pointers to a good build list?
There's a EULA clause that prohibits using consumer-grade GPUs in servers or renting them out (can't remember the exact rule, but it's definitely not allowed for Amazon or others to use them).
Haha so as long as you append some light proof of work to every task, you get them on a technicality?
Honestly though, who the fuck are they to tell you what to do with the hardware that you bought? Like telling people they are not allowed to drive their car on gravel roads or not being allowed to drink wine from a beer glass without paying for a special overpriced version or something.
Usually it involves warranty issues. Although in this case, it seems especially odd since this clause was added to effectively force data centers to buy their more expensive hardware. It goes back a while (~2018), so you'd think they would have revised this policy:
It mostly just means you shouldn't expect them to provide any support for that use case. It isn't all that uncommon for smaller studios to just cluster together a couple of computers with 3-4 high end gaming GPUs each as a fairly powerful render farm, for instance.
Why the #-@-#&# do they allow blockchains??? I guess they probably figured out it's good for the bottom line and all, but it's pretty obnoxious to allow blockchains, which seem to have been the cause of the shortages, while banning everything else.
After ETH switched from PoW to PoS, it had the effect of decimating the entire GPU mining market. Very few GPUs are being used for crypto any more.
If there are shortages, these days it is because of ML/AI/Gaming.
In fact, there are shortages... did you know it is nearly impossible to get data center space of any size for any decent price? AI has eaten it all up.
On top of it, AI doesn't care about the environment. There is so much money being dumped into it right now, that people are pulling power from wherever they can get it and at any cost.
On the bright side, at least crypto tried to go the cheapest routes... which happened to be mostly green based power.
Realistically you can also build a server with 8 RTX 3090s and put it in colo in a datacenter. That would technically violate the license, but nobody is going to go after you for that. What Nvidia wants to prevent is hosting providers/IaaS renting out servers with consumer GPUs.
GPU cores are powerful, but you need to feed them data at very high speeds, and large models require passing around a lot of data. This is why Nvidia datacenter products have NVLink and InfiniBand, so Big Corp can run their massive models.
In a recent paper I was reading they were bragging they "only" needed 8 x A100 to train their model.
Renting a 4090 24GB costs around $0.40/hr; renting an A100 80GB costs around $3.40/hr. That's roughly 8x the price for about 3x the memory and 2x the bandwidth.
To fit a 13B model in a 4090 you need to quantize it. Imagine a 130B or 200B model. And never mind training, which needs a lot more memory.
The divide is crazy. If you are not FAANG you are "GPU-poor".
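To put numbers on the quantization point, here's a rough weights-only memory estimate (activations, KV cache, and framework overhead are ignored, so real requirements are higher):

```python
# Back-of-envelope check: can a 13B-parameter model's weights fit in 24 GB?
def weight_gb(params, bits_per_param):
    return params * bits_per_param / 8 / 1e9

p = 13e9
fp16 = weight_gb(p, 16)  # 26 GB: already over a 24 GB 4090, weights alone
int4 = weight_gb(p, 4)   # 6.5 GB: fits, with room left for activations
print(f"fp16: {fp16} GB, int4: {int4} GB")
```

And that's before training, where optimizer state and gradients typically multiply the weight memory several times over.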
> Is there really no way to partition the workload to run with 16gb memory per card?
It depends on the model architecture you are using. Once you cannot fit a single instance of your model on a single GPU, or at minimum on a single node, things start becoming very complicated. If you are lucky and you have a generic transformer model, you can just use DeepSpeed with their transformer kernel. But another architecture will likely not be compatible with DeepSpeed or FairScale or any of the other scaling frameworks, and you will end up having to write your own CUDA kernels.
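For the lucky generic-transformer case, the DeepSpeed path is mostly configuration. A minimal sketch of a ZeRO config as a Python dict (field names follow DeepSpeed's JSON config format; the values here are placeholders, not recommendations):

```python
# Minimal DeepSpeed-style config sketch. ZeRO stage 3 shards parameters,
# gradients, AND optimizer state across GPUs; stages 1/2 shard less.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,     # placeholder value
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                          # 1: opt state, 2: +grads, 3: +params
        "offload_param": {"device": "none"}, # "cpu"/"nvme" to trade speed for RAM
    },
}
# Typically this dict is passed to deepspeed.initialize(model=model,
# config=ds_config, ...) along with your model and optimizer.
```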
It is not only about computation, but also communication bandwidth and memory, especially when training large models. The top consumer GPUs (e.g. 3090/4090) still can't match the H100 here, even when a lot of techniques are applied.
> Is there really no way to partition the workload to run with 16gb memory per card?
It really depends and this can get really complicated really fast. I'll give a tldr and then a longer explanation.
TLDR:
Yes, you can easily split networks up. If your main bottleneck is batch size (i.e. training), then there aren't huge differences in spreading across multiple GPUs, assuming you have good interconnects (GPUDirect is supported). If you're running inference and the model fits on the card, you're probably fine too, unless you need to do things like fancy inference batching (i.e. you have LOTS of users).
Longer version:
You can always split things up. If we think about networks, we recognize some nice properties about how they operate as mathematical groups. Non-residual networks are compositional, meaning each layer can be treated as a sub-network (every residual block can be treated this way too). Additionally, we may have associative and distributive properties depending on the architecture (some even have commutative!). So we can use these same rules to break networks apart in many different ways. There are often performance hits for doing this, though, since in practice it requires touching the disk more often; but in some rarer cases (at least to me, let me know if you know more) splitting can actually help.
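The compositionality point fits in a few lines: a toy non-residual MLP (numpy, made-up weights) split at a layer boundary produces exactly the same output as the whole network:

```python
import numpy as np

# A non-residual network is a composition of layers, so splitting it at any
# layer boundary yields two smaller networks whose composition is the original.
rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((8, 8)) for _ in range(3))

def relu(x):
    return np.maximum(x, 0)

def full_net(x):
    return relu(relu(x @ W1) @ W2) @ W3

def first_half(x):   # could live on GPU 0
    return relu(x @ W1)

def second_half(h):  # could live on GPU 1, fed by GPU 0's output
    return relu(h @ W2) @ W3

x = rng.standard_normal((4, 8))
assert np.allclose(full_net(x), second_half(first_half(x)))
```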
I mentioned batching above, and this can get kind of complicated. There are actual performance differences between batching across GPUs and batching on a single accelerator, and this difference isn't talked about a lot. It comes down to how much your algorithm depends on batching and which operations are used, such as batch norm. Batch norm statistics are calculated over each GPU's batch, not the distributed batch (unless you introduce blocking), so your gradients AND inference are computed differently. In DDP your whole network is cloned across cards: you run the forward/backward on each card, all-reduce the gradients, and apply the same update so the weights stay identical everywhere.

There is an even bigger difference when you use lazy regularization (don't compute gradients for n minibatches). GANs are notorious for using this, and personally I've seen large benefits from distributed training for them. GANs usually have small batch sizes and aren't getting anywhere near the memory limit of the card anyway (GANs are typically unstable, so large batch sizes can harm them). Also pay attention to this when evaluating papers, along with how much hyper-parameter tuning has been done. This is always tricky when comparing works, especially between academia and big labs; you can easily be fooled about which is the better model. Evaluating models is way tougher than people give credit for, especially in the modern era of LLMs (I could rant a lot about just this alone).

Basically, in short, we can think of this as an ensembling method, except our models are actually identical (you could also do the reduce lazily, which creates some periodic divergence between your models, but that's not important for the concept, just worth noting).
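The batch norm point is easy to demonstrate: per-device batch statistics differ from global-batch statistics (toy numpy sketch with two simulated "GPUs"):

```python
import numpy as np

# Without sync batch norm, each device normalizes with statistics from its
# own slice of the batch, so DDP is not numerically identical to single-GPU
# training on the full batch.
rng = np.random.default_rng(1)
batch = rng.standard_normal((8, 3))   # global batch of 8, 3 features
gpu0, gpu1 = batch[:4], batch[4:]     # equal split across two "GPUs"

global_mean = batch.mean(axis=0)
per_gpu_means = [gpu0.mean(axis=0), gpu1.mean(axis=0)]

# With an equal split, the average of per-GPU means matches the global mean...
assert np.allclose(np.mean(per_gpu_means, axis=0), global_mean)
# ...but each device normalizes with its OWN statistics, which differ:
assert not np.allclose(per_gpu_means[0], global_mean)
```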
There are also techniques to split a single model up, called model sharding and checkpointing. Model sharding is where you split a single model across multiple GPUs. You're taking advantage of the compositional property of networks: as long as there isn't a residual connection spanning your split location, you can treat one network as a series of smaller networks. This has obvious drawbacks, since you need to feed one into the next and the operations have to be synchronous, but sometimes that isn't too bad. Checkpointing is very similar, but you're doing the same thing on a single GPU. Your hit here is in I/O, which may or may not be too bad with GPUDirect, and depends heavily on your model size (were you splitting because of batch size or model size?).
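A tiny sketch of the checkpointing idea (toy two-layer linear net in numpy): discard the intermediate activation after the forward pass and recompute it during backward, trading an extra matmul for memory:

```python
import numpy as np

# Gradient checkpointing in miniature: instead of keeping the intermediate
# activation h for the backward pass, keep only the input x and recompute h
# when the gradient is needed.
rng = np.random.default_rng(2)
W1 = rng.standard_normal((4, 4))
W2 = rng.standard_normal((4, 4))
x = rng.standard_normal((1, 4))

# Plain forward: h is stored for reuse in backward.
h = x @ W1
y = h @ W2

# Checkpointed backward: h was discarded, so recompute it from x first.
h_recomputed = x @ W1               # the extra compute we pay for
grad_y = np.ones_like(y)            # stand-in upstream gradient
dW2 = h_recomputed.T @ grad_y       # identical gradient, less peak memory
assert np.allclose(h, h_recomputed)
```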
This is all still pretty high level, but if you want to dig into it more, Meta developed a toolkit called fairseq that will do a lot of this for you, and they've optimized it.