Zero-3 Offload: Scale DL models to trillion parameters without code changes

FL33TW00D · on March 13, 2021

Huggingface has been working on implementing this into their library, and it has some pretty amazing effects on the size of models you can train on a simple Colab.

https://huggingface.co/blog/zero-deepspeed-fairscale

stephenroller · on March 13, 2021

Support for this was also added to [Fairscale](https://fairscale.readthedocs.io/en/latest/) and [Fairseq](https://github.com/pytorch/fairseq) last week. In particular, the Fairscale implementation can be used in any pyotrch project without requiring the use of the Deepspeed trainer.

diptanu · on March 13, 2021

What are the relevant commits in Fairseq for this? I couldn't figure out the changes by looking at the commits from last week.

stephenroller · on March 14, 2021

https://github.com/pytorch/fairseq/pull/3331 and https://github.com/pytorch/fairseq/pull/3327

ansk · on March 13, 2021

Question for someone knowledgable about this: if I have a model which is large -- but small enough that I can fit a single training example on GPU -- does this approach offer speedups compared to simple gradient accumulation? Or is this only useful for models which are so large that the model parameters themselves are overwhelming GPU memory?

joshlk · on March 13, 2021

GPT-NeoX is an example project that is using deepspeed and Zero-3 offloading. The wider project intend to train a GPT-3 sized model and release it freely to the world.

https://github.com/EleutherAI/gpt-neox

ma2rten · on March 13, 2021

It seems like Zero-3 doesn't work for them:

https://github.com/EleutherAI/gpt-neox/issues/171

stellaathena · on March 13, 2021

Hi! I’m the one who wrote this code. My ZeRO-3 implementation is currently not working, but I’ve spoken with DeepSpeed devs and they’ve explained to me what I’ve been doing wrong. I haven’t had time to implement the fix but I don’t see any reason to assume it won’t work.

https://github.com/microsoft/DeepSpeed/issues/846

Also, the specific problem described in that Issue was due to a bug I found in DeepSpeed that has since been corrected.

joshlk · on March 13, 2021

Looks like they got it working recently https://github.com/EleutherAI/gpt-neox/pull/178

dqpb · on March 13, 2021

Did you even read through the issue? I don't see anything that indicates it won't work.

ma2rten · on March 13, 2021

Yes, I did. The last comment is a traceback and an explanation what would have to be done to fix it.

minimaxir · on March 13, 2021

Your comment implied it's not possible at all for them to use it, not that it's currently not working.

ma2rten · on March 14, 2021

I guess an ambiguity in the English language is at fault here. I meant it literally doesn't work for them, and not the idiom.

dataangel · on March 13, 2021

ELI5? All this techno babble just sounds like "it's faster because we optimized it". What are the nontrivial, new fundamental tricks?

jonbaer · on March 13, 2021

I think there is some explanation (on the previous model?) here, https://www.youtube.com/watch?v=tC01FRB0M7w

jiofih · on March 13, 2021

Third paragraph or so in the overview:

> ZeRO removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency

dataangel · on March 13, 2021

Yeah that would be the techno-babble. I've been working on a machine learning pipeline for 6 years and I still have no idea what this means.

liuliu · on March 13, 2021

It is mostly applicable to transformer models, the ideas in the paper would be alien if you work on computer vision.

In transformer models, big chunk of memory was parameters, and states for optimizers (because vanilla SGD not used there). The memory optimization technique that removes parameters duplication on each GPU or offload entirely to CPU makes sense.

In computer vision, big chunk of memory was hold by forward layer activations and the memory optimization technique applicable in these cases would be binomial checkpointing.

zachthewf · on March 13, 2021

It doesn't sound like techno-babble to me. They've distributed storage across nodes rather than replicating on each node, hence the model size is now scalable with number of nodes rather than being limited to what could be stored on a single node.

p1esk · on March 13, 2021

But it's not clear how they managed to improve training on a single GPU: they say they can fit 40B model on a single V100.

liuliu · on March 13, 2021

They offload parameters, gradients and optimizer states (such as moment, velocity and exponential avg of these in Adam) into CPU memory.

p1esk · on March 13, 2021

They did all that before: https://arxiv.org/abs/2101.06840, but they could only fit a model with 13B weights on a single V100.

leogao · on March 13, 2021

I have a reasonable amount of experience with distributed machine learning (and transformers in particular, too) and I have to 100% agree that this blog post (and even the ZeRO paper) is largely technobabble. I don't doubt that this might really work, but how it works is not elucidated very well, and I'm still not 100% sure I understand what they actually did.

For anyone who still thinks the blog post has substance: Saying that they partitioned the optimizer state, params, etc to have no redundancies is kind of "duh," sort of like saying "we solved the problem using coding and algorithms." It's obvious that we want to eliminate redundancies to maximize the effective VRAM; it's not like nobody thought of not having redundancies before. The problem is that in general, training models distributed is a weird balancing act between redundancy, network usage, compute, etc. The existing methods, model/pipeline/data parallel, gradient checkpoint/accum, etc all have their pros and cons. Unless ZeRO3 is doing something crazy, it has to be giving something up to get to zero redundancy, and knowing what that is would be very important.

If someone could ELI5 how ZeRO actually works, that would be nice.

BadInformatics · on March 14, 2021

Having followed this DeepSpeed stuff for a little while, the ZeRO paper is probably as close as you can get to an ELI5 because there's no singular brilliant idea behind this. Most of the ideas have been explored already (see e.g. the PyTorch DDP paper), but ZeRO takes them to their logical conclusion by throwing a TON of engineering work into the equation. For example, they implement custom fused kernels on CPU/GPU and a hand-vectorized Adam implementation.

I found that this earlier blog post [2] has a much better deep dive (with decent animations and more) into the underlying architecture. The ZeRO-Offload paper [3] also has far more detail about that part of the pipeline.

[1] https://arxiv.org/abs/2006.15704 [2] https://www.microsoft.com/en-us/research/blog/deepspeed-extr... [3] https://arxiv.org/abs/2101.06840

leogao · on March 14, 2021

My impression from reading the paper is most of the other optimizations (custom kernels, contiguous memory, checkpointing, etc) are orthogonal to the partitioning stuff. That seems to imply that ZeRO is model+pipeline parallel plus a bunch of miscellaneous bits. But they seem to emphasize that this isn't what their partitioning is, and that's the part that perplexes me the most. To be specific, I'd like someone to explain how their magical zero-redundancy data parallel (termed ZeRO-DP in the paper) works and how it's different from from model+pipeline parallel, and their paper is awfully sparse on that.

BadInformatics · on March 14, 2021

> My impression from reading the paper is most of the other optimizations (custom kernels, contiguous memory, checkpointing, etc) are orthogonal to the partitioning stuff

This is true, I include them as examples of the amount of engineering work involved because using the partitioning as an example would require recapitulating their blog post :)

> But they seem to emphasize that this isn't what their partitioning is, and that's the part that perplexes me the most. To be specific, I'd like someone to explain how their magical zero-redundancy data parallel (termed ZeRO-DP in the paper) works and how it's different from from model+pipeline parallel, and their paper is awfully sparse on that.

Again, https://www.microsoft.com/en-us/research/blog/deepspeed-extr... is a much better resource on this. There really isn't any magic going on, nor are many of these ideas (checkpointing, model state sharding, bucketing, JIT communication of new states interleaved with compute, etc.) new when considered in isolation. ZeRO is data + model + pipeline parallel, but optimized to the nines and actually usable as a production library.

jiofih · on March 13, 2021

You can read the paper here: https://arxiv.org/abs/1910.02054

eugenhotaj · on March 13, 2021

If your pipeline uses only “classic” ml models, then this won’t make too much sense. It’s mostly applicable to NNs.

cambalache · on March 13, 2021

The product is obviously not for you but for clueless PHBs who want the "latest and best" for the team so those useless ML engineers can finally put his brilliant idea in production with a less than 1% prediction error.

bevenky · on March 13, 2021

This is also being added to pytorch

https://github.com/pytorch/pytorch/pull/46750

minimaxir · on March 13, 2021

I don't think that's the Stage 3 announced in this blog post, but it's def a framework for it.

alphagrep12345 · on March 13, 2021

Simple 10 min overview/tutorial (official) if someone is interested - https://www.youtube.com/watch?v=ovQC7FqXHXk

The_rationalist · on March 13, 2021

See also zeroth order backpropagation which allows 300X faster training while not reducing throughput that much https://arxiv.org/abs/2011.08895 How much zero-3 affect accuracy?

See also https://github.com/microsoft/fastformers

vladf · on March 13, 2021

Alternatively, one could get rid of the memory used by optimizers entirely by switching to vanilla SGD.

I haven’t tried this on transformers and maybe that’s what breaks down here but in “classic” supervised settings I’ve found SGD with schedule tuning just as fast as Adam.

gwern · on March 13, 2021

SGD doesn't work on large Transformers, no. You need something like AdamW.

The_rationalist · on March 13, 2021

Mish is generally superior to RadamW https://lessw.medium.com/meet-mish-new-state-of-the-art-ai-a...

dron57 · on March 14, 2021

That's cool, but Mish is an activation function while SGD and AdamW are optimizers. Apples and oranges.

andrewprock · on March 13, 2021

How much data do you need to mitigate the risk of over fitting a trillion parameter model?

gwern · on March 13, 2021

You ideally need ~500GB of text, or so. EleutherAI's The Pile was designed to be just big enough to fit a 1t GPT efficiently, and you can get the various scaling curves out of the OA-related scaling papers. (You want the amount of data that fits into a single epoch, because if you reuse data, you get less bang for the FLOPs buck, and FLOPS constraints are right now much more binding than data or model size.)

andrewprock · on March 13, 2021

This feels off by a couple of orders of magnitude, unless a significant number of the parameters are not independent.

singhrac · on March 13, 2021

Well, that's the "magic" of modern deep learning. You can fit models with p > n somehow without overfitting. In some areas you might find this called "the strong inductive bias of neural networks" or "double descent" but no one has found a convincing explanation (to me).

gwern · on March 13, 2021

It's quite amusing. The standard statistical theory does not work at all in estimating data vs model size, and the bounds are all vacuously large. It's a very active area of research, understanding why models act so simple when overparameterized and coming up with real measures of model complexity. Lots to read there if you are interested in such things.

andrewprock · on March 14, 2021

That just means that the parameters are not independent.

gwern · on March 14, 2021

But you can fit randomly-generated labels!

andrewprock · on March 16, 2021

That's not in any way surprising. When you have more parameters than data, this is trivial.

singhrac · on March 13, 2021

For those searching, DeepSpeed is implemented as a set of C++/CUDA extensions on top of PyTorch (compiled using their JIT).

bionhoward · on March 13, 2021

please hook this up to Jax!

mchusma · on March 13, 2021

This is super impressive. I could not figure out for a while who exactly was running this project, but it looks like its Microsoft. Great work!