Cool paper. It's more independent than a dense model or a normal MoE, but I think it's still far from the distributed training you're looking for. You still need a seed LM that's trained normally, and when fine-tuning each expert from that seed LM you still need enough GPUs/VRAM to fine-tune the full model, so you're still limited to large GPU clusters, which is exactly the problem we're trying to avoid.
In the paper's case, they use OPT-6.7B as the seed LM, which requires 8xV100 GPUs to fine-tune each expert. That's a combined 256GB of VRAM for a single expert, while a 3090 only has 24GB and is still one of the more expensive consumer GPUs out there.
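For a rough sense of why full fine-tuning doesn't fit on a consumer card, here's a back-of-envelope sketch. It assumes mixed-precision AdamW and the common ~16 bytes/param rule of thumb, and ignores activation memory entirely (which only makes things worse):

```python
def full_finetune_vram_gb(n_params: float) -> float:
    # Rough mixed-precision AdamW accounting per parameter:
    #   2 B fp16 weights + 2 B fp16 grads
    #   + 4 B fp32 master weights + 8 B fp32 Adam moments
    # = ~16 B/param, before any activation memory.
    return n_params * 16 / 1e9

# OPT-6.7B: ~107 GB just for weights + optimizer state,
# versus 24 GB on a single RTX 3090.
print(f"{full_finetune_vram_gb(6.7e9):.0f} GB")
```

So even before activations, one 3090 is short by roughly a factor of four or five.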
Maybe we could combine this technique with something like PEFT or QLoRA to make each expert small enough for the community to fine-tune, and end up with a (worse) Mixtral 8x7B, but I don't know enough to say for sure.
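To illustrate why LoRA-style PEFT could plausibly shrink each expert into consumer-GPU territory, here's a toy parameter count. The 4096 dimensions and rank 16 are just illustrative numbers I picked, not values from the paper:

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    # LoRA trains two low-rank factors A (r x d_in) and B (d_out x r)
    # per weight matrix instead of the full (d_out x d_in) matrix.
    return r * d_in + d_out * r

full = 4096 * 4096                   # one projection: ~16.8M params
lora = lora_params(4096, 4096, 16)   # rank-16 LoRA: 131,072 params
print(f"LoRA trains {lora / full:.1%} of the matrix's params")
```

A sub-1% trainable footprint per matrix is what makes single-GPU expert fine-tuning at least conceivable.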
Or maybe it turns out we can make a good MoE model with thousands of smaller experts, each small enough for a separate member of the community to fine-tune independently on a normal GPU, but idk.
To get an LLM that is both performant and trained from scratch in a distributed way, we'd still need a completely different architecture. But this work is pretty cool, and may mean that, if nothing else, there's something the community can do to help move things forward.
Also, I was going to say the MoE routing in this technique was lacking, but I found a more recent paper[0] by Meta that fixes this with a final fine-tuning stage.
The base model was still trained in the usual, non-distributed way (by far the largest cost).
The fine-tunes were also trained in the usual, non-distributed way.
The proposed approach tries out several combinations and picks the one that seems to perform best (where a "combination" means e.g. an ad hoc per-layer operation).
The merging isn't distributed either.
There isn't much distribution happening overall, beyond the fact that the fine-tunes were trained independently.
Taking weight averages, weighted averages, trimming low diffs, doing weight arithmetic (subtracting the base model from a fine-tune), etc. are all ad hoc trials: throwing things at the wall and seeing what sticks. None of them work particularly well.
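For concreteness, here's a toy sketch of those merging tricks on a single flat "layer" of floats. This is my own illustration, not code from any of these papers; `trim_below` stands in for the "trim low diffs" idea, and the weights give the weighted average:

```python
def task_vector(finetune, base):
    # "Weight arithmetic": subtract the base model from a fine-tune.
    return [f - b for f, b in zip(finetune, base)]

def merge(base, finetunes, weights, trim_below=0.0):
    # Add a weighted average of task vectors back onto the base,
    # optionally dropping diffs whose magnitude is below a threshold.
    merged = list(base)
    for ft, w in zip(finetunes, weights):
        for i, d in enumerate(task_vector(ft, base)):
            if abs(d) >= trim_below:
                merged[i] += w * d
    return merged

base = [1.0, 1.0, 1.0]
ft_a = [1.5, 1.0, 0.8]   # "expert" A
ft_b = [1.0, 2.0, 1.05]  # "expert" B
print(merge(base, [ft_a, ft_b], [0.5, 0.5], trim_below=0.1))
```

Every knob here (the per-expert weights, the trim threshold, doing it per layer or per tensor) is exactly the kind of ad hoc choice being criticized: there's no principled way to set them short of re-evaluating the merged model.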
For distributed training to work, we'd need better algebra around this multidimensional/multilayer/multiconnectivity state. We don't have it, and it has many problems, e.g. evaluation is way too expensive. But solving the "no need to rerun the whole training/benchmark corpus to see if my tiny change is an improvement" problem would mean we've solved the problem of extracting the essence of intelligence. And if we do that, hyper-efficient data centers will still beat any distributed approach, and it's all largely irrelevant, because that's pure AGI already.
[0] https://arxiv.org/abs/2303.14177