I'm guessing this model was kept small simply to hold costs down and make training feasible with the time and effort they had. But to some extent I'm left wondering whether this technique would remain fruitful when scaled up to a huge model with a bigger initial training set.
It was made to be small out of necessity. The US government put extensive export controls on many inter-GPU connectivity products last year and expanded those controls recently to include anything above an A100.
Page 9 of this recently published paper[1] is a strong indicator of how far non-US firms go to formally analyze these bandwidth constraints and factor them into the design of large models.
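To give a sense of why interconnect bandwidth shows up so directly in these design decisions, here's a rough back-of-envelope sketch of gradient synchronization cost in data-parallel training. All numbers (model size, worker count, bandwidth figures) are illustrative round values I picked, not taken from the paper:

    # Rough, illustrative estimate of gradient all-reduce time.
    # All numbers are hypothetical round figures, not from the paper.

    def allreduce_seconds(params_billion, bytes_per_param, bandwidth_gbps, workers):
        """Approximate time for one ring all-reduce over the gradients."""
        grad_bytes = params_billion * 1e9 * bytes_per_param
        # A ring all-reduce sends roughly 2*(N-1)/N of the data per link.
        traffic = 2 * (workers - 1) / workers * grad_bytes
        return traffic / (bandwidth_gbps * 1e9)

    # Hypothetical 70B-parameter model, fp16 gradients, 8 workers,
    # comparing a fast NVLink-class link with a much slower interconnect.
    for bw in (600, 64):  # GB/s, assumed values
        t = allreduce_seconds(70, 2, bw, 8)
        print(f"{bw:>4} GB/s -> {t:.2f} s per gradient sync")

Under these assumptions the sync step goes from well under a second to several seconds per step, which is the kind of gap that pushes designers toward smaller models or communication-light training schemes when fast interconnects aren't available.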
Ah, that makes sense. Is it possible for those researchers to just rent cloud compute, or is that also prohibited? The obvious thought would be to find some cheap cloud GPU provider and do the training on their platform. But maybe they're more concerned about inference afterwards, in which case that doesn't really solve their problem.