In computer vision, transformers have basically taken over most perception tasks. If you look at the paperswithcode benchmarks, it's common to find 10/10 recent winners being transformer-based on common CV problems. Note, I'm not talking about VLMs here, just small ViTs with a few million parameters. YOLOs and other CNNs are still hanging around for detection, but it's only a matter of time.
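For scale, the "few million parameters" figure checks out with a quick back-of-the-envelope count. The sketch below is a hypothetical helper using ViT-Tiny-ish hyperparameters (d=192, 12 layers, 16px patches); it ignores biases and LayerNorms, so treat the output as approximate.

```python
# Rough parameter count for a ViT-Tiny-style model, to back up the
# "few million parameters" claim. Biases and LayerNorms are ignored.

def vit_params(d: int, layers: int, patch: int = 16,
               img: int = 224, channels: int = 3, classes: int = 1000) -> int:
    n_patches = (img // patch) ** 2             # 14 * 14 = 196 tokens
    patch_embed = patch * patch * channels * d  # linear patch projection
    pos_embed = (n_patches + 1) * d             # +1 for the [CLS] token
    per_block = 4 * d * d + 8 * d * d           # attention (Q/K/V/O) + 4x MLP
    head = d * classes                          # classification head
    return patch_embed + pos_embed + layers * per_block + head

print(vit_params(d=192, layers=12))  # ~5.7M, i.e. ViT-Tiny territory
```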
Could it be that transformer-based solutions come from well-funded organizations that can spend vast amounts of money on training expensive (O(n²) in sequence length) models?
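To pin down where that cost sits: standard self-attention is quadratic in the token count n (the n² terms below), with the projections adding an O(n·d²) term. This is a back-of-the-envelope sketch; attention_flops is a hypothetical helper and the dimensions are illustrative, not taken from any particular model.

```python
# Approximate FLOPs for one self-attention layer, showing the
# quadratic scaling in token count n.

def attention_flops(n: int, d: int) -> int:
    """n: number of tokens (e.g. image patches in a ViT); d: embedding dim."""
    qkv_proj = 3 * n * d * d  # Q, K, V projections: O(n * d^2)
    scores   = n * n * d      # Q @ K^T: O(n^2 * d)
    weighted = n * n * d      # softmax(scores) @ V: O(n^2 * d)
    out_proj = n * d * d      # output projection
    return 2 * (qkv_proj + scores + weighted + out_proj)  # 2 FLOPs per MAC

# Doubling the token count roughly quadruples the n^2 terms:
for n in (196, 392, 784):  # 196 = 14x14 patches for a 224px input
    print(n, attention_flops(n, d=384))
```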
Are there any papers that compare predictive power against compute needed?
You're onto something. The BabyLM competition had caps on training data. Many LLMs were using 1TB of training data for some time.
In many cases, I can't even see how many GPU hours the pretraining required, or what size cluster of which GPUs. If I can't afford it, it doesn't matter what it achieved; what I can afford is what I have to choose from.