Google seems to be doing it all with transformers. It's not open source, though:
> Here we highlight some results of ViT-22B. Note that in the paper we also explore several other problem domains, like video classification, depth estimation, and semantic segmentation.
https://ai.googleblog.com/2023/03/scaling-vision-transformer...