So is this GPT for images? They take a generative model, apply LoRA finetuning on a downstream task such as surface normal estimation, conclude that these models intrinsically learn such representations, and find that they outperform supervised approaches?
I think this is awesome, but maybe it’s not that surprising, given how well this “pretrain generatively, then finetune” approach has already worked elsewhere?
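For anyone unfamiliar with the LoRA part: instead of updating the full pretrained weights, you freeze them and train a small low-rank additive update. A minimal numpy sketch of the idea (shapes and names are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 8, 8, 2
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def lora_forward(x):
    # Base model output plus the low-rank adapter's contribution.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
# With B zero-initialized, the adapter starts as a no-op.
assert np.allclose(lora_forward(x), W @ x)

# After training, B @ A is at most a rank-2 update to W,
# so very few parameters are actually learned.
B = rng.standard_normal((d_out, rank))
assert np.linalg.matrix_rank(B @ A) <= rank
```

The zero-init on `B` is why finetuning starts exactly from the pretrained model's behavior, which is part of why this recipe is so cheap and stable.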