Computer vision seems to be gravitating heavily towards self-attention. While the results here are impressive, I'm not quite convinced that vision encoders are the right way forward. I just can't wrap my head around how discretizing images, which are continuous in two dimensions, into patches is the optimal way to do visual recognition.
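For context, the "discretizing into patches" step I mean is basically this (a minimal sketch; the 224x224 input, 16x16 patches, and 768-dim embedding are just the standard ViT-Base defaults, and the strided conv is the usual shorthand for a shared per-patch linear projection):

```python
import torch
import torch.nn as nn

# Minimal sketch of the ViT "patchify" step (assumed: 224x224 RGB input,
# 16x16 non-overlapping patches, 768-dim token embeddings).
class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel == stride == patch_size is equivalent to slicing
        # the image into patches and linearly projecting each one.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```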
What's preventing us from taking something like ConvNeXt or a hybrid conv/attention model and hooking that up to a decoder stack (rough sketch below)? I feel like the results would be similar if not better.
EDIT: Clarifying that encoder/decoder refers to the transformer stack, not an autoencoder.
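What I'm picturing is roughly a DETR-style hookup: flatten the conv backbone's feature map into tokens and let a transformer decoder stack cross-attend to them. A minimal sketch, assuming torchvision's convnext_tiny as the backbone and placeholder sizes for the query tokens and decoder depth:

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny

# Sketch only, not a tuned recipe: ConvNeXt features become the "memory"
# tokens, and a standard transformer decoder stack cross-attends to them
# from a set of learned query tokens.
class ConvNeXtWithDecoder(nn.Module):
    def __init__(self, d_model=768, num_queries=100, num_layers=6):
        super().__init__()
        self.backbone = convnext_tiny(weights=None).features   # (B, 768, 7, 7) for 224 input
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, images):                          # (B, 3, 224, 224)
        feats = self.backbone(images)                   # (B, 768, 7, 7)
        memory = feats.flatten(2).transpose(1, 2)       # (B, 49, 768) feature "tokens"
        tgt = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        return self.decoder(tgt, memory)                # (B, 100, 768)

out = ConvNeXtWithDecoder()(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 100, 768])
```

DETR does essentially this (plus a transformer encoder in between) with a ResNet backbone, so conv-features-as-tokens is known to work; the open question is whether it keeps up with a pure ViT at scale.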
Google seems to be doing it all with transformers. It's not open source, though:
> Here we highlight some results of ViT-22B. Note that in the paper we also explore several other problem domains, like video classification, depth estimation, and semantic segmentation.
> What's preventing us from taking something like ConvNeXt or a hybrid conv/attention model and hooking that up to a decoder? I feel like the results would be similar if not better.
IMO optimal visual recognition should be sensorimotor-based and video-first. In the real world, action and perception are intertwined. Supervised training on static pixel arrays seems backward and primitive.