
Computer vision seems to be gravitating heavily towards self-attention. While the results here are impressive, I'm not quite convinced that vision encoders are the right way forward. I just can't wrap my head around how discretizing images, which are continuous in two dimensions, into patches is the optimal way to do visual recognition.
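
For reference, the "discretizing" step in a standard ViT is essentially just a strided convolution over the image. A minimal PyTorch sketch, not the exact ViT code (shapes are for a 224x224 input):

    import torch
    import torch.nn as nn

    # ViT-style patch embedding: a 16x16 strided conv turns the continuous
    # 2D image into a sequence of discrete patch tokens.
    class PatchEmbed(nn.Module):
        def __init__(self, patch=16, in_ch=3, dim=768):
            super().__init__()
            self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

        def forward(self, x):                        # x: (B, 3, H, W)
            x = self.proj(x)                         # (B, dim, H/16, W/16)
            return x.flatten(2).transpose(1, 2)      # (B, num_patches, dim)

    tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))   # (1, 196, 768)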

What's preventing us from taking something like convnext or a hybrid conv/attention model and hooking that up to a decoder stack? I feel like the results would be similar if not better (rough sketch below).

EDIT: Clarifying that encoder/decoder refers to the transformer stack, not an autoencoder.
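
To make the suggestion concrete, here's a rough, untested sketch of what I mean, using torchvision's ConvNeXt as the backbone and a standard transformer decoder with learned queries that cross-attend to the conv features, DETR-style. The hyperparameters are arbitrary:

    import torch
    import torch.nn as nn
    from torchvision.models import convnext_tiny

    # Conv features act as the "memory" that a transformer decoder stack
    # cross-attends to, instead of feeding raw patch tokens to an encoder.
    backbone = convnext_tiny(weights=None).features             # (B, 768, 7, 7) for 224x224 input
    layer = nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True)
    decoder = nn.TransformerDecoder(layer, num_layers=6)
    queries = nn.Parameter(torch.randn(1, 100, 768))            # learned queries

    x = torch.randn(2, 3, 224, 224)
    mem = backbone(x).flatten(2).transpose(1, 2)                # (2, 49, 768) conv feature tokens
    out = decoder(queries.expand(2, -1, -1), mem)               # (2, 100, 768)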



Google seems to be doing it all with transformers. It's not open source, though:

> Here we highlight some results of ViT-22B. Note that in the paper we also explore several other problem domains, like video classification, depth estimation, and semantic segmentation.

https://ai.googleblog.com/2023/03/scaling-vision-transformer...


> What's preventing us from taking something like convnext or a hybrid conv/attention model and hooking that up to a decoder? I feel like the results would be similar if not better.

You mean like in a U-Net architecture?


IMO optimal visual recognition should be sensorimotor-based and video-first. In the real world, action and perception are intertwined. Supervised training on static pixel arrays seems backward and primitive.


MLP-Mixer uses only multi-layer perceptrons. It was released in 2021, but ReBotNet, released this year, uses mixer layers too.

It still uses patches, though the mixer layers do mix data between patches.
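
The mixing between patches is a pretty small trick. A rough sketch of one Mixer block (sizes are arbitrary, not the paper's exact config):

    import torch
    import torch.nn as nn

    # One MLP-Mixer block: an MLP applied across the patch (token) axis mixes
    # information between patches, then an MLP across the channel axis.
    class MixerBlock(nn.Module):
        def __init__(self, num_patches=196, dim=512, token_hidden=256, chan_hidden=2048):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.token_mlp = nn.Sequential(
                nn.Linear(num_patches, token_hidden), nn.GELU(),
                nn.Linear(token_hidden, num_patches))
            self.norm2 = nn.LayerNorm(dim)
            self.chan_mlp = nn.Sequential(
                nn.Linear(dim, chan_hidden), nn.GELU(),
                nn.Linear(chan_hidden, dim))

        def forward(self, x):                                   # x: (B, patches, dim)
            x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
            x = x + self.chan_mlp(self.norm2(x))
            return x

    y = MixerBlock()(torch.randn(1, 196, 512))                  # (1, 196, 512)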



