Computer vision seems to be gravitating heavily towards self-attention. While the results here are impressive, I'm not quite convinced that vision encoders are the right way forward. I just can't wrap my head around how discretizing images, which are continuous in two dimensions, into patches is the optimal way to do visual recognition.
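For context, the "discretizing into patches" step I mean is basically this (a minimal sketch; the 224x224 input, 16x16 patches, and 768-dim embedding are just the standard ViT-Base defaults, and the strided conv is the usual shorthand for a shared per-patch linear projection):

```python
import torch
import torch.nn as nn

# Minimal sketch of the ViT "patchify" step (assumed: 224x224 RGB input,
# 16x16 non-overlapping patches, 768-dim token embeddings).
class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel == stride == patch_size is equivalent to slicing
        # the image into patches and linearly projecting each one.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```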
What's preventing us from taking something like ConvNeXt or a hybrid conv/attention model and hooking that up to a decoder stack (rough sketch below)? I feel like the results would be similar if not better.
EDIT: Clarifying that encoder/decoder refers to the transformer stack, not an autoencoder.
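What I'm picturing is roughly a DETR-style hookup: flatten the conv backbone's feature map into tokens and let a transformer decoder stack cross-attend to them. A minimal sketch, assuming torchvision's convnext_tiny as the backbone and placeholder sizes for the query tokens and decoder depth:

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny

# Sketch only, not a tuned recipe: ConvNeXt features become the "memory"
# tokens, and a standard transformer decoder stack cross-attends to them
# from a set of learned query tokens.
class ConvNeXtWithDecoder(nn.Module):
    def __init__(self, d_model=768, num_queries=100, num_layers=6):
        super().__init__()
        self.backbone = convnext_tiny(weights=None).features   # (B, 768, 7, 7) for 224 input
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, images):                          # (B, 3, 224, 224)
        feats = self.backbone(images)                   # (B, 768, 7, 7)
        memory = feats.flatten(2).transpose(1, 2)       # (B, 49, 768) feature "tokens"
        tgt = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        return self.decoder(tgt, memory)                # (B, 100, 768)

out = ConvNeXtWithDecoder()(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 100, 768])
```

DETR does essentially this (plus a transformer encoder in between) with a ResNet backbone, so conv-features-as-tokens is known to work; the open question is whether it keeps up with a pure ViT at scale.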
Google seems to be doing it all with transformers. It's not open source, though:
> Here we highlight some results of ViT-22B. Note that in the paper we also explore several other problem domains, like video classification, depth estimation, and semantic segmentation.
> What's preventing us from taking something like ConvNeXt or a hybrid conv/attention model and hooking that up to a decoder? I feel like the results would be similar if not better.
IMO optimal visual recognition should be sensorimotor-based and video-first. In the real world, action and perception are intertwined. Supervised training on static pixel arrays seems backward and primitive.