Are transformers not already very specialized to the task of learning from sequences of word vectors? I'm sure there is more that can be done with them other than making the input sequences really long, but my point was that LLMs are hardly lacking in design specialized to their purpose.
> Are transformers not already very specialized to the task of learning from sequences of word vectors?
No, you can use transformers for vision, image generation, audio generation/recognition, etc.
They are 'specialized' in that they work with sequences of data, but almost everything can be nicely encoded as a sequence. To input images, for example, you typically split the image into fixed-size patches and use a linear projection (or a small CNN) to produce a token embedding for each patch. Then you concatenate the tokens and feed the sequence into a transformer.
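For the curious, here's a minimal sketch of that patch-to-token step in numpy. The ViT-style variant just flattens each patch and applies a learned linear projection; the dimensions and the random stand-in for the learned weights here are made up for illustration.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an HxWxC image into non-overlapping patch_size x patch_size
    blocks, each flattened into a single vector (one vector per patch)."""
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size
    return (
        image[: ph * patch_size, : pw * patch_size]
        .reshape(ph, patch_size, pw, patch_size, c)
        .transpose(0, 2, 1, 3, 4)              # group pixels by patch
        .reshape(ph * pw, patch_size * patch_size * c)
    )

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))           # toy 32x32 RGB "image"
patches = patchify(image, patch_size=8)        # 16 patches, 8*8*3 = 192 values each
embed = rng.normal(size=(192, 64))             # stand-in for a learned projection
tokens = patches @ embed                       # sequence of 16 tokens of dim 64
print(tokens.shape)                            # (16, 64) -- ready for a transformer
```

From the transformer's point of view, those 16 vectors are no different from 16 word embeddings, which is the whole trick.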
That's fair, and I didn't realize people were using them for images that way. I would still argue that they are at least somewhat more specialized than plain fully-connected layers, much like convolutional layers.
It is definitely interesting that we can do so much with a relatively small number of generic "primitive" components in these big models, but I suppose that's part of the point.