> it's a transformer based natural language model just like GPT
It's an encoder-decoder model whereas GPT is decoder-only. Feels like a pretty big difference, though in practice I honestly still don't have a strong grasp of how encoder-decoder falls short of decoder-only when it comes to text generation. I get that BERT was designed for translation, but why can't we scale it up and use it for text generation just the same?
BERT is encoder only and was designed for classification and natural language inference problems. The original Transformer was encoder-decoder and was designed for translation.
BERT can't be used in an autoregressive way because it doesn't output a new token; it simply produces embeddings for the existing tokens (one for each input token).
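To make that concrete, here's a rough sketch using the Hugging Face transformers library (the library and the `bert-base-uncased` / `gpt2` checkpoints are just my choices for illustration, not something from the thread): the base BertModel hands back one embedding per input token and nothing else, while GPT2LMHeadModel hands back next-token logits that autoregressive generation can sample from and append.

```python
import torch
from transformers import BertModel, BertTokenizer, GPT2LMHeadModel, GPT2Tokenizer

prompt = "The cat sat on the"

# Base BERT: one contextual embedding per input token, no prediction of a next token.
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    bert_out = bert(**bert_tok(prompt, return_tensors="pt"))
print(bert_out.last_hidden_state.shape)  # (1, num_tokens, 768) -- embeddings only

# GPT-2 with its LM head: the logits at the last position are a distribution
# over the vocabulary, i.e. a guess at the *next* token, which generation
# repeatedly samples and appends to the input.
gpt_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt = GPT2LMHeadModel.from_pretrained("gpt2")
with torch.no_grad():
    gpt_out = gpt(**gpt_tok(prompt, return_tensors="pt"))
print(gpt_out.logits.shape)              # (1, num_tokens, vocab_size)
next_id = int(gpt_out.logits[0, -1].argmax())
print(gpt_tok.decode([next_id]))         # greedy guess at the next word
```

(BERT does have a masked-LM head, but even that only fills in masked positions inside a fixed-length sequence rather than extending it, which is why it doesn't give you text generation for free.)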