I built and trained a BERT on my gaming laptop (RTX 3070) to ~94% of BERT-base's performance in ~17 hours* (BERT-base was trained on 4 TPUs for 4 days). This notebook goes over the whole process, from implementing and training a tokenizer, to pretraining, to finetuning. One feature that makes this BERT different from most (though not unique) is the use of relative position embeddings.
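For anyone curious about the relative position embeddings: the idea is to add a learned bias to the attention logits based on the distance between query and key positions, rather than (or in addition to) using absolute position embeddings. A rough PyTorch sketch of one common scheme - the names and the clamping are my own illustration, not necessarily what the notebook does:

    import torch
    import torch.nn as nn

    class RelativePositionBias(nn.Module):
        """Learned bias added to attention logits, indexed by the (clamped)
        distance between query and key positions. Illustrative only."""
        def __init__(self, num_heads, max_distance=128):
            super().__init__()
            self.max_distance = max_distance
            # one learnable scalar per head per clamped relative distance
            self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

        def forward(self, seq_len):
            pos = torch.arange(seq_len)
            rel = pos[None, :] - pos[:, None]                      # (seq, seq)
            rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
            # -> (1, num_heads, seq, seq), broadcastable over the batch
            return self.bias(rel).permute(2, 0, 1).unsqueeze(0)

    # usage inside attention: scores = q @ k.transpose(-2, -1) / d**0.5 + rel_bias(seq_len)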
Edit - for anyone unsure about what "BERT" is or its relevance, it's a transformer-based natural language model just like GPT. However, where GPT is used to generate text, BERT is used to generate embeddings for input text that you can then use for predictive models (e.g. sentiment prediction), and that process is also demonstrated in the notebook.
*Edit 2 - The 17 hours are pretraining only, not including the time to train the tokenizer, or finetuning.
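To make the "embeddings you feed into a predictive model" part concrete, the usual pattern is to pool BERT's per-token outputs (often just the [CLS] token) and put a small classifier head on top. Hypothetical sketch using the Hugging Face transformers API with a public checkpoint, not the notebook's own model:

    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    # hypothetical example with an off-the-shelf checkpoint
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    classifier = nn.Linear(encoder.config.hidden_size, 2)  # e.g. negative/positive

    inputs = tokenizer("This movie was great", return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state           # (batch, seq_len, hidden)
    cls_embedding = hidden[:, 0]                           # the [CLS] token's embedding
    logits = classifier(cls_embedding)                     # sentiment head on top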
> it's a transformer based natural language model just like GPT
It's an encoder-decoder model whereas GPT is decoder-only. Feels like a pretty big difference, though in practice I honestly still don't have a strong grasp of how encoder-decoder is deficient compared to decoder-only when it comes to text generation. I get that BERT was designed for translation, but why can't we scale it up and use it for textgen just the same?
BERT is encoder only and was designed for classification and natural language inference problems. The original Transformer was encoder-decoder and was designed for translation.
BERT can't be used in an autoregressive way because it doesn't output a new token; it simply generates embeddings from the existing tokens (you get one for each input token).
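You can see the difference in the output shapes. Hypothetical sketch with off-the-shelf Hugging Face checkpoints (not the notebook's model):

    from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

    bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    ids = bert_tok("the cat sat", return_tensors="pt")
    # BERT: one embedding per input token, nothing to sample a next token from
    print(bert(**ids).last_hidden_state.shape)   # (1, num_tokens, 768)

    gpt_tok = AutoTokenizer.from_pretrained("gpt2")
    gpt = AutoModelForCausalLM.from_pretrained("gpt2")
    gpt_ids = gpt_tok("the cat sat", return_tensors="pt")
    # decoder-only LM: next-token logits over the whole vocabulary at each position,
    # which is what makes autoregressive generation possible
    print(gpt(**gpt_ids).logits.shape)           # (1, num_tokens, vocab_size)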
> where GPT is used to generate text, BERT is used to generate embeddings for input text that you can then use for predictive models (e.g. sentiment prediction)
Ok, but isn't text generation more general? E.g. you could ask it to predict the sentiment of a sentence and write the result as a sentence?
Yeah, my explanation was definitely a lossy summary. You can do similar things with GPT, but BERT is bidirectional, so for a given token it can take into account both the tokens before and after it, whereas GPT would only take into account the tokens before it. Looking both ways can be helpful. Another comment in this thread explains the same (maybe more clearly).
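The "looking both ways" part is literally just the attention mask. Illustrative sketch:

    import torch

    seq_len = 5

    # BERT-style (bidirectional): every token may attend to every other token
    bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

    # GPT-style (causal): token i may only attend to tokens 0..i
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    print(causal_mask.int())
    # tensor([[1, 0, 0, 0, 0],
    #         [1, 1, 0, 0, 0],
    #         [1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 0],
    #         [1, 1, 1, 1, 1]])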
Yeah, he's glossing over some things, but with good reason.
Might be more accurate to say BERT is a discriminative model, while GPT is a generative model. BERT was trained with a masked language modeling objective, which is different from the next-token-prediction, decoder-only setup used for the first GPT. Sentiment prediction is just one particular thing BERT is capable of; there are many more, but GPT has sort of steered the industry towards generative models.
https://huggingface.co/docs/transformers/main/tasks/masked_l...
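For reference, the masking step of MLM roughly looks like this (simplified sketch - the real BERT recipe also leaves some selected tokens unchanged or swaps in random tokens instead of [MASK]):

    import torch

    def mask_tokens(input_ids, mask_token_id, mlm_prob=0.15):
        """Pick ~15% of positions, replace them with [MASK], and keep the
        original ids as labels only at those positions (simplified)."""
        labels = input_ids.clone()
        masked = torch.rand(input_ids.shape) < mlm_prob
        labels[~masked] = -100              # ignored by the cross-entropy loss
        corrupted = input_ids.clone()
        corrupted[masked] = mask_token_id
        return corrupted, labels            # model is trained to predict `labels`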
GPT and BERT were actually among the first models published after "Attention Is All You Need" came out of Google.
Haha fair question. I didn't make any special changes, I just left the lid open and put the laptop in a ventilated spot. I'm actually in the tropics, so I guess Lenovo scores some points here (the laptop is a Legion 5 Pro).
If you want to really cook your lap, try running Gentoo! Emerging (compiling) Firefox, glibc, gcc, LibreOffice and a few of their friends will soon show you how good the cooling is.
A few years back *cough* I upgraded gcc from 3 to 4 and emerged system and then world. That was over 1200 packages, and it took about a week. That was in the days when I used a Windows wifi driver and some unholy magic to get a connection. I parked the laptop on a table with the lid open and two metal rods lifting it up 6" for airflow. I left the nearby window open a bit too.
Going off on a tangent: I used Gentoo in the past. I suspect that if you use one of the common processors, compiling your own binaries doesn't really give you any performance benefits, does it?
(I have to admit I stopped using Gentoo mostly because it encouraged me to endlessly fiddle with my system, and it would invariably end up broken somehow. That's entirely my fault, and not Gentoo's. I switched to Archlinux as my distribution of choice, and I manage to hold myself back enough not to destroy my installation.)
Quite a bit of performance. Generally, Linux distros (and really all general-purpose OSes) need to limit the CPU features they target to a minimum common baseline. Does your machine support AVX-512 instructions? Those instructions won't be used by the compiler, because they're not available everywhere the software will run. By compiling yourself, you can specialize the compilation to the features of your machine.
Beyond that, the big win, even over performance, is customization. The most secure code, and the fastest code, is the code that isn't there at all. Do you really need your entire system to support LDAP authentication? Maybe... What about your local email daemon, do you need that? Because cron does, and since your mail daemon also has MySQL support built in, installing cron gets you the MySQL libraries.
I don't use it anymore because of the overhead, but there are a lot of performance and security benefits to be had there.
> Realistically, though, most software that can benefit from specialized instructions already detects their availability at runtime and uses those, even if the code was compiled with -march=x86-64.
It's all behind the submission link! I've set it up so that you can run it start to end, if you want. The only thing I'm not 100% sure about is resource requirements - I have an 8GB GPU and 32GB of RAM, it could be that if you have less than that you'd run into out of memory errors. Those would be fairly straightforward to fix, though (honestly I'd be happy to help if someone runs into this).
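If anyone does hit out-of-memory errors, the usual fix is to shrink the batch size and compensate with gradient accumulation so the effective batch size stays the same. Generic sketch with toy stand-ins, not the notebook's actual training loop:

    import torch
    import torch.nn as nn

    # toy stand-ins; in the notebook this would be the BERT model and real batches
    model = nn.Linear(10, 2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    data = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

    accumulation_steps = 4              # micro-batches of 8 behave like a batch of 32
    optimizer.zero_grad()
    for step, (x, y) in enumerate(data):
        loss = loss_fn(model(x), y) / accumulation_steps
        loss.backward()                 # gradients accumulate across micro-batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()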