
Thank you. I think the paper as it stands provides enough evidence to support the claims. If I understand the authors correctly, they trained the compared models on only 100B tokens, all drawn from RedPajama, to keep the comparison apples-to-apples. That's sensible, and it makes the results easier to replicate. That said, I agree with you that more extensive testing, after more extensive pretraining, is still necessary.


That's true, but why limit it to 100B tokens? And why not include the loss curves to show that both models have actually converged? What this paper doesn't demonstrate to me is the model's ability to scale and generalize to larger datasets. It's easy to see how a model of sufficient size can overcome the quantization bottleneck when trained on such a small dataset, which is perhaps why the smaller variants failed.
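
To make the capacity intuition concrete, here is a rough back-of-the-envelope sketch in Python. The model size, the fp16 baseline, and the ternary-weight assumption are illustrative guesses rather than figures from the paper; only the 100B-token count comes from this thread.

  # Rough sketch of the capacity argument: even heavily quantized weights
  # may hold plenty of information relative to a small training set.
  # All figures below are hypothetical, not taken from the paper.
  import math

  def weight_capacity_bits(n_params: float, bits_per_weight: float) -> float:
      """Raw information capacity of the weights, in bits."""
      return n_params * bits_per_weight

  n_params = 3e9             # hypothetical model size (parameters)
  tokens = 100e9             # training-set size mentioned in the thread
  bits_full = 16             # e.g. an fp16/bf16 baseline
  bits_quant = math.log2(3)  # ternary weights carry log2(3) ~= 1.58 bits each

  for label, bits in [("full precision", bits_full), ("quantized", bits_quant)]:
      cap = weight_capacity_bits(n_params, bits)
      print(f"{label:>14}: {cap:.2e} bits of weight capacity "
            f"({cap / tokens:.3f} bits per training token)")

The ratio of weight capacity to training tokens shrinks as the dataset grows, which is exactly why results at 100B tokens may not tell us whether quantization becomes the binding constraint at larger scale.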



