
I have a set of tests, implemented in several languages (e.g. the same tests in Rust, TypeScript, Python, Swift), that I can run against different models. Of these languages, all models have by far the most difficulty with Rust; scores on the same tests are notably higher in the other languages. I'm currently preparing the whole thing for release, but it's not ready yet because some urgent work-work came up.


Can confirm anecdotally. Even R1 (the full, official version with web search enabled) crashes out hard on my personal Rust benchmark - it refers to multiple items (methods, constants) that don't exist and fails to import basic necessary traits like io::Read. Embarrassing, and does little to challenge my belief that these models will never reliably advance beyond boilerplate.

(My particular test is to ask for an ICMP BPF that does some simple constant comparisons. Correctly implemented, this only takes 6 sock_filters.)
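For readers wondering what a six-instruction filter of this kind might look like, here is a rough sketch. It is not the commenter's actual test, whose exact comparisons aren't given. It assumes untagged Ethernet II frames and mirrors the classic `icmp` filter shape: check EtherType at offset 12, check the IP protocol byte at offset 23, then accept or reject. `SockFilter` is a local stand-in with the same layout as the kernel's `struct sock_filter`, and `run` is a hypothetical mini-interpreter covering only the four opcodes used, included so the program can be checked without a socket.

```rust
// Local mirror of the kernel's `struct sock_filter` (code, jt, jf, k).
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SockFilter {
    pub code: u16,
    pub jt: u8,
    pub jf: u8,
    pub k: u32,
}

// Classic BPF opcode pieces, values as in <linux/bpf_common.h>.
const BPF_LD: u16 = 0x00;
const BPF_JMP: u16 = 0x05;
const BPF_RET: u16 = 0x06;
const BPF_H: u16 = 0x08;
const BPF_B: u16 = 0x10;
const BPF_ABS: u16 = 0x20;
const BPF_JEQ: u16 = 0x10;
const BPF_K: u16 = 0x00;

// The four combined opcodes this sketch needs.
const LDH_ABS: u16 = BPF_LD | BPF_H | BPF_ABS;
const LDB_ABS: u16 = BPF_LD | BPF_B | BPF_ABS;
const JEQ_K: u16 = BPF_JMP | BPF_JEQ | BPF_K;
const RET_K: u16 = BPF_RET | BPF_K;

const fn stmt(code: u16, k: u32) -> SockFilter {
    SockFilter { code, jt: 0, jf: 0, k }
}

const fn jump(code: u16, k: u32, jt: u8, jf: u8) -> SockFilter {
    SockFilter { code, jt, jf, k }
}

/// Accept IPv4 ICMP frames, reject everything else (assumes Ethernet II, no VLAN tag).
pub fn icmp_filter() -> [SockFilter; 6] {
    [
        stmt(LDH_ABS, 12),          // A = EtherType
        jump(JEQ_K, 0x0800, 0, 3),  // IPv4? otherwise skip ahead to reject
        stmt(LDB_ABS, 23),          // A = IPv4 protocol field
        jump(JEQ_K, 1, 0, 1),       // ICMP (protocol 1)? otherwise reject
        stmt(RET_K, 0x0004_0000),   // accept, keeping up to 262144 bytes
        stmt(RET_K, 0),             // reject
    ]
}

/// Hypothetical mini-interpreter for the four opcodes above, for checking only.
/// Returns the filter's verdict: 0 means drop, nonzero is the snapshot length.
pub fn run(prog: &[SockFilter], pkt: &[u8]) -> u32 {
    let mut a: u32 = 0; // accumulator
    let mut pc = 0usize;
    while pc < prog.len() {
        let ins = prog[pc];
        pc += 1;
        let off = ins.k as usize;
        match ins.code {
            LDH_ABS => match pkt.get(off..off + 2) {
                Some(b) => a = u16::from_be_bytes([b[0], b[1]]) as u32,
                None => return 0, // out-of-bounds load drops the packet
            },
            LDB_ABS => match pkt.get(off) {
                Some(&b) => a = b as u32,
                None => return 0,
            },
            JEQ_K => {
                // Jump offsets are relative to the next instruction.
                let step = if a == ins.k { ins.jt } else { ins.jf };
                pc += step as usize;
            }
            RET_K => return ins.k,
            _ => return 0, // opcode not covered by this sketch
        }
    }
    0
}
```

On Linux, an array like this would typically be wrapped in a `sock_fprog` and attached to a socket with `setsockopt` and `SO_ATTACH_FILTER`; that part is omitted here, and the offsets would differ for a socket that doesn't see the Ethernet header.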



