I have a set of tests that I can run against different models implemented in dif...

colonial · on Jan 29, 2025

Can confirm anecdotally. Even R1 (the full, official version with web search enabled) crashes out hard on my personal Rust benchmark - it refers to multiple items (methods, constants) that don't exist and fails to import basic necessary traits like io::Read. Embarrassing, and does little to challenge my belief that these models will never reliably advance beyond boilerplate.

(My particular test is to ask for an ICMP BPF that does some simple constant comparisons. Correctly implemented, this only takes 6 sock_filters.)