i dunno, Opus is losing its edge imo. i regularly use a mix of models, including Opus, GLM 5.1, Kimi 2.6, etc., and i find that all of them are pretty much equally good at "average" coding, while on difficult stuff they're nearly equally bad. i can't deny that Opus has an edge, but it's not a huge one.
I won't quibble, even though I likely should. Have to remember this is HN, and companies need to shill their work, otherwise ... Yes.
I will play along and assume this is sound. 10-40% +/- 10% is along the lines of "sort of", in a completely unreliable, unguaranteed, and unproven way, sure.
i don't buy this. distilled how? you don't get access to logprobs, and the thinking traces are fake and compressed. it's an expensive way to get potentially substandard training data.
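For context on why logprob access matters to the distillation question above: classic distillation trains the student against the teacher's full output distribution, which API-only access to a frontier model doesn't expose. Below is a minimal sketch, assuming a PyTorch-style setup; the function names and tensors are illustrative, not any lab's actual pipeline.

    import torch
    import torch.nn.functional as F

    # Soft-label distillation (Hinton-style): needs the teacher's full
    # logits/logprobs over the vocabulary. An API that only returns
    # sampled text never gives you these, so this loss is unavailable.
    def soft_distill_loss(student_logits, teacher_logits, T=2.0):
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        # KL between softened distributions, scaled by T^2 as usual
        return F.kl_div(log_p_student, p_teacher,
                        reduction="batchmean") * T * T

    # Hard-label "distillation": all an API gives you is sampled tokens,
    # so you're reduced to plain cross-entropy on the teacher's outputs,
    # i.e. ordinary supervised fine-tuning on generated text.
    def hard_distill_loss(student_logits, teacher_token_ids):
        vocab = student_logits.size(-1)
        return F.cross_entropy(student_logits.view(-1, vocab),
                               teacher_token_ids.view(-1))

Without teacher logits, the first loss is off the table and you're stuck with the second, which is the parent's point: sampled text (plus possibly compressed or synthetic thinking traces) is a much weaker and more expensive training signal.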
better than Opus? not even close. after struggling through server overload for the past couple of hours, i finally put 5.1 through its paces and it's... okay. it failed some simple stuff that Sonnet/Opus/Gemini didn't, and failed it badly and repeatedly, actually. this was in TypeScript, btw. not sure if i'll keep the subscription or not.
after you go from millions of params to billions+, models start to get weird (depending on training). just look at any number of interpretability research papers; Anthropic has some good ones.