
Any tips on coming up with good private evals?


Yes, I wrote something up here on how Andrej Karpathy evaluated Grok 3 -> https://tomhipwell.co/blog/karpathy_s_vibes_check/

I would pick the one or two parts of that analysis that are most relevant to you and zoom in. I'd choose something difficult that the model fails at, then look carefully at how the model's failures change as you test different model generations.
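To make that concrete, here is a minimal sketch of tracking failures across generations. Everything in it is an assumption for illustration: `EvalCase`, `failures`, `compare_generations`, the three example prompts, and the two stub "models" are hypothetical, and the stubs return canned answers rather than calling any real API. You would swap in your own hard cases and real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    passes: Callable[[str], bool]  # True if the model's output is acceptable


def failures(model: Callable[[str], str], cases: List[EvalCase]) -> Set[str]:
    """Return the names of the cases this model fails."""
    return {c.name for c in cases if not c.passes(model(c.prompt))}


def compare_generations(old: Set[str], new: Set[str]) -> Dict[str, List[str]]:
    """Bucket case names by how they changed between two generations."""
    return {
        "fixed": sorted(old - new),
        "regressed": sorted(new - old),
        "still_failing": sorted(old & new),
    }


# Hypothetical private cases: pick things you know are hard for the model.
CASES = [
    EvalCase("count_r", "How many times does 'r' appear in 'strawberry'?",
             lambda out: out.strip() == "3"),
    EvalCase("compare_decimals", "Which is larger, 9.11 or 9.9?",
             lambda out: out.strip() == "9.9"),
    EvalCase("reverse_word", "Reverse the word 'lollipop'.",
             lambda out: out.strip() == "popillol"),
]

# Stub "models" with canned answers (assumption: stand-ins for real API calls).
V1_ANSWERS = {"count_r": "2", "compare_decimals": "9.11", "reverse_word": "popillol"}
V2_ANSWERS = {"count_r": "3", "compare_decimals": "9.11", "reverse_word": "popilol"}


def make_stub(answers: Dict[str, str]) -> Callable[[str], str]:
    """Build a fake model that looks up its canned answer by prompt."""
    by_prompt = {c.prompt: answers[c.name] for c in CASES}
    return lambda prompt: by_prompt[prompt]


report = compare_generations(
    failures(make_stub(V1_ANSWERS), CASES),
    failures(make_stub(V2_ANSWERS), CASES),
)
print(report)
```

The "regressed" and "still_failing" buckets are usually the interesting ones: those are the cases worth reading by hand to see how the failure itself changed, not just whether it passed.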



