One of the worst is TikTok, even as a developer, when someone sends me a TikTok link and I have to visit it, I get stuck in the browser (same with the app but I uninstalled it), and it feels almost device-breaking the way they trap you in.
I guess writing code is now like creating punch-cards for old computers. Or even more recently, as writing ASM instead of using a higher level language like C. Now we simply write our "code" in a higher language, natural language, and the LLM is the compiler.
It needs to be something stronger than just deterministic.
With the right settings, a LLM is deterministic. But even then, small variations in input can cause very unforeseen changes in output, sometimes drastic, sometimes minor. Knowing that I'm likely misusing the vocabulary, I would go with saying that this counts as the output being chaotic so we need compilers to be non-chaotic (and deterministic, I think you might be able to have something that is non-deterministic and non-chaotic). I'm not sure that a non-chaotic LLM could ever exist.
(Thinking on it a bit more, there are some esoteric languages that might be chaotic, so this might be more difficult to pin down than I thought.)
Also, give the same programming task to 2 devs and you end up with 2 different solutions. Heck, have the same dev do the same thing twice and you will have 2 different ones.
Determinism seems like this big gotcha, but in it self, is it really?
> Heck, have the same dev do the same thing twice and you will have 2 different ones
"Do the same thing" I need to be pedantic here because if they do the same thing, the exact same solution will be produced.
The compiler needs to guarantee that across multiple systems. How would QA know they're testing the version that is staged to be pushed to prod if you can't guarantee it's the same ?
I cringe every time I read this "punch card" narrative. We are not at this stage at all. You are comparing deterministic stuff and LLMs which are not deterministic and may or may not give you what you want. In fact I personally barely use autonomous Agents in my brownfield codebase because they generate so much unmaintainable slop.
Except that compiler is a non-deterministic pull of a slot-machine handle. No thanks, I'll keep my programming skills; COBOL programmers command a huge salary in 2026, soon all competent programmers will.
Releasing version 9.0 of my self-hosted analytics app[0]. I will finally add an in-app cron job editor, so you can easily schedule clean-up jobs, data retention settings, newsletters/summaries, etc.
I've spent probably over100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others that reached out to me) are not useless either. I use this myself regularly when choosing and comparing new models. I honestly beleive it is providing value to the conversation.
Let me know if you know of a better platform you can use to compare models, I built this one because I didn't find any with good enough UX.
Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock solid opus experience.
Their benchmark is chock-full of things like that: It's deeply flawed and is essentially rating how LLMs perform if you exert yourself trying to hold them entirely the wrong way.
I tried using Astro for https://aibenchy.com, initially it went great, but then I got into static-website limitations (such as dynamically generating all comparison pages, which would been generating N^4 pages, where N is the number of tested models).
I ended up switching to plain PHP, and it worked great. It is still mostly "static", but I can dynamically include the same content on multiple pages without having to duplicate/build it every time.
One of the worst is TikTok, even as a developer, when someone sends me a TikTok link and I have to visit it, I get stuck in the browser (same with the app but I uninstalled it), and it feels almost device-breaking the way they trap you in.
reply