XCSme · 2026-04-14T12:06:23 1776168383

Thank you!

One of the worst is TikTok. Even as a developer, when someone sends me a TikTok link and I have to visit it, I get stuck in the browser (same with the app, but I uninstalled it), and it feels almost device-breaking the way they trap you in.

saagarjha · 2026-04-14T12:22:34 1776169354

TikTok is actually very adamant about booting me out of the browser.

XCSme · 2026-04-14T10:29:50 1776162590

Initially I thought this was about their B2 file versions/backups, where they keep older versions of your files.

Hamuko · 2026-04-14T11:01:55 1776164515

B2 is not a backup service. It’s an object storage service.

xmcp123 · 2026-04-14T13:40:00 1776174000

Weird, because in the Reddit thread linked above they call themselves a backup service.

XCSme · 2026-04-14T23:51:52 1776210712

I guess you were as confused as me, as I only associate BackBlaze with B2; I haven't used any of their other services.

XCSme · 2026-04-13T14:13:43 1776089623

It just describes what's in the photo and then some completely wrong/random facts about self-esteem, income, religion, etc.

XCSme · 2026-04-13T13:58:49 1776088729

I guess writing code is now like creating punch cards for old computers. Or, more recently, like writing ASM instead of using a higher-level language like C. Now we simply write our "code" in a higher language, natural language, and the LLM is the compiler.

bilekas · 2026-04-13T14:06:32 1776089192

> Now we simply write our "code" in a higher language, natural language, and the LLM is the compiler.

No, we don't, and we never should. Compilers need to be deterministic.

SkyBelow · 2026-04-13T14:22:13 1776090133

It needs to be something stronger than just deterministic.

With the right settings, an LLM is deterministic. But even then, small variations in input can cause very unforeseen changes in output, sometimes drastic, sometimes minor. Knowing that I'm likely misusing the vocabulary, I would go with saying that this counts as the output being chaotic, so we need compilers to be non-chaotic (and deterministic; I think you might be able to have something that is non-deterministic and non-chaotic). I'm not sure that a non-chaotic LLM could ever exist.

(Thinking on it a bit more, there are some esoteric languages that might be chaotic, so this might be more difficult to pin down than I thought.)
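
To make the "deterministic but chaotic" distinction concrete, here is a minimal Python sketch. It uses SHA-256 purely as a stand-in for any function that is fully deterministic yet highly sensitive to tiny input changes; it is not a model of how an LLM works, and the two prompt strings are made up for illustration.

```python
import hashlib


def digest_bits(text: str) -> str:
    """Return the SHA-256 digest of `text` as a 256-character bit string."""
    raw = hashlib.sha256(text.encode("utf-8")).digest()
    return "".join(f"{byte:08b}" for byte in raw)


def differing_bits(a: str, b: str) -> int:
    """Count the bit positions where the digests of `a` and `b` differ."""
    return sum(x != y for x, y in zip(digest_bits(a), digest_bits(b)))


# Hypothetical inputs: the second differs only by a trailing period.
prompt = "sort the list ascending"
tweaked = "sort the list ascending."

# Deterministic: the same input always gives the same output.
same_input_diff = differing_bits(prompt, prompt)

# "Chaotic" in the loose sense used above: a one-character change
# flips a large share of the 256 output bits.
small_change_diff = differing_bits(prompt, tweaked)

print(same_input_diff, small_change_diff)
```

A compiler sits at the other end: adding a comment or renaming a local variable usually leaves the generated program's behavior unchanged, which is roughly the "non-chaotic" property being asked for.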

Farox · 2026-04-13T15:30:23 1776094223

Why?

Also, give the same programming task to 2 devs and you end up with 2 different solutions. Heck, have the same dev do the same thing twice and you will have 2 different ones.

Determinism seems like this big gotcha, but in itself, is it really?

bilekas · 2026-04-13T15:35:29 1776094529

> Heck, have the same dev do the same thing twice and you will have 2 different ones

"Do the same thing" I need to be pedantic here because if they do the same thing, the exact same solution will be produced.

The compiler needs to guarantee that across multiple systems. How would QA know they're testing the version that is staged to be pushed to prod if you can't guarantee it's the same?
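
As a rough illustration of that QA concern, the sketch below compares two build artifacts by content hash. The function names and the byte strings are hypothetical; a real pipeline would hash the actual files produced by the build.

```python
import hashlib


def artifact_fingerprint(artifact: bytes) -> str:
    """Return a hex SHA-256 fingerprint of a build artifact's contents."""
    return hashlib.sha256(artifact).hexdigest()


def same_build(tested: bytes, staged: bytes) -> bool:
    """True only if QA's artifact is byte-for-byte what is staged for prod."""
    return artifact_fingerprint(tested) == artifact_fingerprint(staged)


# Hypothetical artifacts: a deterministic toolchain would reproduce the
# first one exactly; a non-deterministic one might yield the second.
qa_artifact = b"binary produced from commit abc123"
reproduced = b"binary produced from commit abc123"
drifted = b"binary produced from commit abc123 (different codegen)"

print(same_build(qa_artifact, reproduced))
print(same_build(qa_artifact, drifted))
```

If the same source can produce different artifacts on each run, this check fails even though nothing in the source changed, and QA's sign-off no longer says anything about what ships.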

acedTrex · 2026-04-13T14:14:12 1776089652

This is not what a compiler is in any sense.

TheRoque · 2026-04-13T14:02:19 1776088939

I cringe every time I read this "punch card" narrative. We are not at this stage at all. You are comparing deterministic stuff and LLMs which are not deterministic and may or may not give you what you want. In fact I personally barely use autonomous Agents in my brownfield codebase because they generate so much unmaintainable slop.

bigfishrunning · 2026-04-13T14:02:58 1776088978

Except that compiler is a non-deterministic pull of a slot-machine handle. No thanks, I'll keep my programming skills; COBOL programmers command a huge salary in 2026, soon all competent programmers will.

XCSme · 2026-04-13T01:35:32 1776044132

Releasing version 9.0 of my self-hosted analytics app[0]. I will finally add an in-app cron job editor, so you can easily schedule clean-up jobs, data retention settings, newsletters/summaries, etc.

[0]: https://www.uxwizz.com

XCSme · 2026-04-08T00:12:51 1775607171

General intelligence (not coding) comparison: https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...

BoorishBears · 2026-04-08T04:19:53 1775621993

Is there really no rule that discourages 99% of your interactions with HN from being peddling some useless slop benchmark?

XCSme · 2026-04-08T08:05:47 1775635547

If it's relevant to the discussion, I hope not.

I've spent probably over 100 hours working on this benchmarking site/platform, and all tests are manually written. For me (and many others who reached out to me), it is not useless either. I use it myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.

Let me know if you know of a better platform you can use to compare models, I built this one because I didn't find any with good enough UX.

jaggs · 2026-04-08T10:10:16 1775643016

It's a great benchmark. Don't listen to the haters. This one is especially interesting.

https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

BoorishBears · 2026-04-08T18:16:58 1775672218

This one's even more interesting

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Who knew Anthropic was this far behind???

jaggs · 2026-04-08T19:03:23 1775675003

Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock-solid Opus experience.

BoorishBears · 2026-04-09T08:10:19 1775722219

Their benchmark is chock-full of things like that: It's deeply flawed and is essentially rating how LLMs perform if you exert yourself trying to hold them entirely the wrong way.

XCSme · 2026-04-08T00:10:17 1775607017

GLM 5.1 does worse than GLM 5 in my tests[0] (both medium reasoning OR no reasoning).

I think the model is now tuned more towards agentic use/coding than general intelligence.

[0]: https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...

XCSme · 2026-04-08T00:11:10 1775607070

The (none) version especially shows considerable degradation.

XCSme · 2026-04-05T21:58:57 1775426337

Gemma 4 is great: https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...

I assume it is the 26B A4B one, if it runs locally?

adrian17 · 2026-04-05T22:24:17 1775427857

No, only E2B and E4B.

XCSme · 2026-04-05T21:53:55 1775426035

I tried using Astro for https://aibenchy.com. Initially it went great, but then I ran into static-website limitations (such as dynamically generating all comparison pages, which would have meant generating N^4 pages, where N is the number of tested models).

I ended up switching to plain PHP, and it worked great. It is still mostly "static", but I can dynamically include the same content on multiple pages without having to duplicate/build it every time.
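
For a sense of scale, here is a small Python sketch of how quickly an N^4 page count grows. The comment does not explain where the fourth power comes from, so the exponent is simply taken from it, and the values of N are made up for illustration.

```python
def static_page_count(num_models: int, exponent: int = 4) -> int:
    """Pages a static build would have to pre-render, assuming N**exponent."""
    return num_models ** exponent


# Hypothetical model counts; the real number of tested models is not stated.
for n in (10, 50, 100):
    print(n, static_page_count(n))
# prints:
# 10 10000
# 50 6250000
# 100 100000000
```

Rendering a comparison page on request sidesteps this, since only the pages that are actually visited ever get generated.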

XCSme · 2026-04-02T22:29:31 1775168971

It does quite well on my limited/not-so-scientific private tests (note the tests don't include coding tests): https://aibenchy.com/compare/google-gemma-4-31b-it-medium/go...