Load just makes LLMs behave less deterministically and likely degrade. See: http...

bgirard · 2026-01-29T17:01:56 1769706116

> malicious

It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits it matters to me.

I care about -expected- performance when picking which model to use, not optimal benchmark performance.

Aurornis · 2026-01-29T17:38:58 1769708338

Non-determinism isn’t the same as degradation.

The non-determinism means that even with a temperature of 0.0, you can’t expect the outputs to be the same across API calls.

In practice people tend to index to the best results they’ve experienced and view anything else as degradation. It may just be randomness in either direction from the prompts. When you’re getting good results you assume it’s normal. When things feel off you think something abnormal is happening. Rerun the exact same prompts and context with temperature 0 and you might get a different result.
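
A minimal sketch of the "randomness in either direction" point, assuming a hypothetical benchmark of 50 tasks per day and a model whose true pass probability is fixed at 0.6 (both numbers are made up for illustration, not taken from any real measurement). The model never changes, yet the daily score still moves:

```python
import math
import random

def simulate_daily_pass_rates(p, tasks_per_day, days, seed):
    """Simulate daily pass rates for a model whose true quality never changes."""
    rng = random.Random(seed)
    rates = []
    for _ in range(days):
        # Each task passes independently with probability p.
        passed = sum(1 for _ in range(tasks_per_day) if rng.random() < p)
        rates.append(passed / tasks_per_day)
    return rates

def standard_error(p, n):
    """Standard error of a pass rate measured on n independent tasks."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical numbers, chosen only for illustration.
rates = simulate_daily_pass_rates(p=0.6, tasks_per_day=50, days=7, seed=0)
print(rates)
print(round(standard_error(0.6, 50), 3))  # about 0.069
```

With a standard error of roughly 7 percentage points per day under these assumptions, a single bad day is weak evidence on its own; a sustained shift in the mean is the thing to look for.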

bonoboTP · 2026-01-29T20:05:22 1769717122

This has nothing to do with overloading. The suspicion is that when there is too much demand (or they just want to save costs), Anthropic sometimes uses a less capable (quantized, distilled, etc) version of the model. People want to measure this so there is concrete evidence instead of hunches and feelings.

To say that this measurement is bad because the server might just be overloaded completely misses the point. The point is to see if the model sometimes silently performs worse. If I get a response from "Opus", I want a response from Opus. Or at least want to be told that I'm getting slightly-dumber-Opus this hour because the server load is too much.

F7F7F7 · 2026-01-29T22:54:37 1769727277

“Just drink the water, it’s all water.”

novaleaf · 2026-01-29T17:41:10 1769708470

this is about variance of daily statistics, so I think the suggestions are entirely appropriate in this context.

strongpigeon · 2026-01-29T18:32:44 1769711564

The question I have now after reading this paper (which was really insightful) is do the models really get worse under load, or do they just have a higher variance? It seems like the latter is what we should expect, not it getting worse, but absent load data we can't really know.

altcognito · 2026-01-29T17:08:31 1769706511

Explain this though. The code is deterministic, even if it relies on pseudo random number generation. It doesn't just happen, someone has to make a conscious decision to force a different code path (or model) if the system is loaded.

minimaltom · 2026-01-29T18:06:16 1769709976

It's not deterministic. Any individual floating point mul/add is deterministic, but in a GPU these are all happening in parallel and the accumulation is in the order they happen to complete.

When you add A then B then C, you get a different answer than C then A then B, because floating point, approximation error, subnormals etc.
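
A small illustration of the ordering effect, using plain Python floats (IEEE 754 doubles) rather than GPU code. The values are arbitrary; the point is only that grouping changes the rounded result:

```python
def sum_left_first(a, b, c):
    """Add a and b first, then c."""
    return (a + b) + c

def sum_right_first(a, b, c):
    """Add b and c first, then a."""
    return a + (b + c)

left = sum_left_first(0.1, 0.2, 0.3)
right = sum_right_first(0.1, 0.2, 0.3)
print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

The difference here is in the last bit. Whether such differences ever change a sampled token depends on how close the top logits are.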

bonoboTP · 2026-01-29T22:29:58 1769725798

It can be made deterministic. It's not trivial and can slow it down a bit (not much) but there are environment variables you can set to make your GPU computations bitwise reproducible. I have done this in training models with Pytorch.
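
For reference, a sketch of the kind of settings this refers to in PyTorch, assuming a CUDA backend. `CUBLAS_WORKSPACE_CONFIG`, `torch.use_deterministic_algorithms`, and `torch.backends.cudnn.benchmark` are existing cuBLAS/PyTorch knobs; the helper name `enable_reproducibility` is hypothetical, and nothing here describes what any hosted inference provider actually does.

```python
import os

def enable_reproducibility(seed=0):
    """Request reproducible GPU math in PyTorch, if PyTorch is installed.

    Returns True if the PyTorch settings were applied, False otherwise.
    """
    # cuBLAS reads this when it is first used, so set it before any CUDA work.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    try:
        import torch
    except ImportError:
        return False
    torch.manual_seed(seed)
    # Raises an error later if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    # Disable cuDNN autotuning, which can pick different kernels per run.
    torch.backends.cudnn.benchmark = False
    return True

print(enable_reproducibility())
```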

minimaltom · 2026-01-29T23:47:56 1769730476

There are settings to make it reproducible but they incur a non-negligible drop in performance.

Unsurprising given they amount to explicit synchronization to make the order of operations deterministic.

chrisjj · 2026-01-29T17:43:52 1769708632

Not deterministic. https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

jmalicki · 2026-01-29T19:51:38 1769716298

For all practical purposes any code reliant on the output of a PRNG is non-deterministic in all but the most pedantic senses... And if the LLM temperature isn't set to 0 LLMs are sampling from a distribution.

If you're going to call a PRNG deterministic then the outcome of a complicated concurrent system with no guaranteed ordering is going to be deterministic too!

gmueckl · 2026-01-29T20:08:55 1769717335

No, this isn't right. There are totally legitimate use cases for PRNGs as sources of random number sequences following a certain probability distribution where freezing the seed and getting reproducibility is actually required.
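
A minimal sketch of the frozen-seed case, using Python's standard `random` module. The function name `draw` and the seed values are arbitrary choices for illustration:

```python
import random

def draw(seed, n=5):
    """Draw n samples from a standard normal using a PRNG with a fixed seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

first = draw(42)
second = draw(42)
print(first == second)  # True: same seed, same sequence
```

The samples still follow the requested distribution; fixing the seed only makes the particular sequence repeatable.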

jmalicki · 2026-01-29T21:23:45 1769721825

And for a complicated concurrent system you can also replay the exact timings and orderings as well!

gmueckl · 2026-01-30T09:35:12 1769765712

That's completely different from PRNGs. I don't understand why you think those things belong together.

bonoboTP · 2026-01-29T20:08:31 1769717311

How is this related to overloading? The nondeterminism should not be a function of overloading. It should just time out or reply slower. It will only be dumber if it gets rerouted to a dumber, faster model, e.g. a quantized one.

joquarky · 2026-01-30T01:23:22 1769736202

Temperature can't be literally zero, or it creates a divide by zero error.

When people say zero, it is shorthand for “as deterministic as this system allows”, but it's still not completely deterministic.

forgotTheLast · 2026-01-30T03:22:17 1769743337

Zero temp just uses argmax, which is what softmax approaches if you take the limit of T to zero anyway. So it could very well be deterministic.
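
A sketch of that limit in plain Python, assuming a hypothetical three-token vocabulary with made-up logits. Temperature 0 is handled as a special case (argmax) so no division by zero occurs; this is one common way to implement it, not a claim about any particular provider's sampler:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, for temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    """Index of the largest logit (argmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def token_distribution(logits, temperature):
    """Treat temperature 0 as argmax instead of dividing by zero."""
    if temperature == 0:
        one_hot = [0.0] * len(logits)
        one_hot[greedy_token(logits)] = 1.0
        return one_hot
    return softmax_with_temperature(logits, temperature)

logits = [2.0, 1.0, 0.5]  # made-up values
print(token_distribution(logits, 1.0))
print(token_distribution(logits, 0.01))  # nearly all mass on index 0
print(token_distribution(logits, 0))     # [1.0, 0.0, 0.0]
```

Argmax is deterministic given identical logits. If the logits themselves differ in the last bits between runs, a near-tie can still resolve differently.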

pertymcpert · 2026-01-29T17:44:15 1769708655

Floating point math isn't associative for operations that are associative in normal math.

measurablefunc · 2026-01-29T18:11:22 1769710282

That would just add up to statistical noise instead of 10% degradation over a week.

kevin_thibedeau · 2026-01-29T18:56:17 1769712977

Catastrophic error accumulation can produce more profound effects than noise.

measurablefunc · 2026-01-29T20:33:43 1769718823

Just to make sure I got this right. They serve millions of requests a day & somehow catastrophic error accumulation is what is causing the 10% degradation & no one at Anthropic is noticing it. Is that the theory?

pertymcpert · 2026-02-09T07:14:35 1770621275

FYI something in that region happened last August/September. Some inference bug triggered worse performance on TPUs vs GPUs.

make3 · 2026-01-29T22:17:59 1769725079

There's a million algorithms to make LLM inference more efficient as a tradeoff for performance, like using a smaller model, using quantized models, using speculative decoding with a more permissive rejection threshold, etc etc

FL33TW00D · 2026-01-29T17:39:42 1769708382

It takes a different code path for efficiency.

e.g.

kernel = kernel_x if batch_size > 1024 else kernel_y
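
A toy sketch of how such a dispatch can change numeric results, assuming the two kernels differ only in reduction order. The bodies of `kernel_x` and `kernel_y`, the helper `reduce_sum`, and the input values are all hypothetical; the inputs are deliberately extreme so the effect is visible, whereas real kernels typically differ only in the last bits:

```python
def kernel_x(values):
    """Left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total

def kernel_y(values):
    """Pairwise (tree) reduction, as a parallel kernel might do."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return kernel_y(values[:mid]) + kernel_y(values[mid:])

def reduce_sum(values, batch_size, threshold=1024):
    """Pick a kernel based on batch size, then reduce the same values."""
    kernel = kernel_x if batch_size > threshold else kernel_y
    return kernel(values)

values = [1e16, 1.0, -1e16, 1.0]  # exact sum is 2.0
print(reduce_sum(values, batch_size=2048))  # 1.0
print(reduce_sum(values, batch_size=8))     # 0.0
```

Both answers are wrong in different ways, which is nondeterminism across code paths. By itself it does not show that either path is systematically lower quality.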

stefan_ · 2026-01-29T18:33:54 1769711634

The primary (non-malicious, non-stupid) explanation given here is batching. But I think you would find, looking at large-scale inference, that the batch sizes being run on any given rig are fairly static: there is a sweet spot between memory consumption and GPU utilization for any given model part run individually, and generally GPUs do badly at job parallelism.

I think the more likely explanation is again with the extremely heterogeneous compute platforms they run on.

hatmanstack · 2026-01-29T19:39:59 1769715599

That's why I'd love to get stats on load/hardware/location of where my inference is running. Looking at you, Trainium.

bonoboTP · 2026-01-30T11:54:56 1769774096

Why do you think batching has anything to do with the model getting dumber? Do you know what batching means?

stefan_ · 2026-01-30T18:51:52 1769799112

Well if you were to read the link you might just find out! Today is your chance to be less dumb than the model!

bonoboTP · 2026-01-30T20:24:49 1769804689

I checked the link, it never says that the model's predictions get lower quality due to batching, just nondeterministic. I don't understand why people conflate these things. Also it's unlikely that they use smaller batch sizes when load is lower. They likely just spin GPU servers up and down based on demand, or more likely, reallocate servers and GPUs between different roles and tasks.

make3 · 2026-01-29T22:15:51 1769724951

It's very clearly a cost tradeoff that they control and that should be measured.