There is hidden state as plain as day merely in the fact that logits for token p...

gpm · 2025-07-07T01:44:29 1751852669

The LLM does not "have" a plan.

Arguably there's reason to believe it comes up with a plan when it is computing token probabilities, but it does not store it between tokens. I.e. it doesn't possess or "have" it. It simply comes up with a plan, emits a token, and throws away all its intermediate thoughts (including any plan) to start again from scratch on the next token.

barrkel · 2025-07-07T09:41:30 1751881290

I believe saying the LLM has a plan is a useful anthropomorphism for the fact that it does have hidden state that predicts future tokens, and this state conditions the tokens it produces earlier in the stream.

godshatter · 2025-07-07T16:50:01 1751907001

Are the devs behind the models adding their own state somehow? Do they have code that figures out a plan and use the LLM on pieces of it and stitch them together? If they do, then there is a plan, it's just not output from a magical black box. Unless they are using a neural net to figure out what the plan should be first, I guess.

I know nothing about how things work at that level, so these might not even be reasonable questions.

yorwba · 2025-07-07T09:11:34 1751879494

It's true that the last layer's output for a given input token only affects the corresponding output token and is discarded afterwards. But the penultimate layer's output affects the computation of the last layer for all future tokens, so it is not discarded, but stored (in the KV cache). Similarly for the antepenultimate layer affecting the penultimate layer and so on.

So there's plenty of space in intermediate layers to store a plan between tokens without starting from scratch every time.
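A minimal sketch of that dependency, under toy assumptions: a single attention step with identity projections (real models use learned matrices), where `hidden_states` stands in for a lower layer's outputs. The names `attend` and `run` are made up for illustration. Changing only the state at position 0 changes what the later positions compute, because they read it back from the cache.

```python
import math
from typing import List

Vector = List[float]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax attention of one query over all cached keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def run(hidden_states: List[Vector]) -> List[Vector]:
    """Process positions one at a time, appending to a growing cache."""
    key_cache: List[Vector] = []
    value_cache: List[Vector] = []
    outputs = []
    for h in hidden_states:
        # Toy assumption: keys and values are the hidden state itself.
        key_cache.append(h)
        value_cache.append(h)
        # Each position attends over everything stored so far.
        outputs.append(attend(h, key_cache, value_cache))
    return outputs


base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [[5.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # only position 0 differs

out_base = run(base)
out_changed = run(changed)
# Positions 1 and 2 have identical inputs in both runs, yet their outputs
# differ, because they read position 0's stored state through attention.
```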

NiloCK · 2025-07-07T01:53:20 1751853200

I don't think that the comment above you made any suggestion that the plan is persisted between token generations. I'm pretty sure you described exactly what they intended.

gugagore · 2025-07-07T10:36:25 1751884585

The concept of "state" conveys two related ideas.

- the information sufficient to evolve the system forward. The state of a pendulum is its position and velocity (or momentum). If you take a single picture of a pendulum, you do not have a representation that lets you make predictions.

- information that is persisted through time. A stateful protocol is one where you need to know the history of the messages to understand what will happen next. (Or, analytically, it's enough to keep track of the sufficient state.) A procedure with some hidden state isn't a pure function. You can make it a pure function by making the state explicit.
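The last point, turning a procedure with hidden state into a pure function, can be sketched as follows. `Counter` and `next_pure` are hypothetical names chosen for illustration; nothing here is specific to LLMs.

```python
from typing import Tuple


class Counter:
    """Procedure with hidden state: the same call gives different results."""

    def __init__(self) -> None:
        self._count = 0  # hidden state, persisted between calls

    def next(self) -> int:
        self._count += 1
        return self._count


def next_pure(count: int) -> Tuple[int, int]:
    """Pure version: the state is an explicit argument and return value."""
    new_count = count + 1
    return new_count, new_count  # (output, new state)


counter = Counter()
impure_outputs = [counter.next(), counter.next()]

state = 0
first, state = next_pure(state)
second, state = next_pure(state)
pure_outputs = [first, second]
```

Both versions produce the same outputs; the difference is only whether the caller or the procedure carries the state between calls.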

gpm · 2025-07-07T02:00:24 1751853624

I agree. I'm suggesting that the language they are using is unintentionally misleading, not that they are factually wrong.

lostmsu · 2025-07-07T02:55:52 1751856952

This is wrong: intermediate activations are preserved when going forward.

ACCount36 · 2025-07-07T08:49:25 1751878165

Within a single forward pass, but not from one emitted token to another.

andy12_ · 2025-07-07T14:39:32 1751899172

What? No. The intermediate hidden states are preserved from one token to another. A token that is 100k tokens into the future will be able to look into the information of the present token's hidden state through the attention mechanism. This is why the KV cache is so big.
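For a rough sense of scale, here is the usual back-of-the-envelope arithmetic: one key and one value vector per layer, per KV head, per position. The configuration below is hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int) -> int:
    """Bytes needed to keep keys and values for every layer and position."""
    # The factor 2 accounts for storing both a key and a value.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


# Hypothetical configuration: 32 layers, 8 KV heads of size 128,
# 16-bit values, 100k tokens of context.
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                       seq_len=100_000, bytes_per_value=2)
total_gib = total / 2**30
```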

ACCount36 · 2025-07-08T09:55:37 1751968537

KV cache is just that: a cache.

The inference logic of an LLM remains the same. There is no difference in outcomes between recalculating everything and caching. The only difference is in the amount of memory and computation required to do it.
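A toy check of that equivalence, assuming a single attention step and a made-up deterministic `embed` function standing in for everything below it. `CachedModel` and `output_from_scratch` are hypothetical names.

```python
import math
from typing import List

Vector = List[float]


def embed(token: int) -> Vector:
    """Hypothetical fixed embedding, standing in for the lower layers."""
    return [math.sin(token + 1.0), math.cos(token + 1.0)]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def output_from_scratch(tokens: List[int]) -> Vector:
    """Recompute every state from the full token sequence, no cache."""
    states = [embed(t) for t in tokens]
    return attend(states[-1], states, states)


class CachedModel:
    """Same computation, but earlier states are kept instead of recomputed."""

    def __init__(self) -> None:
        self.cache: List[Vector] = []

    def step(self, token: int) -> Vector:
        h = embed(token)
        self.cache.append(h)
        return attend(h, self.cache, self.cache)


tokens = [3, 1, 4, 1, 5]
model = CachedModel()
cached = [model.step(t) for t in tokens]
recomputed = [output_from_scratch(tokens[: i + 1]) for i in range(len(tokens))]
```

The two lists agree at every position; the cached version just avoids calling `embed` on earlier tokens again.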

andy12_ · 2025-07-08T15:57:35 1751990255

The same can be said about any recurrent network. To predict the token n+1 you could recalculate the hidden state up to token n, or reuse the hidden state of token n from the previous forward pass. The only difference is the amount of memory and computation.

The thing is that, fundamentally, an auto-regressive transformer is a model whose state grows linearly with each token without compression, which is what gives it (theoretically) perfect recall.
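A small sketch of the contrast in state size, using made-up update rules: `rnn_step` and `transformer_step` are hypothetical stand-ins, not real layer implementations. The recurrent state stays the same size and overwrites old information, while the transformer-style cache gains one untouched entry per token.

```python
from typing import List

HIDDEN = 4  # hypothetical hidden size


def rnn_step(state: List[float], token: int) -> List[float]:
    """Fixed-size state: old information is folded in and partly lost."""
    return [0.5 * s + 0.1 * token for s in state]


def transformer_step(cache: List[List[float]], token: int) -> List[List[float]]:
    """Growing state: one new entry per token, old entries left as they were."""
    return cache + [[float(token)] * HIDDEN]


rnn_state = [0.0] * HIDDEN
cache: List[List[float]] = []
rnn_sizes, cache_sizes = [], []
for token in [3, 1, 4, 1, 5]:
    rnn_state = rnn_step(rnn_state, token)
    cache = transformer_step(cache, token)
    rnn_sizes.append(len(rnn_state))
    cache_sizes.append(sum(len(entry) for entry in cache))
```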

8note · 2025-07-07T00:14:02 1751847242

this sounds like a fun research area. do LLMs have plans about future tokens?

how do we get 100 tokens of completion, and not just one output layer at a time?

are there papers you've read that you can share that support the hypothesis? vs that the LLM doesn't have ideas about the future tokens when it's predicting the next one?

Zee2 · 2025-07-07T00:31:14 1751848274

This research has been done; it was a core pillar of the recent Anthropic paper on token planning and interpretability.

https://www.anthropic.com/research/tracing-thoughts-language...

See the section “Does Claude plan its rhymes?”

XenophileJKO · 2025-07-07T00:32:38 1751848358

Lol... Try building systems off them and you will very quickly learn concretely that they "plan".

It may not be as evident now as it was with earlier models. The models will fabricate the preconditions needed to output the final answer they "wanted".

I ran into this when using quasi least-to-most style structured output.