I think what the parent was trying to communicate (and what I'm thinking as well) is doubt about your premise in point 1 ("the model must be thinking beyond the next token").
Rephrase "The model is good at picking the correct article for the word it wants to output next" to "After having picked a specific article, the model is good at picking a follow-up noun that matches the chosen article". Nothing about the second statement seems like an unlikely feat for a model that only predicts one word at a time without any thinking ahead about specific words.
>I climbed up the pear tree and picked a pear. I climbed up the apple tree and picked
The argument made in the article (IMO an extremely convincing one) is that the model wouldn't be able to pick the word 'an' unless it had already, in some sense, settled on 'apple' as the following word. Otherwise, why not pick 'a'?
Rephrase "The model is good at picking the correct article for the word it wants to output next" to "After having picked a specific article, the model is good at picking a follow-up noun that matches the chosen article". Nothing about the second statement seems like an unlikely feat for a model that only predicts one word at a time without any thinking ahead about specific words.