Yeah, cause these are the kinds of very advanced things we'll use these models for in the wild. /s
It's strange that these tests are so common. Why would anyone think this is a good use of these models, or even a good proxy for more sophisticated "soft" tasks?
To me, a better test would probe memorization of long-tailed information that's scarce on the internet. Reasoning tests like this are so simple they could be programmed directly, or you could hook the LLM up to tools that solve them.
Much more interesting use cases for these models exist in the "soft" areas than in the "hard", "digital", "exact", "simple" ones.
I'd take an analogical over a logical model any day. Write a program for Sally.
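To make the programmability point concrete, here's a minimal sketch of the classic Sally riddle ("Sally has 3 brothers; each brother has 2 sisters; how many sisters does Sally have?"), assuming that standard phrasing — the function name and parameters are just illustrative:

```python
def sally_sisters(num_brothers: int, sisters_per_brother: int) -> int:
    # Every brother shares the same set of sisters, and Sally is one of
    # those sisters. So the total number of girls in the family equals
    # sisters_per_brother, and Sally's own sisters are all of them but her.
    total_girls = sisters_per_brother
    return total_girls - 1  # exclude Sally herself

print(sally_sisters(3, 2))  # -> 1
```

A few lines of arithmetic dispose of the whole puzzle class, which is exactly why it tells you so little about a model's "soft" capabilities.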