How are any of these a useful path to asking an AI to cook dinner?
We already know of many tasks that most humans can do relatively easily, yet most people don’t expect AI to be able to do them for years to come (for instance, L5 self-driving). ARC-AGI appears to be going in the opposite direction: can these models pass tests that are difficult for the average person to pass?
These benchmarks are interesting in that they show the increasing capabilities of the models. But they seem to be far less useful for determining whether a model is AGI than the simple benchmarks we’ve had all along (can these models do everyday tasks that a human can do?).