What all the top models do is recombine, at test time, the knowledge they already have. So they all possess Core Knowledge priors. Techniques to acquire those priors vary:
* Use a pretrained LLM and hope that the relevant programs have been memorized via exposure to text data (this doesn't work that well)
* Pretrain an LLM on ARC-AGI-like data
* Hardcode the priors into a DSL (see the sketch below)
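To make the last option concrete: a prior-encoding DSL is just a fixed vocabulary of grid primitives that a solver composes per task. Here is a rough sketch of what that can look like, with invented primitive names (not taken from any published ARC DSL):

```python
# Illustrative only: a few grid primitives encoding Core-Knowledge-style
# concepts (geometry, objectness, elementary counting). The names and the
# choice of primitives are assumptions made for this sketch.
import numpy as np
from scipy.ndimage import label

def rotate90(grid: np.ndarray) -> np.ndarray:
    """Geometry prior: rigid 90-degree rotation."""
    return np.rot90(grid)

def mirror_lr(grid: np.ndarray) -> np.ndarray:
    """Geometry prior: left-right reflection."""
    return np.fliplr(grid)

def objects(grid: np.ndarray) -> list[np.ndarray]:
    """Objectness prior: connected non-background components."""
    labeled, n = label(grid != 0)
    return [np.where(labeled == i, grid, 0) for i in range(1, n + 1)]

def count_colors(grid: np.ndarray) -> int:
    """Number prior: how many distinct non-background colors appear."""
    return len(set(grid.ravel().tolist()) - {0})

# A DSL-based solver then searches for a composition of such primitives
# that is consistent with a task's demonstration pairs, e.g.
# candidate = lambda g: mirror_lr(rotate90(g))
```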
> Which is to say, a data augmentation approach
The key bit isn't the data augmentation but the TTT. TTT (test-time training) is a way to address the #1 issue with DL models: that they cannot recombine their knowledge at test time to adapt to something they haven't seen before (i.e., strong generalization). You can argue about whether TTT is the right way to achieve this, but there is no doubt that TTT is a major advance in this direction.
The top ARC-AGI models perform well not because they're trained on tons of data, but because they can adapt to novelty at test time (usually via TTT). For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy. This demonstrates empirically that ARC-AGI cannot be solved purely via memorization and interpolation.
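For concreteness, here is a minimal sketch of the per-task TTT loop, assuming a PyTorch-style model and a per-cell loss function; `test_time_train`, `augment`, and `loss_fn` are illustrative names for this sketch, not the code of any actual submission:

```python
# Minimal sketch of test-time training on a single ARC-style task.
# Assumptions: `model` is a torch.nn.Module mapping a 2D input grid tensor
# to a predicted output grid, and `loss_fn` compares prediction and target.
import copy
import torch

def augment(inp, out):
    """Augmented copies of a demo pair: the same rotation/flip applied to both grids."""
    pairs = []
    for k in range(4):                                   # 0/90/180/270-degree rotations
        ri, ro = torch.rot90(inp, k), torch.rot90(out, k)
        pairs.append((ri, ro))
        pairs.append((torch.flip(ri, dims=[1]), torch.flip(ro, dims=[1])))
    return pairs

def test_time_train(model, demo_pairs, loss_fn, steps=20, lr=1e-4):
    """Fine-tune a throwaway copy of the model on this one task's demo pairs."""
    tuned = copy.deepcopy(model)                         # base weights stay untouched
    opt = torch.optim.AdamW(tuned.parameters(), lr=lr)
    tuned.train()
    for _ in range(steps):
        for inp, out in demo_pairs:                      # the task's few demonstration pairs
            for aug_inp, aug_out in augment(inp, out):
                opt.zero_grad()
                loss = loss_fn(tuned(aug_inp), aug_out)
                loss.backward()
                opt.step()
    tuned.eval()
    return tuned      # used only to predict this one task's test output, then discarded
```

The ablation mentioned above (dropping the TTT component) amounts to skipping this step and predicting with the base model directly.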
Do you mean the ones from your white paper? The same ones that humans possess? How do you know this?
>> The key bit isn't the data augmentation but the TTT.
I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?
>> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.
> This demonstrates empirically that ARC-AGI cannot be solved purely via memorization and interpolation.
Now that the current challenge is over, and a successor dataset is in the works, can we see how well the leading LLMs perform against the private test set?
For example, Claude 3.5 gets 14% in semi-private eval vs 21% in public eval. I remember reading an explanation of "semi-private" earlier but cannot find it now.