Note that this isn't improving the LLM itself, but the software glue around it (i.e. agentic loops, tools, etc.). The fact that the same LLM got a ~20% increase on the aider leaderboard says more about aider as a collection of software glue than it does about the model.
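
For concreteness, here's a rough sketch of what I mean by "software glue": a fixed model wrapped in a loop that parses its output, runs tools, and feeds results back in. All the names here (`call_llm`, `run_tests`, the JSON tool-call format) are hypothetical, not aider's actual code; the point is only that improving this loop can move benchmark numbers without touching the model.

```python
import json

def run_tests(args):
    # Stand-in tool: pretend to run the project's test suite.
    return "2 passed, 1 failed: test_parse_edge_case"

TOOLS = {"run_tests": run_tests}

def call_llm(messages):
    # Placeholder for a real model call (API or local). Returns a canned
    # tool call first, then a plain-text answer once it has seen a tool result.
    if any("tool result" in m["content"] for m in messages):
        return "test_parse_edge_case fails on empty input; add a guard in parse()."
    return json.dumps({"tool": "run_tests", "args": {}})

def agent_loop(task, max_steps=5):
    # The "glue": keep calling the same model, executing any tool it asks for,
    # and appending the tool output back into the conversation.
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # plain-text answer, loop ends
        tool = TOOLS.get(action.get("tool"))
        if tool is None:
            return reply
        result = tool(action.get("args", {}))
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"tool result: {result}"})
    return "step budget exhausted"

if __name__ == "__main__":
    print(agent_loop("Fix the failing parser test."))
```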
I do wonder, though, whether big labs are running this with model-training episodes as well.