To be somewhat fair, their numeracy scatter plot looks more like a dart board than any real process. The fact that you got a positive-sloping line out of the regression at all has more to do with the positions of the outliers than anything else. With the language aptitude plot, you're at least able to examine it visually and see an up-and-to-the-right connection between the two variables.
I agree, but also, via some napkin math, the chance of getting results that different from noise alone is something like 40%.
(I.e., if you sample the same signal twice at the sample size they used in the study, there's roughly a 40% chance the second result lands more than 0.04 away from the first, which is the gap between numeracy and language.)
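
For anyone who wants to redo the napkin math, here is a rough simulation sketch. It assumes the 0.04 gap is measured in the Pearson correlation coefficient and that both variables are roughly normal; the comment above doesn't say which statistic or sample size was used, so `n` and `rho` are placeholders you'd have to fill in from the study, and the function names are made up for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sample_r(rng, n, rho):
    """Draw n points with true correlation rho and return the sample r."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    # y = rho*x + sqrt(1 - rho^2)*noise has correlation rho with x.
    scale = math.sqrt(1 - rho ** 2)
    ys = [rho * x + scale * rng.gauss(0, 1) for x in xs]
    return pearson_r(xs, ys)


def prob_gap_exceeds(n, rho, gap=0.04, trials=2000, seed=0):
    """Estimate P(|r1 - r2| > gap) for two independent samples of size n
    drawn from the same underlying signal (assumed bivariate normal)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        r1 = sample_r(rng, n, rho)
        r2 = sample_r(rng, n, rho)
        if abs(r1 - r2) > gap:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Placeholder values, NOT taken from the study.
    print(prob_gap_exceeds(n=40, rho=0.3))
```

The answer depends heavily on `n`: with small samples, two draws from the same signal will differ by more than 0.04 most of the time, and the probability only shrinks as the sample gets large.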