Note that the benchmarks used for comparison mostly measure the model's ability to understand financial content. In other words, they test English reading comprehension, just in a specific domain. It shouldn't really be surprising that a strong generalist model performs well there.
On the other hand, GPT-4 actually did worse than their finetuned model on the NER task (labelling and tagging entities mentioned in the text). I assume the finetuned model was better at producing the specific labels they were targeting.
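To make the label-mismatch point concrete, here is a minimal sketch of how strict NER scoring can penalize a generalist model that finds the right spans but names them with its own label vocabulary. The metric, label names, and examples are hypothetical illustrations, not taken from the benchmark itself.

```python
def ner_f1(predicted, gold):
    """Strict span+label F1: a prediction only counts if both the
    text span and the label string exactly match a gold entity."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("Bloomberg", "ORG"), ("New York", "LOC")]

# A finetuned model has learned the dataset's exact label inventory.
finetuned = [("Bloomberg", "ORG"), ("New York", "LOC")]

# A generalist model may tag the same spans with its own label names.
generalist = [("Bloomberg", "COMPANY"), ("New York", "CITY")]

print(ner_f1(finetuned, gold))   # 1.0
print(ner_f1(generalist, gold))  # 0.0
```

Under this kind of exact-match scoring, the generalist gets zero credit despite identifying the right entities, which is one plausible mechanism for the gap.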