Note that the benchmarks used for comparison mostly measure the model's ability to understand financial content. In other words, they test English reading comprehension, just in a specific domain. It shouldn't really be surprising that a strong generalist model performs well there.
On the other hand, GPT-4 actually did worse than their finetuned model on the NER task (labelling and tagging entities mentioned in the text). I assume the finetuned model was better at producing the specific labels they were targeting.
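To make the label-mismatch point concrete, here is a minimal sketch of how strict NER scoring can penalize a generalist model that finds the right spans but names them with its own label vocabulary. The metric, label names, and examples are hypothetical illustrations, not taken from the benchmark itself.

```python
def ner_f1(predicted, gold):
    """Strict span+label F1: a prediction only counts if both the
    text span and the label string exactly match a gold entity."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("Bloomberg", "ORG"), ("New York", "LOC")]

# A finetuned model has learned the dataset's exact label inventory.
finetuned = [("Bloomberg", "ORG"), ("New York", "LOC")]

# A generalist model may tag the same spans with its own label names.
generalist = [("Bloomberg", "COMPANY"), ("New York", "CITY")]

print(ner_f1(finetuned, gold))   # 1.0
print(ner_f1(generalist, gold))  # 0.0
```

Under this kind of exact-match scoring, the generalist gets zero credit despite identifying the right entities, which is one plausible mechanism for the gap.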