
So they tested using training examples? Lmao


> held out


Actually in this case that's not exactly true:

> generation of 281,128 augmented examples

All examples are already correlated because they were generated in the same way.


> All examples are already correlated because they were generated in the same way.

All examples of “document information extraction” would be correlated no matter where they come from because they all would be “document information extraction” examples…

The real question is whether or not the examples are representative of the broad “document information extraction” use-case.


The problem is the methodology used to hold them out. For a truly independent validation set, they need to hold out the material before augmentation, not after. If you hold out after augmentation, the validation set already carries the biases of the augmentation pipeline, so you artificially inflate your model's measured performance. That is not sufficient to demonstrate the model generalizes properly.

By analogy: instead of taking leaves off different trees, they are taking leaves from different branches of the same tree.
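
Concretely, a minimal Python sketch of the two split strategies. augment(), the toy corpus, and the 90/10 ratio are hypothetical stand-ins, not the paper's actual pipeline:

    import random

    def augment(doc):
        # Stand-in for whatever augmentation pipeline produced the
        # 281,128 examples; the real transformations are unknown here.
        return [f"{doc} (variant {i})" for i in range(3)]

    # Stand-in raw corpus: in practice, the original source documents.
    docs = [f"source document {i}" for i in range(100)]
    random.seed(0)

    # Leaky: augment first, then split. Variants of the same source
    # document can land in both train and validation, so the validation
    # score partly measures how well the model absorbed the
    # augmentation pipeline's quirks.
    augmented = [a for d in docs for a in augment(d)]
    random.shuffle(augmented)
    cut = int(0.9 * len(augmented))
    train_leaky, val_leaky = augmented[:cut], augmented[cut:]

    # Independent: split the source documents first, then augment only
    # the training split. No validation example shares a source
    # document (a "tree") with any training example.
    random.shuffle(docs)
    cut = int(0.9 * len(docs))
    train_docs, val_docs = docs[:cut], docs[cut:]
    train = [a for d in train_docs for a in augment(d)]
    val = list(val_docs)  # ideally human-labelled originals, unaugmented

Splitting on source documents before augmenting is the same idea as group-wise cross-validation (e.g. scikit-learn's GroupKFold): the unit of independence is the tree, not the leaf.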


That would definitely make the evaluation more robust. My fear is that, with LLMs at hand, people have become allergic to preparing good human-labelled evaluation sets and will always, to some degree, use an LLM as a crutch.


I would agree with that



