
> What's also scary is that we know LLMs do fail, but nobody (even the people who wrote the LLM) can tell you how often it will fail at any particular task. Not even an order of magnitude. Will it fail 0.2%, 2%, or 20% of the time?

Benchmarks could track that too - I don't know if they do, but that information should actually be available and easy to get.

When models are scored on e.g. "pass@10", i.e. at least one of 10 sampled attempts passes the challenge, and the benchmark is rerun periodically, that literally produces the information you're asking for: how frequently a given model fails at a particular task.
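
For reference, here's a minimal sketch of the standard unbiased pass@k estimator (the one used in the HumanEval/Codex evaluation): run n attempts per task, count c passes, and estimate the chance that at least one of k samples succeeds. The per-attempt failure rate the parent is asking about is just 1 - c/n; the numbers in the example are made up for illustration.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of pass@k given n attempts with c successes.

        Probability that at least one of k samples drawn from the n
        attempts is correct: 1 - C(n-c, k) / C(n, k).
        """
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # e.g. 200 attempts, 150 passed: per-attempt failure rate is 25%,
    # but pass@10 is essentially 1.0
    print(1 - 150 / 200, pass_at_k(200, 150, 10))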

> A computer that will randomly produce an incorrect result to my calculation is useless to me because now I have to separately validate the correctness of every result.

For many tasks, validating a solution is orders of magnitude easier and cheaper than finding the solution in the first place. For those tasks, LLMs are very useful.
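
To make that concrete, here's a hedged sketch: suppose the "solution" is a regex the model proposed for matching ISO-8601 dates. Checking it against a handful of known good and bad inputs takes milliseconds; writing and debugging the pattern by hand is the slow part. The pattern and examples below are illustrative, not output from any particular model.

    import re

    # Stand-in for an LLM-proposed pattern (illustrative only).
    candidate_pattern = r"^\d{4}-\d{2}-\d{2}$"

    should_match = ["2024-01-31", "1999-12-01"]
    should_reject = ["2024-1-31", "20240131", "not a date"]

    def validate(pattern, positives, negatives):
        # Cheap, independent check: run the pattern over known examples.
        rx = re.compile(pattern)
        return (all(rx.match(s) for s in positives)
                and not any(rx.match(s) for s in negatives))

    print(validate(candidate_pattern, should_match, should_reject))  # True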

> If I need to ask an LLM to explain to me some fact, how do I know if this time it's hallucinating? There is no "LLM just guessed" flag in the output. It might seem to people to be "miraculous" that it will summarize a random scientific paper down to 5 bullet points, but how do you know if its output is correct? No LLM proponent seems to want to answer this question.

How can you be sure that a human you're asking isn't hallucinating, guessing the answer, or straight up bullshitting you? Apply the same approach to LLMs as you apply to navigating this problem with humans - for example, don't ask it to solve high-consequence problems in areas where you can't evaluate proposed solutions quickly.



> For many tasks, validating a solution is orders of magnitude easier and cheaper than finding the solution in the first place.

A good example that I use frequently is a reverse dictionary.

It's also useful for suggesting edits to text that I have written. It's easy for me to read its suggestions and accept/reject them.
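
For the reverse-dictionary case, verification is basically just reading real definitions back. A rough sketch of that loop, assuming WordNet via NLTK is available, with the candidate list standing in for hypothetical model suggestions:

    # Requires: pip install nltk; then nltk.download("wordnet") once.
    from nltk.corpus import wordnet

    description = "deliberately damage something to hinder an effort"
    candidates = ["sabotage", "vandalize", "undermine"]  # hypothetical LLM output

    for word in candidates:
        for synset in wordnet.synsets(word):
            print(f"{word}: {synset.definition()}")
    # Skimming a few definitions to pick the right word is far cheaper
    # than trying to recall or search for the word from the description alone.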



