Could you not detect likely hallucinations by running the same prompt multiple times across different models and looking at the vector divergence between the outputs? Kind of like an agreement check between, say, GPT, Llama, and other models: if the outputs diverge, that's a signal saying "yes, this is likely a hallucination".
It's not 100%, but it's enough to basically say to the human: "hey, look at this".
You can do it, and it's a good way of doing that: in our experiments it catches most errors. You don't even need different models; even the same model (I don't mean asking "are you sure?", just re-running the same workflow) will give you nice results. The only problem is that it's super expensive to run on all your traces, so I wouldn't recommend it as a monitoring tool.
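A minimal sketch of the idea: run the same prompt several times, embed each output, and flag the case for human review when the answers diverge too much from one another. The bag-of-words embedding, the `flag_for_review` helper, and the 0.8 threshold here are all illustrative assumptions; in practice you'd use a real sentence-embedding model and tune the threshold on your own traces.

```python
import math
import re
from collections import Counter
from itertools import combinations

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" so the sketch is self-contained.
    # A real setup would use a sentence-embedding model instead.
    return Counter(re.sub(r"[^\w\s]", "", text.lower()).split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_for_review(outputs: list[str], threshold: float = 0.8) -> bool:
    # Outputs from re-running the same prompt (same or different models).
    # Low mean pairwise similarity = divergence = "hey, look at this".
    sims = [cosine(embed(x), embed(y)) for x, y in combinations(outputs, 2)]
    return (sum(sims) / len(sims)) < threshold

agree = ["Canberra is the capital."] * 3
diverge = ["It is Canberra.", "It is Sydney.", "I'm not sure, possibly Melbourne."]
print(flag_for_review(agree))    # False: outputs agree, no flag
print(flag_for_review(diverge))  # True: outputs diverge, flag for a human
```

Note this is exactly why it's expensive as a monitoring tool: every trace you check costs N extra model calls plus the embedding step.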