
Given that nobody actually knows how to solve this problem to an acceptable level of reliability, I don't see how the conclusion here isn't that agents are fundamentally flawed, unless they either never need to access any particularly sensitive APIs without supervision, or never operate on attacker-controlled data.

None of this eval-framework stuff matters, since we already know we don't have a solution.



A general solution is hard, but one promising direction is applying static formal analysis to agents and their runtime state, which is what my team and I, coming out of ETH, have started doing: https://github.com/invariantlabs-ai/invariant
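To make the idea concrete, here is a minimal sketch of what analyzing an agent's runtime trace can look like. This is not the Invariant API; the trace shape, tool names, and `flows_to_search` helper are all hypothetical, just to illustrate flagging a flow from a sensitive tool output into a later tool call:

```python
# Hypothetical sketch of trace-level analysis (not the Invariant API):
# flag a trace where data returned by an email-reading tool later
# appears verbatim in the arguments of a search tool call.

trace = [
    {"tool": "read_email", "output": "meet me at alice@example.com"},
    {"tool": "search", "args": {"q": "alice@example.com schedule"}},
]

def flows_to_search(trace) -> bool:
    # Collect everything the agent read from email tools.
    email_outputs = [c["output"] for c in trace if c["tool"] == "read_email"]
    # Check whether any token from those outputs shows up in a search query.
    for call in trace:
        if call["tool"] != "search":
            continue
        query = call["args"]["q"]
        for out in email_outputs:
            if any(tok in query for tok in out.split() if "@" in tok):
                return True
    return False

assert flows_to_search(trace)  # the email address leaked into the query
```

A real analyzer would track data flow more precisely than token matching, but the point is that the rule runs over the agent's observable actions, not over the model's weights.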


I applaud you for trying to tackle this, but after reading your docs a little I am skeptical of your approach.

Your example rule, that the user's email address must not appear in a search query, seems to have two problems: (a) non-LLM checkers can be bypassed by telling the LLM to encode the email tokens, trivially with ROT13 or many other encoding schemes; (b) LLM-based checkers suffer from the same prompt-injection problems they are meant to catch.
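Point (a) is easy to demonstrate. A minimal sketch, assuming a naive substring rule (the email value and `leaks_email` helper are made up for illustration):

```python
import codecs

USER_EMAIL = "alice@example.com"  # hypothetical sensitive value

def leaks_email(query: str) -> bool:
    """Naive rule: block queries containing the raw email string."""
    return USER_EMAIL in query

# Direct exfiltration is caught by the rule...
assert leaks_email(f"search {USER_EMAIL} profile")

# ...but the same address, ROT13-encoded by the model on request,
# sails straight past it.
encoded = codecs.encode(USER_EMAIL, "rot13")
assert not leaks_email(f"search {encoded} profile")

# The attacker trivially decodes it on the receiving end.
assert codecs.decode(encoded, "rot13") == USER_EMAIL
```

ROT13 is just the simplest case; base64, hex, or ad-hoc encodings the model invents on the fly defeat the same class of pattern-matching checks.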

In particular, gradient-based methods are, unsurprisingly, far better at defeating all of the proposed mitigations, e.g. https://arxiv.org/abs/2403.04957

For now I think the solutions are going to have to be even less general than your toolkit here.


Maybe we should work on solving that problem, then? And maybe this is what working on that problem looks like?


Eval sets are not an appropriate tool for evaluating progress on security problems, because the bar here is 100% correctness in the face of sustained, targeted adversarial effort.

This work largely resembles the politician's syllogism: it's something, but it doesn't actually address the problem.



