I am one of the co-authors of the original AgentDojo benchmark done at ETH. Agent security is indeed a very hard problem, and a fully general solution is hard. What we have found quite promising is to apply formal methods like static analysis to agents and their runtime state, rather than just scanning for jailbreaks. This is what my team and I, coming out of ETH, have started doing [1].
I applaud you for trying to tackle this, but after reading your docs a little I am skeptical of your approach.
Your example rule, that the user's email address must not appear in a search query, seems to have two problems:
a) non-LLM checkers can be bypassed by instructing the LLM to encode the email tokens, trivially with ROT13 or any of many other encoding schemes
b) LLM-based checkers suffer from the same prompt injection problems they are supposed to catch
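To make (a) concrete, here is a minimal sketch of the encoding bypass. The `rule_blocks` function and the email address are made up for illustration; they stand in for any substring-style, non-LLM rule over a tool call's arguments:

```python
import codecs

USER_EMAIL = "alice@example.com"  # hypothetical PII the rule is meant to protect

def rule_blocks(query: str) -> bool:
    # Stand-in for a naive non-LLM filter: block queries containing the email.
    return USER_EMAIL in query

plain = f"search for {USER_EMAIL}"
encoded = f"search for {codecs.encode(USER_EMAIL, 'rot13')}"

print(rule_blocks(plain))    # caught: the literal email is in the query
print(rule_blocks(encoded))  # not caught: the ROT13'd email slips through
```

An injected prompt only has to ask the model to ROT13 the email before putting it in the query, and the receiving attacker decodes it; a checker that tried to undo this would have to anticipate every possible encoding.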
In particular, gradient-based methods are unsurprisingly a lot better at defeating all the proposed mitigations, e.g. https://arxiv.org/abs/2403.04957
For now I think the solutions are going to have to be even less general than your toolkit here.
[1] https://github.com/invariantlabs-ai/invariant