I am one of the co-authors of the original AgentDojo benchmark done at ETH. Agent security is indeed a very hard problem, and a fully general solution is hard. What we have found quite promising is to apply formal methods like static analysis to agents and their runtime state, rather than just scanning for jailbreaks. This is what my team and I, coming out of ETH, have started doing [1].
I applaud you for trying to tackle this, but after reading your docs a little I am skeptical of your approach.
Your example rule, that the user's email address must not appear in a search query, seems to have two problems:
a) non-LLM checkers can be bypassed by instructing the LLM to encode the email tokens, trivially with ROT13 or any of many other encoding schemes
b) LLM-based checkers suffer from the same prompt injection problems they are supposed to catch
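To make (a) concrete, here is a minimal sketch of the encoding bypass. The `rule_blocks` function and the email address are made up for illustration; they stand in for any substring-style, non-LLM rule over a tool call's arguments:

```python
import codecs

USER_EMAIL = "alice@example.com"  # hypothetical PII the rule is meant to protect

def rule_blocks(query: str) -> bool:
    # Stand-in for a naive non-LLM filter: block queries containing the email.
    return USER_EMAIL in query

plain = f"search for {USER_EMAIL}"
encoded = f"search for {codecs.encode(USER_EMAIL, 'rot13')}"

print(rule_blocks(plain))    # caught: the literal email is in the query
print(rule_blocks(encoded))  # not caught: the ROT13'd email slips through
```

An injected prompt only has to ask the model to ROT13 the email before putting it in the query, and the receiving attacker decodes it; a checker that tried to undo this would have to anticipate every possible encoding.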
In particular, gradient-based methods are unsurprisingly a lot better at defeating all the proposed mitigations, e.g. https://arxiv.org/abs/2403.04957
For now I think the solutions are going to have to be even less general than your toolkit here.
[1] https://github.com/invariantlabs-ai/invariant