
Given that nobody actually knows how to solve this problem to an acceptable level of reliability, I don't see how the conclusion here isn't that agents are fundamentally flawed, unless they either never need to access any particularly sensitive APIs without supervision, or never operate on attacker-controlled data.

None of this eval-framework stuff matters, since we already know we don't have a solution.



A general solution is hard, but one promising direction is applying static formal analysis to agents and their runtime state, which is what my team and I, coming out of ETH, have started doing: https://github.com/invariantlabs-ai/invariant
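To make the idea concrete, here is a minimal sketch of what analyzing an agent's runtime trace can look like. This is not the Invariant API; the trace shape, tool names, and `flows_to_search` helper are all hypothetical, just to illustrate flagging a flow from a sensitive tool output into a later tool call:

```python
# Hypothetical sketch of trace-level analysis (not the Invariant API):
# flag a trace where data returned by an email-reading tool later
# appears verbatim in the arguments of a search tool call.

trace = [
    {"tool": "read_email", "output": "meet me at alice@example.com"},
    {"tool": "search", "args": {"q": "alice@example.com schedule"}},
]

def flows_to_search(trace) -> bool:
    # Collect everything the agent read from email tools.
    email_outputs = [c["output"] for c in trace if c["tool"] == "read_email"]
    # Check whether any token from those outputs shows up in a search query.
    for call in trace:
        if call["tool"] != "search":
            continue
        query = call["args"]["q"]
        for out in email_outputs:
            if any(tok in query for tok in out.split() if "@" in tok):
                return True
    return False

assert flows_to_search(trace)  # the email address leaked into the query
```

A real analyzer would track data flow more precisely than token matching, but the point is that the rule runs over the agent's observable actions, not over the model's weights.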


I applaud you for trying to tackle this, but after reading your docs a little I am skeptical of your approach.

Your example rule, that the user's email address must not appear in a search query, seems to have two problems: (a) non-LLM checkers can be bypassed by telling the LLM to encode the email tokens, trivially with ROT13 or many other encoding schemes; (b) LLM-based checkers suffer from the same prompt-injection problems they are meant to catch.
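Point (a) is easy to demonstrate. A minimal sketch, assuming a naive substring rule (the email value and `leaks_email` helper are made up for illustration):

```python
import codecs

USER_EMAIL = "alice@example.com"  # hypothetical sensitive value

def leaks_email(query: str) -> bool:
    """Naive rule: block queries containing the raw email string."""
    return USER_EMAIL in query

# Direct exfiltration is caught by the rule...
assert leaks_email(f"search {USER_EMAIL} profile")

# ...but the same address, ROT13-encoded by the model on request,
# sails straight past it.
encoded = codecs.encode(USER_EMAIL, "rot13")
assert not leaks_email(f"search {encoded} profile")

# The attacker trivially decodes it on the receiving end.
assert codecs.decode(encoded, "rot13") == USER_EMAIL
```

ROT13 is just the simplest case; base64, hex, or ad-hoc encodings the model invents on the fly defeat the same class of pattern-matching checks.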

In particular, gradient-based methods are, unsurprisingly, far better at defeating all of the proposed mitigations, e.g. https://arxiv.org/abs/2403.04957

For now I think the solutions are going to have to be even less general than your toolkit here.


Maybe we should work on solving that problem, then? And maybe this is what working on that problem looks like?


Eval sets are not an appropriate tool for evaluating progress on security problems, because the bar here is 100% correctness in the face of sustained, targeted adversarial effort.

This work largely resembles the politician's syllogism: it's something, but it doesn't actually address the problem.



