GPT-5.4 is already an incredible model for code reviews and security audits with the swival.dev /audit command.
The fact that GPT-5.5 is apparently even better at long-running tasks is very exciting. I don’t have access to it yet, but I’m really looking forward to trying it.
I really like local models for code reviews / security audits.
Even if they don't run super fast, I can let them work overnight and get comprehensive reports in the morning.
I used Qwen3.6-27B on an M5 (oq8, using omlx) with the Swival (https://swival.dev) /audit command on small codebases I use for benchmarking models on security audits.
It found 8 out of 10, which is excellent for a local model, produced valid patches, and, even better, didn't report any false positives.
It’s a complete mess, and the hardest part of this kind of tool is maintenance.
It’s not just about incompatible APIs, but also about how messages are structured. Even getting reliable tool calling requires a significant amount of work and testing for each individual model.
Just look at LiteLLM’s commit history and open issues/PRs. They’re still struggling with reliable multi-turn tool calling for Gemini, Kimi requires hardcoded rules (so K2.6 is currently unsupported because it’s not on the list), and so on.
Implementing the basic, generic OpenAI/Anthropic protocols is trivial, and at that point it almost feels like building an AI gateway is done. But it isn’t — that’s just the beginning of a long journey of constantly dealing with bugs, changes, and the quirks of each provider and model.
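To make the "trivial at first, messy later" point concrete, here's a minimal sketch of that easy generic layer: translating one internal message list into OpenAI-style and Anthropic-style request bodies. The function names and model strings are just illustrative; the divergence shown (system prompt placement, Anthropic's required max_tokens) is exactly the kind of per-provider quirk that multiplies once tool calls and multi-turn histories enter the picture.

```python
# Hypothetical sketch of a gateway's request translation layer.
# `to_openai` / `to_anthropic` are made-up names for illustration.

def to_openai(system: str, messages: list[dict]) -> dict:
    # OpenAI-style: the system prompt is just another message in the list.
    return {
        "model": "gpt-4o",
        "messages": [{"role": "system", "content": system}, *messages],
    }

def to_anthropic(system: str, messages: list[dict]) -> dict:
    # Anthropic-style: the system prompt is a separate top-level field,
    # and max_tokens is a required parameter.
    return {
        "model": "claude-sonnet-4-5",
        "system": system,
        "max_tokens": 1024,
        "messages": messages,
    }

msgs = [{"role": "user", "content": "Audit this function for SQL injection."}]
print(to_openai("You are a security auditor.", msgs)["messages"][0]["role"])
print(to_anthropic("You are a security auditor.", msgs)["system"])
```

That part really is a few dozen lines. The long tail is everything this sketch omits: tool-call shapes, streaming deltas, and each model's undocumented expectations about message ordering.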
Loading third-party agents in a sandbox with full custom model support. Right now you need to either run that code directly (super dangerous), use a VM/container (slow and complicated), or embed an interpreter like Lua (language-bound, slow, and weak security). WASM is perfect for this: it's almost native speed, built for security, and language-neutral. ONNX and CoreML are secure, but they can only run the actual model, not all the code around it.