ottah · 2026-02-17T00:28:33 1771288113

Until we can do reinforcement learning in a reasonable approximate model of the real world, I don't see AI getting substantially better. We're seeing a lot of refinement of capabilities, but everything is still mostly supervised or limited semi-supervised learning.

ottah · 2026-02-16T22:24:47 1771280687

This is perfect, I've been looking for something like this for my home network. Tailscale requires too much trust and is only partly open source. DIY WireGuard works, but Comcast has started messing with packets, and our IP changes a lot. A self-hosted VPN that I can put on a VPS to bridge consumer ISP and public networks is a lot easier to trust.

ottah · 2026-02-15T21:42:39 1771191759

Next they'll try to make Tor and I2P illegal.

ottah · 2026-02-14T05:16:40 1771046200

Well, if a WASM process is limited in the syscalls it can make, the blast radius is limited. For example, you can block network access and disk access for tools that don't need those capabilities.
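A toy sketch of that deny-by-default idea, written in plain Python rather than against a real WASM runtime. `ToolSandbox`, `CapabilityError`, and the capability names are made up for illustration; a real runtime would have to enforce this at the host boundary, not inside the tool's own process.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, TypeVar

T = TypeVar("T")


class CapabilityError(PermissionError):
    """Raised when a tool asks for a capability it was never granted."""


@dataclass(frozen=True)
class ToolSandbox:
    """Hypothetical stand-in for a sandboxed tool: deny by default."""

    name: str
    granted: FrozenSet[str] = frozenset()

    def call(self, capability: str, action: Callable[[], T]) -> T:
        # Only run the host-side action if the capability was granted up front.
        if capability not in self.granted:
            raise CapabilityError(f"{self.name}: {capability!r} not granted")
        return action()


# A formatter tool only needs to read files, so that is all it gets.
formatter = ToolSandbox("formatter", frozenset({"fs.read"}))
contents = formatter.call("fs.read", lambda: "file contents")
```

Asking the same `formatter` for `"net.connect"` raises `CapabilityError` before the action runs, which is the "limited blast radius" part.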

That being said, this doesn't sound like they're really thinking through the risks.

> Dynamic Tool Building - Describe what you need, and IronClaw builds it as a WASM tool

If the agent can write its own insecure plugins, and the WASM processes aren't properly isolated, you've really gained nothing.

itissid · 2026-02-14T14:55:59 1771080959

Even if it is isolated, with no network or host access, say the malicious prompt created a WASM tool that patched your project code to leak information, for example by adding a logger.warning call. But LOG_LEVEL was set to error (or whatever), which prevented this from surfacing during testing or dev/beta.

Again, running that code in the WASM container does not reveal anything. But then another isolated WASM tool is responsible for building the binary and shipping it to prod.

The leaked data, shotgunned all over the prod logs, is spotted by a log watcher within minutes of deploy. Whew... right?

But you are already screwed.
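A minimal sketch of the log-level half of that scenario, using Python's standard `logging` module. The function name and the leaked value are hypothetical; the point is only that the same injected line is silent at ERROR and visible at WARNING.

```python
import io
import logging


def run_injected_code(level: int) -> str:
    """Run the 'patched' code with a given log level and return what was logged."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.level{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        # The line a malicious tool might have slipped into the project.
        logger.warning("user_token=%s", "s3cr3t")
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()


dev_output = run_injected_code(logging.ERROR)     # dev/beta: LOG_LEVEL=error
prod_output = run_injected_code(logging.WARNING)  # prod: warnings are emitted
```

Here `dev_output` is empty while `prod_output` contains the token, so nothing in testing would have flagged the change.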

ottah · 2026-02-13T18:30:14 1771007414

You hope? I'm struggling to read this properly. You're not wanting the outcome to be tyranny, right?

ottah · 2026-02-11T03:03:39 1770779019

We will still have flu vaccines, just not this vaccine.

altcognito · 2026-02-11T04:37:18 1770784638

The question isn't whether or not we have vaccines, it is whether or not we have the most effective vaccines.

alex43578 · 2026-02-11T07:34:32 1770795272

It’s a good thing the specific criticism of this trial is that they didn’t use the most effective vaccine for 65+ people, since you’re concerned about having the most effective vaccines.

altcognito · 2026-02-12T16:54:14 1770915254

How do you know if you don’t do a study?

You can call anything a criticism, but it doesn’t make it true.

watwut · 2026-02-11T11:58:24 1770811104

So, the majority of us, people under 65, are completely unaffected. And yes, the vaccine can be approved for 65- while not approved for 65+.

alex43578 · 2026-02-11T13:36:58 1770817018

The fall 2025 approval was limited to 65+/preexisting conditions.

If this vaccine wasn’t being tested for 65+, it might not be approved at all based on that.

ajb · 2026-02-11T07:39:50 1770795590

Older flu vaccines become less effective, as there are many flu strains and the dominant one changes. A different flu vaccine is recommended every year.

ottah · 2026-02-11T18:33:34 1770834814

Let's just be as plain as possible because online commenters are some of the most obtuse people.

1. Vaccines are good, everyone should get the fucking flu shot

2. There will be new vaccines targeting current strains of influenza available this season from other manufacturers using older methods

3. I have no fucking clue if an mRNA flu vaccine is good or bad, but I also don't care

4. I get that mRNA vaccines can be developed faster and we might be better served, but we are not losing existing capabilities

5. If we are short on vaccines produced with older methods that is likely poor business planning and not an actual technical limitation

6. I hate you

ottah · 2026-02-09T23:39:29 1770680369

Or lack of design aesthetic. It has no opinions except for safe ones.

ottah · 2026-02-06T15:47:14 1770392834

Growth is all that matters. There is perceived to be much less potential growth in retail than there is in tech. You have to remember, most people literally think of computers in magical terms, and what's possible is usually more anchored by what they see in movies than what they experience in real life. So believing that Sam Altman is going to manage to capture all economic output of labor is seen as a realistic belief. Amazon replacing all retail in the world is seen as something that is obviously never going to happen.

ottah · 2026-02-05T19:03:38 1770318218

Honestly my job is to ensure code quality and to protect the customer. I love working with claude code, it makes my life easier, but in no way would a team of agents improve code quality or speed up development. I would spend far too much time reviewing and fixing laziness and bad design decisions.

When you hear execs talking about AI, it's like listening to someone talk about how they bought some magic beans that will solve all their problems. IMO the only thing we have managed to do is spend a lot more money on accelerated compute.

ottah · 2026-02-05T18:56:47 1770317807

I absolutely cannot trust Claude code to independently work on large tasks. Maybe other people work on software that's not significantly complex, but for me to maintain code quality I need to guide more of the design process. Teams of agents just sounds like adding a lot more review and refactoring that can just be avoided by going slower and thinking carefully about the problem.

nickstinemates · 2026-02-05T20:27:49 1770323269

You write a generic architecture document on how you want your code base to be organized, when to use pattern x vs pattern y, examples of what that looks like in your code base, and you encode this as a skill.

Then, in your prompt you tell it the task you want, then you say, supervise the implementation with a sub agent that follows the architecture skill. Evaluate any proposed changes.

There are people who maximize this, and this is how you get things like teams. You make agents for planning, design, qa, product, engineering, review, release management, etc. and you get them to operate and coordinate to produce an outcome.

That's what this is supposed to be, encoded as a feature instead of a best practice.

satellite2 · 2026-02-05T20:30:50 1770323450

Aren't you just moving the problem a little bit further? If you can't trust it will implement carefully specified features, why would you believe it would properly review those?

frde_me · 2026-02-05T21:41:05 1770327665

It's hard to explain, but I've found LLMs to be significantly better in the "review" stage than the implementation stage.

So the LLM will do something and not catch at all that it did it badly. But the same LLM, asked to review against the same starting requirement, will catch the problem almost always.

The missing thing in these tools is that automatic feedback loop between the two LLMs: one in review mode, one in implementation mode.
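A rough sketch of what such a loop could look like. The two callables stand in for whatever LLM calls you would actually make; `implement_with_review`, its parameters, and the fake implementer/reviewer are all hypothetical, not any tool's real API.

```python
from typing import Callable, List, Tuple

Implementer = Callable[[str, str], str]     # (requirement, feedback) -> draft
Reviewer = Callable[[str, str], List[str]]  # (requirement, draft) -> issues


def implement_with_review(
    requirement: str,
    implement: Implementer,
    review: Reviewer,
    max_rounds: int = 3,
) -> Tuple[str, bool, int]:
    """Alternate implement and review until the reviewer has no issues."""
    feedback = ""
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = implement(requirement, feedback)
        # The reviewer only sees the original requirement and the draft.
        issues = review(requirement, draft)
        if not issues:
            return draft, True, round_no
        feedback = "\n".join(issues)
    return draft, False, max_rounds  # gave up: needs a human


# Stand-ins for the two model calls, just to exercise the loop.
def fake_implement(requirement: str, feedback: str) -> str:
    return "def add(a, b): return a + b" if feedback else "def add(a, b): pass"


def fake_review(requirement: str, draft: str) -> List[str]:
    return [] if "return a + b" in draft else ["add() does not return the sum"]


draft, approved, rounds = implement_with_review(
    "add two numbers", fake_implement, fake_review
)
```

The `max_rounds` cap is a design choice: without it, an implementer and reviewer that never agree would burn tokens indefinitely.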

resonious · 2026-02-05T21:51:18 1770328278

I've noticed this too and am wondering why this hasn't been baked into the popular agents yet. Or maybe it has and it just hasn't panned out?

bashtoni · 2026-02-05T21:56:42 1770328602

Anecdotally I think this is in Claude Code. It's pretty frequent to see it implement something, then declare it "forgot" a requirement and go back and alter or add to the implementation.

cbovis · 2026-02-06T18:11:04 1770401464

AFAICT this is already baked into the GitHub Copilot agent. I read its sessions pretty often and reviewing/testing after writing code is a standard part of its workflow almost every time. It's kind of wild seeing how diligent it is even with the most trivial of changes.

bethekidyouwant · 2026-02-06T00:51:46 1770339106

You have to dump the context window for the review to work well.

tclancy · 2026-02-05T20:31:44 1770323504

How does this not use up tokens incredibly fast though? I have a Pro subscription and bang up against the limits pretty regularly.

doctoboggan · 2026-02-05T20:33:52 1770323632

It _does_ use up tokens incredibly fast, which is probably why Anthropic is developing this feature. This is mostly for corporations using the API, not individuals on a plan.

digdugdirk · 2026-02-05T20:44:16 1770324256

I'd love to see a breakdown of the token consumption of inaccurate/errored/unused task branches for claude code and codex. It seems like a great revenue source for the model providers.

shafyy · 2026-02-05T21:06:33 1770325593

Yeah, that's what I was thinking. They do have an incentive to not get everything right on the first try, as long as they don't overdo it... I also feel like they try to get more token usage by asking unnecessary follow-up questions that the user may say yes to, etc.

indemnity · 2026-02-06T06:31:07 1770359467

I had to go to Max, Pro is more like a taster.

At work tho we use Claude Code thru a proxy that uses the model hosted on AWS bedrock. It’s slower than consumer direct-to-Anthropic and you have to wait a bit for the latest models (Opus 4.5 took a while to get), but if our stats are to be believed it’s much much cheaper.

nickstinemates · 2026-02-06T01:59:10 1770343150

I don't know, all I can say is with API-based billing, doing multi-thousand line refactors that would take days to do costs like $4. In terms of value : effort, it's incredible.

andyferris · 2026-02-05T21:08:52 1770325732

It does use tokens faster, yes.

aqme28 · 2026-02-05T19:50:06 1770321006

I agree, but I've found that making an "adversarial" model within claude helps with the quality a lot. One agent makes the change, the other picks holes in it, and cycle. In the end, I'm left with less to review.

This sounds more like an automation of that idea than just N-times the work.

Keyframe · 2026-02-05T20:22:17 1770322937

Glad I'm not the only one. I do the same, but I tend to have gemini be the one that critiques.

diego898 · 2026-02-05T20:28:24 1770323304

Do you do this manually? Or some abstraction above that? skills, some light orchestration, etc?

aqme28 · 2026-02-05T20:33:15 1770323595

I just tell it to do so, but you could even add that as a requirement to CLAUDE.md

stpedgwdgfhgdd · 2026-02-05T19:27:53 1770319673

Exactly, one out of every three or four prompts requires tuning, nudging, or just stopping it. However, it takes seniority to see where it goes astray. I suspect that lots of folks don't even notice that CC is off. It works, it passes the tests, so it is good.

turtlebits · 2026-02-05T19:55:14 1770321314

Humans can't handle large tasks either, which is why you break them into manageable chunks.

Just ask claude to write a plan and review/edit it yourself. Add success criteria/tests for better results.

BonoboIO · 2026-02-05T19:02:35 1770318155

You definitely have to create some sort of PLAN.md and PROGRESS.md via a command and an implement command that delegates work. That is the only way that I can get bigger things done no matter how „good“ their task feature is.

You run out of context so quickly, and if you don’t have some kind of persistent guidance things go south.

ottah · 2026-02-05T19:10:31 1770318631

It's not sufficient, especially if I am not learning about the problem by being part of the implementation process. The models are still very weak reasoners, and writing code faster doesn't accelerate my understanding of the code the model wrote. Even with clear specs I am constantly fighting with it duplicating methods, writing ineffective tests, or implementing unnecessarily complex solutions. AI just isn't a better engineer than me, and that makes it a weak development partner.

vonneumannstan · 2026-02-05T20:36:06 1770323766

>AI just isn't a better engineer than me, and that makes it a weak development partner.

This would also be true of Junior Engineers. Do you find them impossible to work with as well?

koakuma-chan · 2026-02-05T19:07:15 1770318435

I tried doing that and it didn't work. It still adds "fallbacks" that just hide errors or the fact that there is no actual implementation and "In a real app, we would do X, just return null for now"

nprz · 2026-02-05T19:10:54 1770318654

There is research[0] currently being done on how to divide tasks among LLMs and combine their answers. This approach allows LLMs to reach outcomes (solving a problem that requires 1 million steps) which would be impossible otherwise.

[0]https://arxiv.org/abs/2511.09030

woah · 2026-02-05T20:17:33 1770322653

All they did was prompt an LLM over and over again to execute one iteration of a Tower of Hanoi algorithm. Literally just using it as a glorified scripting language:

```
Rules:
- Only one disk can be moved at a time.
- Only the top disk from any stack can be moved.
- A larger disk may not be placed on top of a smaller disk.

For all moves, follow the standard Tower of Hanoi procedure: If the previous move did not move disk 1, move disk 1 clockwise one peg (0 -> 1 -> 2 -> 0).
If the previous move did move disk 1, make the only legal move that does not involve moving disk1.

Use these clear steps to find the next move given the previous move and current state.

Previous move: {previous_move}
Current State: {current_state}
Based on the previous move and current state, find the single next move that follows the procedure and the resulting next state.
```

This is buried down in the appendix while the main paper is full of agentic swarms this and millions of agents that and plenty of fancy math symbols and graphs. Maybe there is more to it, but the fact that they decided to publish with such a trivial task which could be much more easily accomplished by having an llm write a simple python script is concerning.
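For comparison, a rough sketch of the kind of script meant here, following the same two rules as the quoted prompt. The function name and the list-of-lists peg representation are my own choices, not anything from the paper.

```python
from typing import List, Tuple

Move = Tuple[int, int, int]  # (disk, from_peg, to_peg)


def solve_hanoi(num_disks: int) -> Tuple[List[Move], List[List[int]]]:
    """Iterative Tower of Hanoi using the 'alternate disk 1' procedure."""
    # Peg 0 starts with all disks, largest at the bottom (end of list = top).
    pegs: List[List[int]] = [list(range(num_disks, 0, -1)), [], []]
    moves: List[Move] = []
    previous_moved_disk_1 = False

    # Stop once the whole tower sits on peg 1 or peg 2.
    while num_disks and max(len(pegs[1]), len(pegs[2])) < num_disks:
        if not previous_moved_disk_1:
            # Move disk 1 clockwise one peg (0 -> 1 -> 2 -> 0).
            src = next(i for i, peg in enumerate(pegs) if peg and peg[-1] == 1)
            dst = (src + 1) % 3
        else:
            # Make the only legal move that does not involve disk 1:
            # between the two other pegs, the smaller top disk moves.
            a, b = [i for i, peg in enumerate(pegs) if not (peg and peg[-1] == 1)]
            if not pegs[a] or (pegs[b] and pegs[b][-1] < pegs[a][-1]):
                src, dst = b, a
            else:
                src, dst = a, b
        disk = pegs[src].pop()
        pegs[dst].append(disk)
        moves.append((disk, src, dst))
        previous_moved_disk_1 = disk == 1
    return moves, pegs


moves, pegs = solve_hanoi(3)
```

For three disks this takes 7 moves and leaves the tower on peg 1; which peg the tower ends on depends on whether the disk count is odd or even, since disk 1 always moves clockwise.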

Spoom · 2026-02-05T21:41:46 1770327706

Good lord, I can only imagine the wasted electricity.

ottah · 2026-02-05T19:14:03 1770318843

No offense to the academic profession, but they're not a good source of advice for best practices in commercial software development. They don't have the experience or the knowledge sufficient to understand my workplace and tasks. Their skill set and job is orthogonal to the corporate world.

nprz · 2026-02-05T19:19:13 1770319153

Yes, the problem solved in the paper (Tower of Hanoi) is far more easily defined than 99% of actual problems you would find in commercial software development. Still proof of "theoretically possible" and seems like an interesting area of research.

findjashua · 2026-02-05T20:36:27 1770323787

you need a reviewer agent for every step of the process - review the plan generated by the planner, the update made by the task worker subagent, and a final reviewer once all tasks are done.

this does eat up tokens _very_ quickly though :(