More

pawelduda · 2025-12-11T16:14:47 1765469687

If you just ask it to find problems, it will do its best to find them - like running a while loop with no return condition. That's why I put some breaker in the prompt, which in this case would be "don't make any improvements if the positive impact is marginal". I've mostly seen it do nothing and just summarize why, followed by some suggestions in case I still want to force the issue

f311a · 2025-12-11T16:22:05 1765470125

I guess "marginal impact" for them is a pretty random metric, which will be different on each run. Will try it next time.

Another problem is that they try to add handling of different cases that are never present in my data. I have to mention that there is no need to update handling to be more generalized. For example, my code handles PNG files, and they add JPG handling that never happens.

pawelduda · 2025-12-11T16:08:18 1765469298

Did it create 200 CODE_QUALITY_IMPROVEMENTS.md files by chance?

pawelduda · 2025-12-10T18:43:22 1765392202

Unfortunately same about my XPS, looked promising but turned to shit faster than I'd expect

pawelduda · 2025-12-08T13:22:45 1765200165

What do you get with claude code that isn't already in cursor? I've only used it in cursor

pawelduda · 2025-12-07T17:10:57 1765127457

This screams AI, 100%

pawelduda · 2025-12-06T17:10:41 1765041041

Did anyone test it on 5090? I saw some 30xx reports and it seemed very fast

egeres · 2025-12-06T22:43:04 1765060984

Incredibly fast, on my 5090 with CUDA 13 (& the latest diffusers, xformers, transformers, etc...), 9 samplig steps and the "Tongyi-MAI/Z-Image-Turbo" model I get:

- 1.5s to generate an image at 512x512

- 3.5s to generate an image at 1024x1024

- 26.s to generate an image at 2048x2048

It uses almost all the 32Gb Gb of VRAM and GPU usage. I'm using the script from the HF post: https://huggingface.co/Tongyi-MAI/Z-Image-Turbo

SV_BubbleTime · 2025-12-07T04:54:55 1765083295

Weird, even at 2048 I don’t think it should be using all your 32GB VRAM.

egeres · 2025-12-07T11:08:30 1765105710

It stays around 26Gb at 512x512. I still haven't profiled the execution or looked much into the details of the architecture but I would assume it trades off memory for speed by creating caches for each inference step

SV_BubbleTime · 2025-12-07T15:44:42 1765122282

IDK. Seems odd. It’s an 11GB model, I don’t know what it could caching in ram.

Wowfunhappy · 2025-12-06T18:35:10 1765046110

Even on my 4080 it's extremely fast, it takes ~15 seconds per image.

accrual · 2025-12-06T23:56:10 1765065370

Did you use PyTorch Native or Diffusers Inference? I couldn't get the former working yet so I used Diffusers, but it's terribly slow on my 4080 (4 min/image). Trying again with PyTorch now, seems like Diffusers is expected to be slow.

Wowfunhappy · 2025-12-07T00:01:47 1765065707

Uh, not sure? I downloaded the portable build of ComfyUI and ran the CUDA-specific batch file it comes with.

(I'm not used to using Windows and I don't know how to do anything complicated on that OS. Unfortunately, the computer with the big GPU also runs Windows.)

accrual · 2025-12-07T00:19:38 1765066778

Haha, I know how it goes. Thanks, I'll give that a try!

Update: works great and much faster via ComfyUI + the provided workflow file.

pawelduda · 2025-12-04T21:53:58 1764885238

Sounds plausible but I guess it's something that they would've confirmed, had it been true

Or it was ABS-CF but they forgot to dry the filament /s

pawelduda · 2025-12-02T01:14:22 1764638062

Unless you know and trust person X, you don't want to authorize and interact with such contracts. Scammers will leave loopholes in code so they can, for example, grab all funds deposited to the contract.

Normal contracts that involve money operations would have safeguards that disallow the owner to touch balance that is not theirs. But there's billion of creative attack vectors to bypass that, either by that person X, or any 3rd party

pawelduda · 2025-12-01T17:20:22 1764609622

The end effect certainly gives off "understanding" vibe. Even if method of achieving it is different. The commenter obviously didn't mean the way human brain understands

pawelduda · 2025-11-18T16:28:13 1763483293

Why is this particular benchmark important?

aliljet · 2025-11-18T16:31:13 1763483473

Thus far, this is one of the best objective evaluations of real world software engineering...

RamtinJ95 · 2025-11-18T17:07:05 1763485625

I concur with the other commenters, 4.5 is a clear improvement over 4.

adastra22 · 2025-11-18T16:49:12 1763484552

Idk, Sonnet 4.5 score better than Sonnet 4.0 on that benchmark, but is markedly worse in my usage. The utility of the benchmark is fading as it is gamed.

meowface · 2025-11-18T16:51:16 1763484676

I think I and many others have found Sonnet 4.5 to generally be better than Sonnet 4 for coding.

adastra22 · 2025-11-18T17:07:06 1763485626

Maybe if you confirm to its expectations for how you use it. 4.5 is absolutely terrible for following directions, thinks it knows better than you, and will gaslight you until specifically called out on its mistake.

I have scripted prompts for long duration automated coding workflows of the fire and forget, issue description -> pull request variety. Sonnet 4 does better than you’d expect: it generates high quality mergable code about half the time. Sonnet 4.5 fails literally every time.

pawelduda · 2025-11-18T18:08:14 1763489294

I'm very happy with it TBH, it has some things that annoy me a little bit:

- slower compared to other models that will also do the job just fine (but excels at more complex tasks),

- it's very insistent on creating loads of .MD files with overly verbose documentation on what it just did (not really what I ask it to do),

- it actually deleted a file twice and went "oops, I accidentaly deleted the file, let me see if I can restore it!", I haven't seen this happen with any other agent. The task wasn't even remotely about removing anything

adastra22 · 2025-11-18T18:16:36 1763489796

The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).

And yes, I have hooks to disable 'git reset', 'git checkout', etc., and warn the model not to use these commands and why. So it writes them to a bash script and calls that to circumvent the hook, successfully shooting itself in the foot.

Sonnet 4.5 will not follow directions. Because of this, you can't prevent it like you could with earlier models from doing something that destroys the worktree state. For longer-running tasks the probability of it doing this at some point approaches 100%.

ewoodrich · 2025-11-18T19:34:52 1763494492

> The last point is how it usually fails in my testing, fwiw. It usually ends up borking something up, and rather than back out and fix it, it does a 'git restore' on the file - wiping out thousands of lines of unrelated, unstaged code. It then somehow thinks it can recover this code by looking in the git history (??).

Man I've had this exact thing happen recently with Sonnet 4.5 in Claude Code!

With Claude I asked it to try tweaking the font weight of a heading to put the finishing touches on a new page we were iterating on. Looked at it and said, "Never mind, undo that" and it nuked 45 minutes worth of work by running git restore.

It immediately realized it fucked up and started running all sorts of git commands and reading its own log trying to reverse what it did and then came back 5 minutes later saying "Welp I lost everything, do you want me to manually rebuild the entire page from our conversation history?

In my CLAUDE.md I have instructions to commit unstaged changes frequently but it often forgets and sure enough, it forgot this time too. I had it read its log and write a post-mortem of WTF led it to run dangerous git commands to remove one line of CSS and then used that to write more specific rules about using git in the project CLAUDE.md, and blocked it from running "git restore" at all.

We'll see if that did the trick but it was a good reminder that even "SOTA" models in 2025 can still go insane at the drop of a hat.

adastra22 · 2025-11-18T23:17:35 1763507855

The problem is that I'm trying to build workflows for generating sequences of good, high quality semantically grouped changes for pull requests. This requires having a bunch of unrelated changes existing in the work tree at the same time, doing dependency analysis on the sequence of commits, and then pulling out / staging just certain features at a time and committing those separately. It is sooo much easier to do this by explicitly avoiding the commit-every-2-seconds workaround and keeping things uncommitted in the work tree.

I have a custom checkpointing skill that I've written that it is usually good about using, making it easier to rewind state. But that requires a careful sequence of operations, and I haven't been able to get 4.5 to not go insane when it screws up.

As I said though, watch out for it learning that it can't run git restore, so it immediately jumps to Bash(echo "git restore" >file.sh && chmod +x file.sh && ./file.sh).

meowface · 2025-11-18T19:02:22 1763492542

I think this is probably just a matter of noise. That's not been my experience with Sonnet 4.5 too often.

Every model from every provider at every version I've used has intermingled brilliant perfect instruction-following and weird mistaken divergence.

adastra22 · 2025-11-18T23:20:18 1763508018

What do you mean by noise?

In this case I can't get 4.5 to follow directions. Neither can anyone else, aparantly. Search for "Sonnet 4.5 follow instructions" and you'll find plenty of examples. The current top 2:

https://www.reddit.com/r/ClaudeCode/comments/1nu1o17/45_47_5...

https://theagentarchitect.substack.com/p/claude-sonnet-4-pro...

epolanski · 2025-11-18T19:21:26 1763493686

Not my experience at all, 4.5 is leagues ahead the previous models albeit not as good as Gemini 2.5.

pertymcpert · 2025-11-18T16:53:58 1763484838

I find 4.5 a much better model FWIW.