Incredibly fast. On my 5090 with CUDA 13 (and the latest diffusers, xformers, transformers, etc.), 9 sampling steps, and the "Tongyi-MAI/Z-Image-Turbo" model I get:
It stays around 26 GB at 512x512. I haven't profiled the execution or looked much into the details of the architecture yet, but I would assume it trades memory for speed by creating caches for each inference step.
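For reference, a minimal sketch of the kind of run described above, assuming the model loads through diffusers' generic `DiffusionPipeline` entry point (the exact pipeline class, supported arguments, and the prompt here are placeholders, not confirmed for this model):

```python
import torch
from diffusers import DiffusionPipeline

# Assumption: Z-Image-Turbo resolves via the generic pipeline loader.
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,  # half precision keeps VRAM use down
)
pipe.to("cuda")

image = pipe(
    "a photo of an astronaut riding a horse",  # placeholder prompt
    num_inference_steps=9,  # the step count mentioned above
    height=512,
    width=512,
).images[0]
image.save("out.png")

# Peak VRAM actually allocated by PyTorch during the run
print(f"{torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB peak")
```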
Did you use PyTorch-native or Diffusers inference? I couldn't get the former working yet, so I used Diffusers, but it's terribly slow on my 4080 (4 min/image). Trying again with PyTorch now; it seems Diffusers is expected to be slow.
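For what it's worth, 4 min/image on a 4080 sounds less like normal Diffusers overhead and more like the pipeline silently running in float32 or partly on CPU. A quick sanity check, assuming `pipe` is a loaded diffusers pipeline as in the sketch above:

```python
import torch

# A CPU-only PyTorch wheel is a common cause of multi-minute generations.
print("CUDA available:", torch.cuda.is_available())

# Check where each pipeline component ended up, and in what precision.
# Weights left on the CPU or in float32 would explain a big slowdown.
for name, module in pipe.components.items():
    if isinstance(module, torch.nn.Module):
        p = next(module.parameters(), None)
        if p is not None:
            print(name, p.device, p.dtype)
```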
Uh, not sure? I downloaded the portable build of ComfyUI and ran the CUDA-specific batch file it comes with.
(I'm not used to Windows and I don't know how to do anything complicated on that OS. Unfortunately, the computer with the big GPU also runs Windows.)