Can anyone tell me what sort of speed I'd be looking at on an RTX 5060 Ti 16GB?

#2
by AlexTheSandGrinder - opened

Does anyone have the card? (On FP4, I mean.)

Owner

For nvfp4, the speedup is pretty significant, at least on my 5070 Ti. I haven't measured it exactly, but it's very obviously faster.
Side note, though: nvfp4 doesn't work very well for LTX 2.3. Even the official quant runs into the same issues as this one. If you're using it with the distil LoRA, you'll probably see outputs that are grayish, or that turn gray over time for image-to-video. I recommend fp8 scaled instead. It doesn't really get a speedup, but with the distil LoRA it should still be fast enough.

Thanks for the reply, but how could you not have measured it? 😁
Do you have an estimate at least? 5-6 minutes for a 5 sec video? Less?
Also, what about your WAN nvfp4? https://huggingface.co/GitMylo/Wan_2.2_nvfp4

I'm mainly asking because I'm eyeing an RTX 5060 Ti 16GB (my budget doesn't allow for more, obviously, or I'd have got something better) and am curious what sort of speed I'd get on these quants, be it WAN 2.2 or LTX.

Owner

I didn't make direct comparisons, but I can provide some generation times. Distilled LTX 2 is faster than distilled Wan for me in any case, because it's both more efficient and only uses one model.
These are on my 5070 Ti, using distil, 8 steps, 24 fps (frame counts are included with the times).

(Times are longer since it's t2i2v. My main bottleneck is offloading, since I've got DDR4 RAM and use a pagefile for over 48 GB.) Because the t2i step is included, "seconds faster" is a valid comparison here, but "percentage speedup" isn't, given how much time goes to the t2i step and to re-loading the text encoder and model. I also didn't keep exact track of resolutions, so there may be some variance there; SwarmUI doesn't record the resolution used. Generally, 960x960 or re-ratioed variants worked best for nvfp4, and fp8 worked fine at 720x720 as well. I mostly used those resolutions, although some gens used different ones.

fp8 (most are likely 720x720, except for the ones marked)

  • 3.41 min (371 frames)
  • 3.16 min (371 frames)
  • 3.21 min (371 frames)
  • 2.66 min (241 frames)
  • 2.65 min (241 frames)
  • 3.88 min (241 frames, might have been 960x960 here)
  • 3.93 min (241 frames, probably 960)

nvfp4 (most are likely 960x960)

  • 2.66 min (125 frames, 16 step)
  • 2.41 min (125 frames, 16 step)
  • 2.01 min (125 frames, back to 8 step)
  • 119.02 sec (125 frames)
  • 2.00 min (125 frames)
  • 113.22 sec (121 frames)
  • 114.44 sec (121 frames)
  • 2.04 min (250 frames, probably 720x720)
  • 2.13 min (250 frames, again)
  • 92.16 sec (125 frames, probably 720)
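Since the lists above mix minutes and seconds, here is a minimal sketch that puts every reported time in seconds and divides by the frame count. The numbers are copied from the lists; the helper names are made up for illustration. Keep in mind that these totals include the t2i step and model re-loading, and that resolutions and step counts vary between runs, so the per-frame figures are only a rough way to line the runs up, not a clean benchmark.

```python
# Reported runs copied from the lists above: (value, unit, frames).
FP8 = [(3.41, "min", 371), (3.16, "min", 371), (3.21, "min", 371),
       (2.66, "min", 241), (2.65, "min", 241),
       (3.88, "min", 241), (3.93, "min", 241)]
NVFP4 = [(2.66, "min", 125), (2.41, "min", 125), (2.01, "min", 125),
         (119.02, "sec", 125), (2.00, "min", 125),
         (113.22, "sec", 121), (114.44, "sec", 121),
         (2.04, "min", 250), (2.13, "min", 250), (92.16, "sec", 125)]


def to_seconds(value, unit):
    """Convert a reported time to seconds ('min' or 'sec')."""
    return value * 60.0 if unit == "min" else float(value)


def seconds_per_frame(runs):
    """Total wall time divided by frame count, one value per run."""
    return [to_seconds(value, unit) / frames for value, unit, frames in runs]


for name, runs in (("fp8", FP8), ("nvfp4", NVFP4)):
    for (value, unit, frames), spf in zip(runs, seconds_per_frame(runs)):
        total = to_seconds(value, unit)
        print(f"{name}: {total:7.2f} s total, {frames} frames, {spf:.3f} s/frame")
```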

What's useful to note is that LTX's sampling is only about 10% of the generation time for me, which makes it hard to benchmark from total generation times alone. Each sampling step usually takes only a few seconds, around 5 s/it, so an estimated 40 seconds (8 steps) for 241 frames at 720 res.
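The 40 second estimate is just step count times seconds per iteration. A small sketch of that arithmetic, plus the time left over for everything else (t2i, loading, offloading). Pairing the estimate with the 2.65 min fp8 run (241 frames, likely 720x720) is my assumption, and the function names are illustrative only.

```python
def sampling_seconds(steps, seconds_per_iteration):
    """Estimated time spent in the sampler alone."""
    return steps * seconds_per_iteration


def non_sampling_seconds(total_seconds, steps, seconds_per_iteration):
    """Wall time left once the sampling estimate is subtracted."""
    return total_seconds - sampling_seconds(steps, seconds_per_iteration)


estimate = sampling_seconds(8, 5.0)  # 8 steps at about 5 s/it
print(f"sampling: {estimate:.0f} s")

# Assumption: compare against the 2.65 min fp8 run from the list above.
total = 2.65 * 60.0
rest = non_sampling_seconds(total, 8, 5.0)
print(f"everything else: {rest:.0f} s")
```

Most of the wall time ends up outside the sampler, which is why total generation time says little about the quant itself.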

As for the Wan version, I didn't personally see huge speedups, but people have said it made a big difference. I'm mostly bottlenecked by memory and disk speed anyway.

Thank you so much for your detailed response. My card did arrive, eventually. But with one of your FP4 models I'm not seeing any speed gains (same speed as FP16). Do I need a custom node for it? I'm aware lots of things might be wrong with my setup, I'm not asking that you troubleshoot. But I'm getting:

Found quantization metadata version 1
Detected mixed precision quantization
Using mixed precision operations
model weight dtype torch.float16, manual cast: torch.float16
model_type FLOW

So... it's upcasting to f16. I assume I have to somehow force it to run at FP4?

In Comfy it should work. On launch, check what it prints for the comfy kitchen backends; I believe the cuda backend for comfy kitchen should be available to get a speedup.
Also make sure your PyTorch is up to date and running with CUDA 13 (or higher).
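One quick way to check the CUDA part is to read it off the PyTorch version string, which ComfyUI prints at launch and which is also available as `torch.__version__`. A minimal sketch, assuming the usual wheel tag format where the suffix looks like `+cu128` or `+cu130` (last digit is the minor version); the helper names are made up, and a build without such a suffix simply returns `None`.

```python
import re


def cuda_major_from_torch_version(version):
    """Return the CUDA major version encoded in a PyTorch version string, or None."""
    match = re.search(r"\+cu(\d+)$", version)
    if match is None:
        return None  # CPU build, or a build without a +cuXYZ tag
    digits = match.group(1)
    # Tags such as cu128 or cu130 carry the minor version in the last digit.
    return int(digits[:-1]) if len(digits) > 1 else int(digits)


def meets_cuda_13(version):
    """True if the version string reports CUDA 13 or higher."""
    major = cuda_major_from_torch_version(version)
    return major is not None and major >= 13


print(meets_cuda_13("2.11.0+cu130"))  # True
print(meets_cuda_13("2.7.0+cu128"))   # False
```

With PyTorch installed, passing `torch.__version__` to `meets_cuda_13` gives the same check for the running environment; `torch.version.cuda` reports the CUDA version directly as well.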

Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_mxfp8', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_mxfp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Checkpoint files will always be loaded safely.
Total VRAM 16311 MB, total RAM 32674 MB
pytorch version: 2.11.0+cu130
xformers version: 0.0.35
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 5060 Ti : cudaMallocAsync

Is anything wrong here?
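For anyone comparing their own launch output, the backend lines above can be read programmatically, since the part after the colon is a plain Python dict literal. A hedged sketch that lists which backends are available, not disabled, and advertise `scaled_mm_nvfp4`. It only reads the printed fields; how ComfyUI actually picks a backend at run time is not something this shows. The function names are illustrative.

```python
import ast
import re

# The three backend lines exactly as pasted in the post above.
LOG = """\
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_mxfp8', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_mxfp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_mxfp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
"""

LINE_RE = re.compile(r"Found comfy_kitchen backend (\w+): (\{.*\})\s*$")


def parse_backends(log_text):
    """Map backend name to the dict printed on its log line."""
    backends = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            # literal_eval only accepts Python literals, so it is safe on log text.
            backends[match.group(1)] = ast.literal_eval(match.group(2))
    return backends


def usable_backends_with(capability, backends):
    """Backends that are available, not disabled, and list the capability."""
    return [name for name, info in backends.items()
            if info["available"] and not info["disabled"]
            and capability in info["capabilities"]]


backends = parse_backends(LOG)
print(usable_backends_with("scaled_mm_nvfp4", backends))  # ['eager', 'cuda']
```

In this log the cuda backend is both enabled and lists `scaled_mm_nvfp4`, while triton is marked disabled.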

Okay... I got it working. But for WAN 2.2 I also see no extraordinary speed bump (380 seconds down to 360 seconds total, 8 steps), comparing a mixed-quant WAN model against the FP4 model.

Edit: Perhaps the cache was stale? Anyway, it now goes from 40 s/step to 30 s/step (the difference between the same two models).
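To put the two measurements in the same terms, here is the arithmetic as a tiny sketch (helper names are made up): 380 s to 360 s total is about a 5% reduction, while 40 s/step to 30 s/step is a 25% reduction, or roughly 1.33x per step.

```python
def percent_reduction(before, after):
    """How much less time 'after' takes, as a percentage of 'before'."""
    return (before - after) / before * 100.0


def speedup_factor(before, after):
    """How many times faster 'after' is than 'before'."""
    return before / after


print(f"{percent_reduction(380, 360):.1f}% less total time")   # 5.3%
print(f"{percent_reduction(40, 30):.1f}% less time per step")  # 25.0%
print(f"{speedup_factor(40, 30):.2f}x per step")               # 1.33x
```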
