Slow inference on rtx 3090
Hello, S2-pro is slow on my 3090, even when using the --compile flag;
I get between 4 and 5 it/s. Is this the expected speed?
2026-03-17 02:31:40.105 | INFO | fish_speech.models.text2semantic.inference:generate_long:653 - Encoded prompt shape: torch.Size([11, 669])
1%|β | 474/32098 [01:28<1:38:56, 5.33it/s]
2026-03-17 02:33:10.021 | INFO | fish_speech.models.text2semantic.inference:generate_long:682 - Compilation time: 89.96 seconds
2026-03-17 02:33:10.022 | INFO | fish_speech.models.text2semantic.inference:generate_long:690 - Batch 0: Generated 476 tokens in 89.96 seconds, 5.29 tokens/sec
2026-03-17 02:33:10.022 | INFO | fish_speech.models.text2semantic.inference:generate_long:694 - Bandwidth achieved: 24.14 GB/s
2026-03-17 02:33:10.023 | INFO | fish_speech.models.text2semantic.inference:generate_long:720 - GPU Memory used: 22.15 GB
=== Generation Complete! ===
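A quick sanity check on the numbers in that log (a rough back-of-the-envelope sketch; the ~936 GB/s peak bandwidth figure is NVIDIA's spec for the 3090, and the per-token byte count is simply derived from the reported rates, not measured):

```python
# Figures taken from the log above
tokens_per_sec = 5.29
bandwidth_gbs = 24.14  # GB/s reported by fish-speech

# Bytes effectively read per generated token (weights + KV cache traffic)
gb_per_token = bandwidth_gbs / tokens_per_sec  # ~4.56 GB/token

# RTX 3090 peak memory bandwidth (NVIDIA spec, ~936 GB/s)
peak_gbs = 936.0
utilization = bandwidth_gbs / peak_gbs  # ~2.6% of peak

# If decoding were purely bandwidth-bound at peak, the ceiling would be:
ceiling_tok_s = peak_gbs / gb_per_token  # ~205 tokens/sec
print(f"{gb_per_token:.2f} GB/token, {utilization:.1%} of peak, "
      f"ceiling ~{ceiling_tok_s:.0f} tok/s")
```

At ~2.6% of the card's peak bandwidth, the run looks badly underutilized rather than hardware-limited, which is consistent with the WSL2 result later in this thread.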
Maybe I should try using Sage Attention or Flash Attention 2?
I am on Windows, btw.
Thanks in advance.
It does have Flash Attention built in.
Yeah, I believe the minimum requirement is a 5090 with at least 24 GB VRAM.
I had similar performance on my 3090; something on the order of 5x slower than realtime, if I recall.
Re: GSherman's comment - the 3090 does have 24 GB of VRAM. (And the 5090 has 32 GB, so - I'm not sure what "a 5090 with at least 24 GB of VRAM" refers to other than a 5090.)
I'm not sure, bro.
I get the same speed on my 3090
I'm getting around 24 it/s on a 3090 Ti using the awesome_webui. This model does use a LOT of VRAM though, maxing out at 22 GB. Limited to 2048 text tokens :/
Finally, I installed S2 on WSL2 (the Linux subsystem for Windows) and achieved 24 it/s, a speed increase of around 350%. The problem, then, was Windows.
(What a surprise) lol
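Sanity-checking that figure against the rates quoted in this thread (simple arithmetic; the ~5 it/s baseline is the speed reported for the native-Windows runs above):

```python
slow = 5.0   # it/s on native Windows (from earlier in the thread)
fast = 24.0  # it/s on WSL2

speedup = fast / slow                    # 4.8x
percent_increase = (speedup - 1) * 100   # 380% increase
print(f"{speedup:.1f}x faster ({percent_increase:.0f}% increase)")
```

Depending on whether the baseline was closer to 4 or 5 it/s, that works out to roughly a 380-500% increase, so "around 350%" is, if anything, conservative.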
My experience with the slow inference was on a bare metal linux system - but, i didn't take note of the it/s rate.
Blakus - about how long was the audio, and how long did it take to generate?
The audio was around 20 to 25 seconds long, and the inference time around 10 to 15 seconds. Max VRAM: 21 to 22.5 GB.
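Taking the midpoints of those ranges, the real-time factor works out roughly as (a quick estimate from the numbers quoted, not a measurement):

```python
audio_s = (20 + 25) / 2   # ~22.5 s of generated audio
gen_s = (10 + 15) / 2     # ~12.5 s to generate it

rtf = gen_s / audio_s     # real-time factor; < 1 means faster than realtime
print(f"RTF ~{rtf:.2f} (about {audio_s / gen_s:.1f}x faster than realtime)")
```

So on a 3090 Ti under the right setup, generation runs close to 1.8x faster than realtime, versus roughly 5x slower than realtime in the slow runs reported earlier.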