Slow inference on rtx 3090
Hello, S2-pro is slow on my 3090, even when using the --compile flag;
I get between 4 and 5 it/s. Is this the expected speed?
2026-03-17 02:31:40.105 | INFO | fish_speech.models.text2semantic.inference:generate_long:653 - Encoded prompt shape: torch.Size([11, 669])
1%|β | 474/32098 [01:28<1:38:56, 5.33it/s]
2026-03-17 02:33:10.021 | INFO | fish_speech.models.text2semantic.inference:generate_long:682 - Compilation time: 89.96 seconds
2026-03-17 02:33:10.022 | INFO | fish_speech.models.text2semantic.inference:generate_long:690 - Batch 0: Generated 476 tokens in 89.96 seconds, 5.29 tokens/sec
2026-03-17 02:33:10.022 | INFO | fish_speech.models.text2semantic.inference:generate_long:694 - Bandwidth achieved: 24.14 GB/s
2026-03-17 02:33:10.023 | INFO | fish_speech.models.text2semantic.inference:generate_long:720 - GPU Memory used: 22.15 GB
=== Generation Complete! ===
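A quick sanity check on the numbers in that log (a rough back-of-the-envelope sketch; the ~936 GB/s peak bandwidth figure is NVIDIA's spec for the 3090, and the per-token byte count is simply derived from the reported rates, not measured):

```python
# Figures taken from the log above
tokens_per_sec = 5.29
bandwidth_gbs = 24.14  # GB/s reported by fish-speech

# Bytes effectively read per generated token (weights + KV cache traffic)
gb_per_token = bandwidth_gbs / tokens_per_sec  # ~4.56 GB/token

# RTX 3090 peak memory bandwidth (NVIDIA spec, ~936 GB/s)
peak_gbs = 936.0
utilization = bandwidth_gbs / peak_gbs  # ~2.6% of peak

# If decoding were purely bandwidth-bound at peak, the ceiling would be:
ceiling_tok_s = peak_gbs / gb_per_token  # ~205 tokens/sec
print(f"{gb_per_token:.2f} GB/token, {utilization:.1%} of peak, "
      f"ceiling ~{ceiling_tok_s:.0f} tok/s")
```

At ~2.6% of the card's peak bandwidth, the run looks badly underutilized rather than hardware-limited, which is consistent with the WSL2 result later in this thread.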
Maybe I should try using Sage Attention or Flash Attention 2?
I am on Windows, btw.
Thanks in advance.
It does have Flash Attention built in.
Yeah, I believe the minimum requirement is a 5090 with at least 24 GB VRAM.
I had similar performance on my 3090; something on the order of 5x slower than realtime, if I recall.
Re: GSherman's comment - the 3090 does have 24 GB of VRAM. (And the 5090 has 32 GB, so - I'm not sure what "a 5090 with at least 24 GB of VRAM" refers to other than a 5090.)
I'm not sure, bro.
I get the same speed on my 3090
I'm getting around 24 it/s on a 3090 Ti using the awesome_webui. This model does use a LOT of VRAM though, maxing out at 22 GB. Limited to 2048 text tokens :/
Finally, I installed S2 on WSL2 (the Linux subsystem for Windows) and achieved 24 it/s, a speed increase of around 350%. The problem, then, was Windows.
(What a surprise) lol
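Sanity-checking that figure against the rates quoted in this thread (simple arithmetic; the ~5 it/s baseline is the speed reported for the native-Windows runs above):

```python
slow = 5.0   # it/s on native Windows (from earlier in the thread)
fast = 24.0  # it/s on WSL2

speedup = fast / slow                    # 4.8x
percent_increase = (speedup - 1) * 100   # 380% increase
print(f"{speedup:.1f}x faster ({percent_increase:.0f}% increase)")
```

Depending on whether the baseline was closer to 4 or 5 it/s, that works out to roughly a 380-500% increase, so "around 350%" is, if anything, conservative.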
My experience with the slow inference was on a bare metal linux system - but, i didn't take note of the it/s rate.
Blakus - about how long was the audio, and how long did it take to generate?
The audio was around 20 to 25 seconds long, and the inference time around 10 to 15 seconds. Max VRAM: 21 to 22.5 GB.
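Taking the midpoints of those ranges, the real-time factor works out roughly as (a quick estimate from the numbers quoted, not a measurement):

```python
audio_s = (20 + 25) / 2   # ~22.5 s of generated audio
gen_s = (10 + 15) / 2     # ~12.5 s to generate it

rtf = gen_s / audio_s     # real-time factor; < 1 means faster than realtime
print(f"RTF ~{rtf:.2f} (about {audio_s / gen_s:.1f}x faster than realtime)")
```

So on a 3090 Ti under the right setup, generation runs close to 1.8x faster than realtime, versus roughly 5x slower than realtime in the slow runs reported earlier.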