Supers Found in model

#11

by tcclaviger - opened 1 day ago

Discussion

tcclaviger

1 day ago

•

edited 1 day ago

For those doing quantization/REAP/REAM work here you go:

cloudyu

1 day ago

How about coding performance?

tcclaviger

1 day ago

Not sure yet, still processing the model :P

I don't have a lot of hope it'll be beating Qwen3.6-27B or Gemma4-31b, but maybe?

tarruda

about 12 hours ago

@tcclaviger can you share how you obtained that super expert table? While I like this model in the API, I found that ~4-bit GGUF quants can result in infinite reasoning loops. I might try quantizing while keeping these super expert layers at higher precision.

tcclaviger

about 9 hours ago

•

edited about 9 hours ago

I ran a full REAP dataset through it by patching the REAP repo tools to work with Step3.7. The split was 0.2/0.3/0.5 for math/agentic tools/coding datasets, with consistent size/sampling per the REAP paper, to get the activations.
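As a rough illustration of the 0.2/0.3/0.5 split, the sketch below turns mix weights into per-domain sample counts for a calibration set. The function name, the domain keys, and the total sample counts are assumptions for illustration only; the actual REAP tooling and dataset sizes used here are not shown in this thread.

```python
import math


def calibration_counts(total, weights):
    """Split `total` samples across domains in proportion to `weights`.

    Uses largest-remainder rounding so the counts always sum to `total`.
    (Hypothetical helper, not part of the REAP repo.)
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")

    # Ideal (fractional) share per domain.
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    # Start from the floor of each share ...
    counts = {name: math.floor(x) for name, x in exact.items()}
    # ... then hand the leftover samples to the largest fractional remainders.
    leftover = total - sum(counts.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Assumed domain names matching the math / agentic tools / coding split.
MIX = {"math": 0.2, "agentic_tools": 0.3, "coding": 0.5}
print(calibration_counts(1000, MIX))
```

Rounding by largest remainder keeps the total exact even when the requested size does not divide evenly across the three domains.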

Then I cut it down from 288 experts to 240 using REAP activation output/frequency scoring with super/outlier expert protection, and finally realigned the routers to cope with the reduced expert count (a much smaller task for Step3.7 than for Qwen models, because the router is not pure softmax).
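The expert selection step can be sketched roughly as follows: force-keep any protected super/outlier experts, then fill the remaining slots with the highest-scoring experts. This is a simplification under stated assumptions: the scores are placeholders rather than the REAP saliency criterion, both helper names are hypothetical, and router realignment is reduced to an old-to-new index map.

```python
def select_experts(scores, keep, protected=frozenset()):
    """Return sorted indices of the experts to keep.

    `scores` is one saliency value per expert (placeholder values here),
    `keep` is the target expert count, and `protected` holds expert
    indices that must survive pruning regardless of score.
    """
    n = len(scores)
    if not 0 <= keep <= n:
        raise ValueError("keep must be between 0 and the number of experts")
    if any(i < 0 or i >= n for i in protected):
        raise ValueError("protected index out of range")
    if len(protected) > keep:
        raise ValueError("more protected experts than slots to keep")

    kept = set(protected)
    # Rank the unprotected experts from most to least salient.
    ranked = sorted(
        (i for i in range(n) if i not in kept),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept.update(ranked[: keep - len(kept)])
    return sorted(kept)


def router_index_map(kept):
    """Map each surviving expert's old index to its new, compacted index."""
    return {old: new for new, old in enumerate(kept)}


# Toy example: 6 experts, keep 4, expert 3 is protected despite a low score.
toy_scores = [0.9, 0.1, 0.5, 0.05, 0.7, 0.3]
toy_kept = select_experts(toy_scores, keep=4, protected={3})
print(toy_kept, router_index_map(toy_kept))
```

In a real model the index map would be used to drop the pruned experts' rows from each router's output projection; any further realignment is model-specific and not covered here.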

I have quantized it into a custom quant format, based on Q4_NL but with much higher precision (think between Q5_K_XL and Q6_K_XL for accuracy), that runs in a modified vLLM I maintain. The key difference is in how I calculate the group scalars versus how Q4_NL does it: my method preserves outlier weights without clipping while minimizing damage to the smaller weights in the group, which get crushed under the usual max-value-preservation logic.
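To make the "crushed" small weights concrete, here is a minimal sketch of plain symmetric absmax group scaling on a uniform 4-bit grid. It is neither the Q4_NL codebook nor the custom method described above (which is not published in this thread); the function name and the toy weights are assumptions chosen only to show the failure mode.

```python
def absmax_quantize_group(weights, bits=4):
    """Quantize one weight group with a single absmax-derived scale.

    Returns (scale, integer codes, dequantized weights). The scale is
    chosen so the largest-magnitude weight maps to the top of the grid,
    i.e. the outlier is preserved without clipping.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    absmax = max(abs(w) for w in weights)
    if absmax == 0:
        return 0.0, [0] * len(weights), [0.0] * len(weights)

    scale = absmax / qmax
    codes = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    dequant = [c * scale for c in codes]
    return scale, codes, dequant


# Toy group: three small weights plus one outlier.
group = [0.01, -0.02, 0.015, 1.0]
scale, codes, dequant = absmax_quantize_group(group)
print(codes)    # the small weights all round to code 0
print(dequant)
```

The outlier survives intact, but the step size it forces (about 0.14 here) is larger than every other weight in the group, so those weights are lost entirely. That trade-off is what a different choice of group scalar has to balance.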

Thus far, zero issues. Vision, needle in a haystack, thinking modulation, MTP: everything is still working wonderfully. When I have more info, and if the model is worth using, I'll publish a 240-expert NVFP4 version and a few GGUFs. The reduced expert count lets it fit on a 128 GB system with far fewer compromises on what is and is not quantized than the full-fat 288-expert model. It's still early testing, but I'm seeing 90%+ MTP acceptance rates with MTP 1 (haven't tested further yet).

tcclaviger

about 9 hours ago

•

edited about 9 hours ago

Good news:
With the REAPed and quantized version, I went even further than normal and put attention in FP8 and the KV cache in FP8 to claw back some VRAM...

On the code needle test (https://github.com/tcclaviger/codeneedle) it scores 100% accuracy after checking the scoring (I need to adjust the scorer; it mis-scored a few of the lines): zero missed lines and zero hallucinations. There were a few missed tool calls, but those were failures to infer that it should make a tool call, not tool calls that actually failed.

Asking it to invoke the tools via chat worked exactly as expected. So now I can go on to the "does this model actually perform or is it an accurate idiot" phase 🥳

tarruda

about 8 hours ago

LMK when you put some GGUFs out. So far I've had no luck with any GGUFs; they always get stuck in an infinite reasoning loop on a certain task/benchmark I have locally, though I only have 128 GB of RAM and can only test 4-bit GGUFs.

tcclaviger

about 8 hours ago

That's the target size I'm building for: Halo 395+ / Spark / quad 32 GB GPUs.
