Supers Found in model
How about coding performance?
Not sure yet, still processing the model :P
I don't have a lot of hope it'll be beating Qwen3.6-27B or Gemma4-31b, but maybe?
@tcclaviger can you share how you obtained that super expert table? While I like this model in the API, I found that ~4-bit GGUF quants can result in infinite reasoning loops. I might try quantizing while keeping these super expert layers at higher precision.
I ran a full REAP dataset through it by patching the REAP repo tools to work with Step3.7. The split was 0.2/0.3/0.5 across the math/agentic tools/coding datasets, and I collected the activations using consistent size/sampling per the REAP paper.
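As a rough illustration of that 0.2/0.3/0.5 mix, here is a minimal sketch of turning the weights into per-domain sample counts. The domain labels, the `split_counts` helper, and the total of 1000 samples are assumptions made for the example; they are not taken from the REAP repo.

```python
from fractions import Fraction

# Mix weights quoted in the post. The domain labels are illustrative only.
MIX = {
    "math": Fraction(2, 10),
    "agentic_tools": Fraction(3, 10),
    "coding": Fraction(5, 10),
}

def split_counts(total, mix):
    """Allocate `total` samples across domains with the largest-remainder rule."""
    exact = {name: total * weight for name, weight in mix.items()}
    counts = {name: int(value) for name, value in exact.items()}  # floor (values are >= 0)
    leftover = total - sum(counts.values())
    # Give any remaining samples to the domains with the largest fractional part.
    by_remainder = sorted(mix, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

print(split_counts(1000, MIX))  # {'math': 200, 'agentic_tools': 300, 'coding': 500}
```

Exact fractions are used so the counts always sum to the requested total, even when it is not divisible by ten.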
Then I cut it down from 288 experts to 240 using REAP activation output/frequency scoring with super/outlier expert protection, and finally realigned the routers to cope with the reduced expert count (a much smaller task for Step3.7 than for Qwen models, because its router is not pure softmax).
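The selection step can be sketched as "keep the top-scoring experts, but never drop a protected one". This is only a sketch under assumptions: `select_experts` is a hypothetical helper, the scores are made-up numbers standing in for whatever saliency value the REAP tooling produces, and router realignment is not shown.

```python
def select_experts(scores, keep, protected=()):
    """Return sorted indices of the experts to keep.

    scores:    one saliency score per expert (higher means more important)
    keep:      number of experts to retain
    protected: indices that must survive regardless of score (e.g. super experts)
    """
    protected = set(protected)
    if len(protected) > keep:
        raise ValueError("more protected experts than slots to keep")
    # Rank the unprotected experts by score, best first.
    candidates = [i for i in range(len(scores)) if i not in protected]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = protected | set(candidates[: keep - len(protected)])
    return sorted(kept)

# Toy example: expert 3 has the lowest score but is protected, so it stays.
print(select_experts([0.9, 0.1, 0.5, 0.05, 0.7], keep=3, protected={3}))  # [0, 3, 4]
```

For the layer sizes in the post the call would look like `select_experts(scores, keep=240, protected=super_expert_ids)` with 288 scores per layer, where `super_expert_ids` is again a placeholder name.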
I have quantized it into a custom quant format, based on Q4_NL but with much higher precision (think between Q5_K_XL and Q6_K_XL for accuracy), that runs in a modified vLLM I maintain. The key difference is in how I calculate the group scalars compared with Q4_NL: my method preserves outlier weights without clipping while also minimizing damage to the smaller weights in the group, which get crushed under the usual max-value-preservation logic.
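To make the "crushed" problem concrete, here is a small sketch of the baseline behaviour only, not the custom method described above (which is not published in this thread). It assumes a plain symmetric uniform grid with one absmax-derived scale per group, which is simpler than the non-linear codebook a format like Q4_NL uses; the weights are made-up numbers.

```python
def absmax_quantize(group, levels=7):
    """Quantize and dequantize one group with a single scale = max|w| / levels.

    levels=7 corresponds to a symmetric signed 4-bit grid (-7..7).
    """
    scale = max(abs(w) for w in group) / levels
    if scale == 0:
        return [0.0 for _ in group]
    # Round each weight to the nearest grid point, then map back to float.
    return [round(w / scale) * scale for w in group]

# Four small weights plus one outlier in the same group (illustrative values).
group = [0.02, -0.03, 0.01, 0.04, 2.8]
restored = absmax_quantize(group)

# The outlier survives almost exactly, but the scale it forces (about 0.4)
# is so coarse that all four small weights round to zero.
print(restored)
```

Any scheme that keeps the outlier unclipped while also keeping the small weights distinguishable has to pick the scale (or the grid) differently from this, which is the trade-off the post is pointing at.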
Thus far, zero issues. Vision, needle in a haystack, thinking modulation, MTP, everything is still working wonderfully. When I have more info, and if the model is worth using, I'll publish a 240-expert NVFP4 version and a few GGUFs. The cut allows it to fit on a 128 GB system with far fewer compromises on what is and is not quantized than the full-fat 288-expert model. It's still early testing, but I'm seeing 90%+ MTP acceptance rates with MTP 1 (haven't tested further yet).
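For a sense of what that acceptance rate means, here is a back-of-the-envelope sketch. It assumes each draft token is accepted independently with the same probability and ignores the cost of producing the drafts, so it is an upper-level estimate rather than a measured speedup; `expected_tokens` is a hypothetical helper.

```python
def expected_tokens(acceptance, drafts=1):
    """Expected tokens emitted per verification step.

    The step always yields one token from the main model, plus each draft
    token that is accepted; draft i only counts if all earlier drafts were
    accepted, hence the geometric sum 1 + p + p^2 + ... + p^drafts.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    return sum(acceptance ** i for i in range(drafts + 1))

# One draft token (MTP 1) at 90% acceptance: about 1.9 tokens per step.
print(expected_tokens(0.9, drafts=1))
```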
Good news:
With the reaped and quantized version, I went even further than normal and put both attention and the KV cache in FP8 to claw back some VRAM space...
On the code needle test https://github.com/tcclaviger/codeneedle it scores 100% accuracy after I checked the scoring by hand (I need to adjust the scorer, it mis-scored a few of the lines): zero missed lines and zero hallucinations. There were a few missed tool calls, but those were failures to infer that it should make the tool call, not actual failed tool calls.
Asking it to invoke the tools via chat worked exactly as expected. So now I can go on to the "does this model actually perform or is it an accurate idiot" phase 🥳
LMK when you put some GGUFs out. So far I've had no luck with any GGUFs: they always get stuck in an infinite reasoning loop on a certain task/benchmark I have locally, though I only have 128 GB of RAM and can only test 4-bit GGUFs.
That's the target size I'm building for: Halo 395+ / Spark / quad 32 GB GPUs.

