model = AutoModelForCausalLM.from_pretrained(repo_name, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True)
```

## Running Nemotron-Flash with TensorRT-LLM

### Setup

Installation and quick-start instructions for TensorRT-LLM are in the <a href="https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html">tutorial</a>.

### Quick example

An example script for running through the generation workflow:

```
cd examples/auto_deploy
python build_and_run_ad.py --model nvidia/Nemotron-Flash-3B-Instruct --args.yaml-extra nemotron_flash.yaml
```

### Serving with trtllm-serve

- Spin up a trtllm server (more details are in this <a href="https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve/trtllm-serve.html#starting-a-server">doc</a>):

```
trtllm-serve serve nvidia/Nemotron-Flash-3B-Instruct \
  --backend _autodeploy \
  --trust_remote_code \
  --extra_llm_api_options examples/auto_deploy/nemotron_flash.yaml
```

- Send a request (more details are in this <a href="https://nvidia.github.io/TensorRT-LLM/examples/curl_chat_client.html">doc</a>):

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Nemotron-Flash-3B-Instruct",
    "messages": [{"role": "user", "content": "Where is New York?"}],
    "max_tokens": 16,
    "temperature": 0
  }'
```
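
The same request can also be sent from Python. The snippet below is a minimal sketch, not part of the repo: the `build_chat_request` helper and its defaults are ours, and it simply constructs the same JSON body the curl example sends to the server's OpenAI-compatible endpoint.

```python
import json

# Hypothetical helper (not from the repo): build the same chat-completions
# request body that the curl example above sends.
def build_chat_request(prompt,
                       model="nvidia/Nemotron-Flash-3B-Instruct",
                       max_tokens=16,
                       temperature=0):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return "http://localhost:8000/v1/chat/completions", json.dumps(payload)

url, body = build_chat_request("Where is New York?")

# With the server from the previous step running, the request can be sent
# with any HTTP client, e.g.:
#   import requests
#   resp = requests.post(url, headers={"Content-Type": "application/json"},
#                        data=body).json()
#   print(resp["choices"][0]["message"]["content"])
```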

## Citation

```
@misc{fu2025nemotronflash,