model = AutoModelForCausalLM.from_pretrained(repo_name, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True)
```

## Running Nemotron-Flash with TensorRT-LLM

### Setup

Installation and quick-start instructions for TensorRT-LLM are in the <a href="https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html">tutorial</a>.

### Quick example

An example script for running through the generation workflow:

```
cd examples/auto_deploy
python build_and_run_ad.py --model nvidia/Nemotron-Flash-3B-Instruct --args.yaml-extra nemotron_flash.yaml
```

### Serving with trtllm-serve

- Spin up a trtllm server (more details are in this <a href="https://nvidia.github.io/TensorRT-LLM/commands/trtllm-serve/trtllm-serve.html#starting-a-server">doc</a>):

```
trtllm-serve serve nvidia/Nemotron-Flash-3B-Instruct \
  --backend _autodeploy \
  --trust_remote_code \
  --extra_llm_api_options examples/auto_deploy/nemotron_flash.yaml
```

- Send a request (more details are in this <a href="https://nvidia.github.io/TensorRT-LLM/examples/curl_chat_client.html">doc</a>):

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/Nemotron-Flash-3B-Instruct",
    "messages": [{"role": "user", "content": "Where is New York?"}],
    "max_tokens": 16,
    "temperature": 0
  }'
```
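
The same request can also be sent from Python. The snippet below is a minimal sketch, not part of the repo: the `build_chat_request` helper and its defaults are ours, and it simply constructs the same JSON body the curl example sends to the server's OpenAI-compatible endpoint.

```python
import json

# Hypothetical helper (not from the repo): build the same chat-completions
# request body that the curl example above sends.
def build_chat_request(prompt,
                       model="nvidia/Nemotron-Flash-3B-Instruct",
                       max_tokens=16,
                       temperature=0):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return "http://localhost:8000/v1/chat/completions", json.dumps(payload)

url, body = build_chat_request("Where is New York?")

# With the server from the previous step running, the request can be sent
# with any HTTP client, e.g.:
#   import requests
#   resp = requests.post(url, headers={"Content-Type": "application/json"},
#                        data=body).json()
#   print(resp["choices"][0]["message"]["content"])
```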

## Citation

```
@misc{fu2025nemotronflash,