How to use ENOT-AutoDL/gpt2-tensorrt with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="ENOT-AutoDL/gpt2-tensorrt")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("ENOT-AutoDL/gpt2-tensorrt", dtype="auto")
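A quick way to sanity-check the pipeline above is to generate a short continuation; the prompt and sampling settings here are illustrative, not taken from the model card:

# Generate with the pipeline created above (illustrative settings)
out = pipe("Once upon a time,", max_new_tokens=50, do_sample=True, temperature=0.5)
print(out[0]["generated_text"])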
How to use ENOT-AutoDL/gpt2-tensorrt with TensorRT:

# No code snippets available yet for this library.
# To use this model, check the repository files and the library's documentation.
# Want to help? PRs adding snippets are welcome at:
# https://github.com/huggingface/huggingface.js
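Until an official snippet lands, one possible route is ONNX Runtime with its TensorRT execution provider, since the repository ships TensorRT-compatible ONNX models (see below). This is only a sketch under assumptions: the filename "model.onnx" is a placeholder; check the repository files for the actual model names.

# Hedged sketch: run a repository ONNX file via ONNX Runtime's TensorRT provider
from huggingface_hub import hf_hub_download
import onnxruntime as ort

# "model.onnx" is an assumed filename; list the repository files for the real one
onnx_path = hf_hub_download(repo_id="ENOT-AutoDL/gpt2-tensorrt", filename="model.onnx")
session = ort.InferenceSession(
    onnx_path,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
# Inspect the graph to discover the expected input names and shapes
print([(i.name, i.shape) for i in session.get_inputs()])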
How to use ENOT-AutoDL/gpt2-tensorrt with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ENOT-AutoDL/gpt2-tensorrt"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "ENOT-AutoDL/gpt2-tensorrt",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
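Because the vLLM server speaks the OpenAI-compatible API, the same request can be made from Python with the openai client package; the api_key is a dummy value required by the client, not by vLLM:

# Call the local vLLM server through the OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # dummy key
completion = client.completions.create(
    model="ENOT-AutoDL/gpt2-tensorrt",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)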
How to use ENOT-AutoDL/gpt2-tensorrt with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "ENOT-AutoDL/gpt2-tensorrt" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "ENOT-AutoDL/gpt2-tensorrt",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'

# Or run the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "ENOT-AutoDL/gpt2-tensorrt" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "ENOT-AutoDL/gpt2-tensorrt",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
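The same SGLang endpoint can also be called from Python with the requests library; this simply mirrors the curl call above:

# Python equivalent of the curl call to the SGLang server
import requests

response = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "ENOT-AutoDL/gpt2-tensorrt",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5,
    },
)
print(response.json()["choices"][0]["text"])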
How to use ENOT-AutoDL/gpt2-tensorrt with Docker Model Runner:

docker model run hf.co/ENOT-AutoDL/gpt2-tensorrt
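Assuming a Docker release that ships Model Runner, the command also accepts a one-shot prompt as an argument (the prompt below is illustrative); without it, the command opens an interactive session:

# One-shot prompt (illustrative); omit it for an interactive session
docker model run hf.co/ENOT-AutoDL/gpt2-tensorrt "Once upon a time,"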
This repository contains GPT-2 ONNX models compatible with TensorRT.
Quantization was performed with the ENOT-AutoDL framework. Code for building TensorRT engines, together with usage examples, is published on GitHub.
| Metric | TensorRT INT8+FP32 | torch FP16 |
|---|---|---|
| Lambada Acc | 72.11% | 71.43% |
| Input sequence length | Number of generated tokens | TensorRT INT8+FP32 (ms) | torch FP16 (ms) | Speedup |
|---|---|---|---|---|
| 64 | 64 | 462 | 1190 | 2.58 |
| 64 | 128 | 920 | 2360 | 2.54 |
| 64 | 256 | 1890 | 4710 | 2.54 |
An inference example and an accuracy test are published on GitHub:
git clone https://github.com/ENOT-AutoDL/ENOT-transformers