Instructions for using LLMWildling/gpt-oss-140b-ren-2 with libraries, inference servers, and local apps.

## Transformers

How to use LLMWildling/gpt-oss-140b-ren-2 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="LLMWildling/gpt-oss-140b-ren-2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load the model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("LLMWildling/gpt-oss-140b-ren-2")
model = AutoModelForCausalLM.from_pretrained("LLMWildling/gpt-oss-140b-ren-2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
## vLLM
How to use LLMWildling/gpt-oss-140b-ren-2 with vLLM:
Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "LLMWildling/gpt-oss-140b-ren-2"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "LLMWildling/gpt-oss-140b-ren-2",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker:
```shell
docker model run hf.co/LLMWildling/gpt-oss-140b-ren-2
```
## SGLang
How to use LLMWildling/gpt-oss-140b-ren-2 with SGLang:
Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "LLMWildling/gpt-oss-140b-ren-2" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "LLMWildling/gpt-oss-140b-ren-2",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use the Docker image:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "LLMWildling/gpt-oss-140b-ren-2" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "LLMWildling/gpt-oss-140b-ren-2",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

## Docker Model Runner
How to use LLMWildling/gpt-oss-140b-ren-2 with Docker Model Runner:
```shell
docker model run hf.co/LLMWildling/gpt-oss-140b-ren-2
```
# gpt-oss-140b Ren-2
gpt-oss-140b, codename Ren-2, is an agentic / SWE-oriented derivative of OpenAI GPT-OSS 120B.
This release takes the 120B base model and adds roughly 20B more parameters oriented toward agentic coding and SWE-style behavior.
## Overview
- Base model: `openai/gpt-oss-120b`
- Release name: gpt-oss-140b Ren-2
- Format: MXFP4
- Intended use: coding, agentic coding, SWE-style assistant workflows
- Status: research preview
Ren-2 is meant to feel like a more agentic version of GPT-OSS 120B rather than a generic continuation of the base checkpoint.
## Training
- Built on a custom framework
- Roughly 3 hours of pre-training / post-training work for this release path
- Expanded from the 120B base with roughly 20B additional parameters
This is an iterative open release. More sizes and follow-up revisions will come later.
## Inference
This model was tested with:
vLLM 0.19.0
Recommended serving settings:
- `num_experts_per_tok=12`
- `--reasoning-parser openai_gptoss`
- `--tool-call-parser openai`
- `--enable-auto-tool-choice`
The original base setup used 4 active experts per token. Ren-2 is intended to run at 12 active experts per token.
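Putting the settings above together, a vLLM launch might look like the following sketch. The `--hf-overrides` flag (vLLM's mechanism for overriding Hugging Face config values via JSON) is an assumption here; it is only needed if `num_experts_per_tok=12` is not already baked into the shipped `config.json`:

```shell
# Sketch: serve Ren-2 with the recommended settings.
# --hf-overrides is only required if num_experts_per_tok=12
# is not already set in the model's config.json.
vllm serve "LLMWildling/gpt-oss-140b-ren-2" \
  --hf-overrides '{"num_experts_per_tok": 12}' \
  --reasoning-parser openai_gptoss \
  --tool-call-parser openai \
  --enable-auto-tool-choice
```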
Rough active-equivalent compute:
- original `top-k=4`: about 5.7B active-equivalent params
- Ren-2 `top-k=12`: about 12.9B active-equivalent params
These are approximate active-equivalent numbers, not total parameter counts.
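The two figures above are consistent with a simple linear model: a fixed set of always-active (shared) parameters plus a per-expert cost scaled by top-k. The ~2.1B shared and ~0.9B per-expert figures below are back-solved from the two reported points, not published values:

```python
# Linear active-parameter model back-solved from the two
# reported points: top-k=4 -> ~5.7B, top-k=12 -> ~12.9B.
# per_expert = (12.9 - 5.7) / (12 - 4) = 0.9B
# shared     = 5.7 - 4 * 0.9          = 2.1B
def active_equiv_params(top_k, shared_b=2.1, per_expert_b=0.9):
    """Approximate active-equivalent parameters, in billions."""
    return shared_b + per_expert_b * top_k

print(active_equiv_params(4))   # original routing, ~5.7
print(active_equiv_params(12))  # Ren-2 routing, ~12.9
```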
In internal agentic-task traces at top-k=12, roughly half of active routing traffic ran through the added 20B expansion. Observed new-expert usage in those traces was about 48.6% of active expert selections and about 46.2% of routing mass. This is workload-dependent.
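The two percentages measure different things: the share of expert *selections* that land in the added experts versus the share of *routing mass* (gate weights) they receive. A hypothetical sketch of how such numbers could be computed from routing traces; the trace format and the 128-expert base count are illustrative assumptions, not details of the actual training framework:

```python
def new_expert_stats(selections, num_base_experts):
    """selections: list of (expert_id, gate_weight) pairs taken
    from routing traces. Experts with id >= num_base_experts are
    the added ones. Returns (selection fraction, mass fraction)."""
    total_mass = sum(w for _, w in selections)
    new = [(e, w) for e, w in selections if e >= num_base_experts]
    sel_frac = len(new) / len(selections)
    mass_frac = sum(w for _, w in new) / total_mass
    return sel_frac, mass_frac

# Toy trace: experts 0..127 are "base", 128+ are the expansion.
trace = [(3, 0.4), (130, 0.3), (7, 0.2), (140, 0.1)]
print(new_expert_stats(trace, 128))
```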
This release is intended to run directly from the baked model shards. No extra router merge step is required at inference time.
## What It Is Good At
- coding
- agentic coding
- SWE-style assistant behavior
- practical tool-using workflows
Ren-2 is intended to be usable for production-style coding and agentic workflows, including terminal coding agents, SWE assistants, and tool-using automation setups.
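Because the recommended flags enable the OpenAI-style tool-call parser, requests to the served model can carry tool definitions in the standard OpenAI chat-completions format. A hypothetical request body; the `get_file_contents` tool is purely illustrative:

```python
import json

# Hypothetical request body for the OpenAI-compatible
# /v1/chat/completions endpoint with one tool defined.
request_body = {
    "model": "LLMWildling/gpt-oss-140b-ren-2",
    "messages": [
        {"role": "user", "content": "Show me the README of this repo."}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_file_contents",
                "description": "Read a file from the workspace.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string", "description": "File path"}
                    },
                    "required": ["path"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

print(json.dumps(request_body, indent=2))
```

With `--enable-auto-tool-choice` set, the server decides per request whether to emit a tool call or a plain answer.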
## Feedback
Useful feedback includes:
- coding quality
- tool use quality
- long-context behavior
- inference stability
- preferred smaller sizes / VRAM targets
If you want smaller custom models, reach out with your hardware target and the kind of feedback you can provide.
The result can be a different size or architecture, as long as the feedback loop is useful.
## Included Files
- `config.json`
- `generation_config.json`
- `tokenizer.json`
- `tokenizer_config.json`
- `chat_template.jinja`
- `model.safetensors.index.json`
- `model-*.safetensors`
- `README.md`
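`model.safetensors.index.json` maps each tensor name to the shard file that contains it. A quick sanity check over a synthetic index (real tensor names, sizes, and shard counts will differ) could look like:

```python
import json

# Synthetic stand-in for model.safetensors.index.json;
# the real file has many more entries.
index_json = json.dumps({
    "metadata": {"total_size": 123456789},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
})

index = json.loads(index_json)
shards = sorted(set(index["weight_map"].values()))
print(f"{len(index['weight_map'])} tensors across {len(shards)} shards")
```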
## License
The license metadata is currently the placeholder `license: other`; it will be replaced with the actual license once compatibility with the base model and the added weights is confirmed.