# SETA-RL: Qwen3-8B Fine-tuned with Reinforcement Learning
This model is Qwen3-8B fine-tuned with reinforcement learning (GRPO via the AReaL framework) on the SETA training dataset.
It is a checkpoint released alongside the paper *SETA: Scaling Environments for Terminal Agents* (anonymous submission under double-blind review).
## Model Details
| Attribute | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Training method | GRPO (Group Relative Policy Optimization) |
| Training data | SETA-Synth (synthesized terminal-agent tasks) |
| Reward function | `pass_ratio_with_bonus` (+0.5 bonus when all unit tests pass) |
| Context length | 32,768 tokens |
| Thinking | Disabled during training (`/no_think`) |
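The `pass_ratio_with_bonus` reward can be sketched as follows. This is a hypothetical reconstruction from the one-line description in the table above, not the actual AReaL implementation:

```python
def pass_ratio_with_bonus(passed: int, total: int) -> float:
    """Sketch of the reward described above: the fraction of unit
    tests passed, plus a +0.5 bonus when every test passes."""
    if total == 0:
        return 0.0
    ratio = passed / total
    bonus = 0.5 if passed == total else 0.0
    return ratio + bonus
```

Under this reading, a fully passing trajectory scores 1.5 and a half-passing one scores 0.5, so the bonus widens the gap between "almost correct" and "correct" solutions.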
## Intended Use
This model is designed for terminal agent tasks: completing multi-step shell-based tasks inside a Docker container environment using tools such as `shell_exec`, `shell_view`, and `shell_write_content_to_file`.
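For illustration, the three tools could be exposed to the model as OpenAI-style function schemas. Only the tool names come from this card; the descriptions and parameter names (`command`, `path`, `content`) are assumptions, not the SETA framework's actual definitions:

```python
# Illustrative OpenAI-style tool schemas; parameter names are assumptions.
TOOLS = [
    {"type": "function", "function": {
        "name": "shell_exec",
        "description": "Run a shell command in the task container.",
        "parameters": {"type": "object",
                       "properties": {"command": {"type": "string"}},
                       "required": ["command"]}}},
    {"type": "function", "function": {
        "name": "shell_view",
        "description": "View the current shell output.",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {
        "name": "shell_write_content_to_file",
        "description": "Write content to a file inside the container.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "content": {"type": "string"}},
                       "required": ["path", "content"]}}},
]
```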
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
```
For evaluation with the SETA framework, serve the model via SGLang, then run the evaluation script:
```bash
python -m sglang.launch_server --model AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl --port 30000
```

```bash
python scripts/evaluation/eval.py \
  --config scripts/evaluation/configs/eval_default_qwen3_8b.yaml \
  terminal_env.model.model_type=AnonymousSubmissionUnderDouble-BlindRevi/seta-env-rl \
  terminal_env.model.url=http://localhost:30000/v1 \
  dataset=seta-env
```
## Training Configuration
Key hyperparameters (see `scripts/areal/configs/config_train_local_seta_env.yaml` in the companion code repository):
| Hyperparameter | Value |
|---|---|
| Learning rate | 1.70e-5 |
| LR scheduler | constant |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Weight decay | 0.017 |
| ε-clip | 0.4 |
| Reward scaling | 10.0 |
| Reward bias | −0.5 |
| KL coefficient | 0.0 |
| Trajectories per task | 16 |
| Total epochs | 40 |
| GPUs | 8 × (4 rollout + 2 trainer) |
## Limitations
- Evaluated on terminal-agent benchmarks; performance on general language tasks is not characterized.
- The model operates without chain-of-thought reasoning (`/no_think` mode).
## License
Apache 2.0 (inherited from Qwen3-8B base).