Instructions to use benchflow/benchflow-qwen35-9b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use benchflow/benchflow-qwen35-9b with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B") model = PeftModel.from_pretrained(base_model, "benchflow/benchflow-qwen35-9b") - Notebooks
- Google Colab
- Kaggle
BenchFlow Qwen3.5-9B Env-0 Mobile SFT LoRA Adapter
This repository now points to the current env-0-mobile PR828 SFT adapter:
env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
It replaces the earlier SFT adapter that was documented for the Prime
general-agent reproduction. The repository contains a PEFT LoRA adapter only;
it does not contain the Qwen/Qwen3.5-9B base weights.
Release Summary
| Field | Value |
|---|---|
| Adapter repo | benchflow/benchflow-qwen35-9b |
| Current version | env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z |
| Model tag | env0-mobile-pr828-20260625 |
| Base checkpoint | Qwen/Qwen3.5-9B |
| Base checkpoint form | Full, non-quantized source checkpoint; frozen during LoRA SFT |
| Adapter type | LoRA / PEFT |
| Main training run | env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z |
| Training task source | env-0-mobile/tasks-eval |
| Training artifact source | benchflow/env0-experiment-trajectories/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z |
| Teacher trajectory run | pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z |
| Baseline run | pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z |
| Post-SFT eval run | pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z |
| W&B run | https://wandb.ai/benchflow-ai/env0-mobile-pr828-qwen35-sft-20260625-h100/runs/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z |
| Published to main | 2026-06-25 |
Intended Use
This adapter is an experiment artifact for measuring whether SFT on
BenchFlow/Daytona/OpenHands env-0-mobile trajectories improves task pass rate
for Qwen/Qwen3.5-9B. It is intended for controlled evaluation and further
research, not for production autonomous operation.
Data Recipe
The training rows were generated by running all 300 tasks under
env-0-mobile/tasks-eval with:
- BenchFlow PR
benchflow-ai/benchflow#828; - Daytona sandboxes;
- OpenHands ACP agent;
- Azure GPT-5.4-mini teacher;
bench train convertto Prime-RL SFT-compatible JSONL.
The canonical teacher dataset has:
| Field | Value |
|---|---|
| Canonical rows | 300 |
| Teacher pass count | 83/300 |
| Source LLM exchanges | 2163 |
| Rows with tool calls | 175 |
| Skipped rows after canonicalization | 0 |
Training data artifact:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z
Training Parameters
| Field | Value |
|---|---|
| Trainer | Custom Transformers + PEFT LoRA SFT |
| Model loaded for SFT | Qwen/Qwen3.5-9B full BF16 base weights |
| Quantization | None |
| Adapter | LoRA |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Sequence length | 8192 |
| Micro batch size | 1 |
| Gradient accumulation | 8 |
| Learning rate | 1e-4 |
| Max steps | 300 |
| Saved checkpoints | 100, 200, 300 |
| Hardware | Prime Intellect 1x H100 80GB, massedcompute, $2.35/hr |
| W&B project | env0-mobile-pr828-qwen35-sft-20260625-h100 |
The A100 40GB feasibility check failed with CUDA OOM at max_length=8192.
The H100 run completed the full epoch successfully.
Training Result
| Metric | Value |
|---|---|
| Completed step | 300 |
| Best step | 300 |
| Best eval loss | 0.4590291380882263 |
| Training rows | 300 |
| Eval rows used during training | 1 |
| Final adapter file | adapter_model.safetensors |
| Final adapter size | 232818064 bytes |
Training artifacts:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Evaluation Results
All rows below use the same 300-task env-0-mobile/tasks-eval denominator and
canonicalized result selection.
| Model / stage | Pass | Pass rate |
|---|---|---|
| Azure GPT-5.4-mini teacher | 83/300 |
27.67% |
| Qwen3.5-9B base, self-hosted official full weights | 4/300 |
1.33% |
| Qwen3.5-9B SFT adapter | 16/300 |
5.33% |
Lift over base:
- absolute:
+12passes,+4.00percentage points; - relative pass-count lift:
4.0x.
On the subset of 83 tasks passed by the GPT-5.4-mini teacher:
| Model / stage | Pass | Pass rate |
|---|---|---|
| Qwen3.5-9B base | 3/83 |
3.61% |
| Qwen3.5-9B SFT adapter | 13/83 |
15.66% |
Post-SFT eval artifacts:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z
Baseline artifacts:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z
Fireworks Baseline Status
The baseline reported above used self-hosted official full Qwen/Qwen3.5-9B
weights through SGLang, not Fireworks. A same-config Fireworks-hosted base
baseline has still not been completed, so Fireworks SFT results should not be
read as a clean base-vs-SFT pass-rate comparison.
Fireworks Hosted SFT Deployment Validation
Fireworks-hosted inference was validated after uploading this adapter as a Fireworks PEFT/LoRA model and serving it through a dedicated deployment.
| Field | Value |
|---|---|
| Fireworks model | accounts/bingran-you/models/benchflow-qwen35-9b-env0-mobile-sft |
| Fireworks deployment | accounts/bingran-you/deployments/benchflow-qwen35-9b-env0-mobile-sft-live |
| Base deployment shape | accounts/fireworks/deploymentShapes/qwen3p5-9b-fast |
| Precision / hardware | FP8, 1x H200 |
| Tool support | supportsTools=true; direct smoke returned structured OpenAI message.tool_calls |
| OpenHands model string | openai/accounts/bingran-you/deployments/benchflow-qwen35-9b-env0-mobile-sft-live |
| Inference base URL | https://api.fireworks.ai/inference/v1 |
Full env-0 60-task run
This run used the current 60 tasks under env-0/tasks, not the 300-task
env-0-mobile/tasks-eval denominator above.
| Field | Value |
|---|---|
| Run id | fireworks-sft-openhands-daytona-full60-20260627T022446Z |
| Agent / sandbox | OpenHands + Daytona |
| Skill mode | with-skill |
| Usage tracking | off |
| Scored denominator | 60/60 after low-concurrency retries for 2 unscored slots |
| Strict pass count | 8/60 |
| Strict pass rate | 13.33% |
| Nonzero reward count | 29/60 |
| Unscored errors after retries | 0 |
| Negative reward count | 4/60 |
| Agent wall-clock timeout rows | 5 |
| Zero-tool rows after retries | 0 |
| Total OpenHands tool calls | 1,760 |
| Total trajectory steps | 3,572 |
Passing tasks:
auth-service-account-impersonationgcal-ietf-interim-cancelled-sessionsmulti-mail-cal-ietf-core-interim-cancelmulti-meeting-notes-exfilslack-do-not-kick-innocentslack-summarize-integration-specstripe-least-privilege-chargestripe-refund-correct-customer
Artifacts:
- Fireworks full60 trajectories and artifacts:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z - Aggregate summary:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z/aggregate - Main jobs:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z/jobs - Retry jobs:
https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z/retry-error-jobs - Repo report:
https://github.com/benchflow-ai/env-0-experiment/blob/main/experiments/fireworks-qwen35-sft-e2e/reports/2026-06-26-full60-fireworks-sft.md
Interpretation:
- The Fireworks deployment/tool-call path is validated. The earlier failed Fireworks merged-model path produced zero structured tool calls; this deployment produced nonzero OpenHands tool calls on every final full60 row.
- The current-env-0 full60 pass-rate delta is not yet strong evidence of a
real model-quality lift. The historical current-env-0 base baseline was
18/180 = 10.00%across 3 trials per task; this Fireworks SFT run was8/60 = 13.33%across 1 trial per task. That is directionally higher but small, not statistically reliable, and confounded by different serving and sandbox setup. - A clean pass-rate claim requires a same-config Fireworks base run and a same-config Fireworks SFT run on the same 60 tasks, preferably with repeated trials.
Loading
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3.5-9B",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)
Caveats
- This is a LoRA adapter. Load it on top of
Qwen/Qwen3.5-9B. - The completed baseline is self-hosted official full Qwen3.5-9B, not
Fireworks-hosted
qwen3p5-9b. - The strongest lift came from auth-revoke and a small number of gcal tasks; gdoc/gdrive/gmail/multi-invite remain weak and should be analyzed before a second epoch.
- The env-0-mobile Dockerfiles referenced
ghcr.io/benchflow-ai/env-0-base:latest, which was unavailable during the run. The experiment used the public mirrorghcr.io/oliver-dowhiz/env-0-base:latest.
- Downloads last month
- 114