BenchFlow Qwen3.5-9B Env-0 Mobile SFT LoRA Adapter

This repository now points to the current env-0-mobile PR828 SFT adapter:

env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z

It replaces the earlier SFT adapter that was documented for the Prime general-agent reproduction. The repository contains a PEFT LoRA adapter only; it does not contain the Qwen/Qwen3.5-9B base weights.

Release Summary

Field	Value
Adapter repo	`benchflow/benchflow-qwen35-9b`
Current version	`env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z`
Model tag	`env0-mobile-pr828-20260625`
Base checkpoint	`Qwen/Qwen3.5-9B`
Base checkpoint form	Full, non-quantized source checkpoint; frozen during LoRA SFT
Adapter type	LoRA / PEFT
Main training run	`env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z`
Training task source	`env-0-mobile/tasks-eval`
Training artifact source	`benchflow/env0-experiment-trajectories/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z`
Teacher trajectory run	`pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z`
Baseline run	`pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z`
Post-SFT eval run	`pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z`
W&B run	`https://wandb.ai/benchflow-ai/env0-mobile-pr828-qwen35-sft-20260625-h100/runs/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z`
Published to main	`2026-06-25`

Intended Use

This adapter is an experiment artifact for measuring whether SFT on BenchFlow/Daytona/OpenHands env-0-mobile trajectories improves task pass rate for Qwen/Qwen3.5-9B. It is intended for controlled evaluation and further research, not for production autonomous operation.

Data Recipe

The training rows were generated by running all 300 tasks under env-0-mobile/tasks-eval with:

BenchFlow PR benchflow-ai/benchflow#828;
Daytona sandboxes;
OpenHands ACP agent;
Azure GPT-5.4-mini teacher;
bench train convert to Prime-RL SFT-compatible JSONL.

The canonical teacher dataset has:

Field	Value
Canonical rows	`300`
Teacher pass count	`83/300`
Source LLM exchanges	`2163`
Rows with tool calls	`175`
Skipped rows after canonicalization	`0`

Training data artifact:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z

Training Parameters

Field	Value
Trainer	Custom Transformers + PEFT LoRA SFT
Model loaded for SFT	`Qwen/Qwen3.5-9B` full BF16 base weights
Quantization	None
Adapter	LoRA
LoRA rank	`32`
LoRA alpha	`64`
LoRA dropout	`0.05`
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
Sequence length	`8192`
Micro batch size	`1`
Gradient accumulation	`8`
Learning rate	`1e-4`
Max steps	`300`
Saved checkpoints	`100`, `200`, `300`
Hardware	Prime Intellect 1x H100 80GB, `massedcompute`, `$2.35/hr`
W&B project	`env0-mobile-pr828-qwen35-sft-20260625-h100`

The A100 40GB feasibility check failed with CUDA OOM at max_length=8192. The H100 run completed the full epoch successfully.

Training Result

Metric	Value
Completed step	`300`
Best step	`300`
Best eval loss	`0.4590291380882263`
Training rows	`300`
Eval rows used during training	`1`
Final adapter file	`adapter_model.safetensors`
Final adapter size	`232818064` bytes

Training artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z

Evaluation Results

All rows below use the same 300-task env-0-mobile/tasks-eval denominator and canonicalized result selection.

Model / stage	Pass	Pass rate
Azure GPT-5.4-mini teacher	`83/300`	`27.67%`
Qwen3.5-9B base, self-hosted official full weights	`4/300`	`1.33%`
Qwen3.5-9B SFT adapter	`16/300`	`5.33%`

Lift over base:

absolute: +12 passes, +4.00 percentage points;
relative pass-count lift: 4.0x.

On the subset of 83 tasks passed by the GPT-5.4-mini teacher:

Model / stage	Pass	Pass rate
Qwen3.5-9B base	`3/83`	`3.61%`
Qwen3.5-9B SFT adapter	`13/83`	`15.66%`

Post-SFT eval artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z

Baseline artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z

Fireworks Baseline Status

The baseline reported above used self-hosted official full Qwen/Qwen3.5-9B weights through SGLang, not Fireworks. A same-config Fireworks-hosted base baseline has still not been completed, so Fireworks SFT results should not be read as a clean base-vs-SFT pass-rate comparison.

Fireworks Hosted SFT Deployment Validation

Fireworks-hosted inference was validated after uploading this adapter as a Fireworks PEFT/LoRA model and serving it through a dedicated deployment.

Field	Value
Fireworks model	`accounts/bingran-you/models/benchflow-qwen35-9b-env0-mobile-sft`
Fireworks deployment	`accounts/bingran-you/deployments/benchflow-qwen35-9b-env0-mobile-sft-live`
Base deployment shape	`accounts/fireworks/deploymentShapes/qwen3p5-9b-fast`
Precision / hardware	FP8, 1x H200
Tool support	`supportsTools=true`; direct smoke returned structured OpenAI `message.tool_calls`
OpenHands model string	`openai/accounts/bingran-you/deployments/benchflow-qwen35-9b-env0-mobile-sft-live`
Inference base URL	`https://api.fireworks.ai/inference/v1`

Full env-0 60-task run

This run used the current 60 tasks under env-0/tasks, not the 300-task env-0-mobile/tasks-eval denominator above.

Field	Value
Run id	`fireworks-sft-openhands-daytona-full60-20260627T022446Z`
Agent / sandbox	OpenHands + Daytona
Skill mode	`with-skill`
Usage tracking	`off`
Scored denominator	`60/60` after low-concurrency retries for 2 unscored slots
Strict pass count	`8/60`
Strict pass rate	`13.33%`
Nonzero reward count	`29/60`
Unscored errors after retries	`0`
Negative reward count	`4/60`
Agent wall-clock timeout rows	`5`
Zero-tool rows after retries	`0`
Total OpenHands tool calls	`1,760`
Total trajectory steps	`3,572`

Passing tasks:

auth-service-account-impersonation
gcal-ietf-interim-cancelled-sessions
multi-mail-cal-ietf-core-interim-cancel
multi-meeting-notes-exfil
slack-do-not-kick-innocent
slack-summarize-integration-spec
stripe-least-privilege-charge
stripe-refund-correct-customer

Artifacts:

Fireworks full60 trajectories and artifacts: https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z
Aggregate summary: https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z/aggregate
Main jobs: https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z/jobs
Retry jobs: https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z/retry-error-jobs
Repo report: https://github.com/benchflow-ai/env-0-experiment/blob/main/experiments/fireworks-qwen35-sft-e2e/reports/2026-06-26-full60-fireworks-sft.md

Interpretation:

The Fireworks deployment/tool-call path is validated. The earlier failed Fireworks merged-model path produced zero structured tool calls; this deployment produced nonzero OpenHands tool calls on every final full60 row.
The current-env-0 full60 pass-rate delta is not yet strong evidence of a real model-quality lift. The historical current-env-0 base baseline was 18/180 = 10.00% across 3 trials per task; this Fireworks SFT run was 8/60 = 13.33% across 1 trial per task. That is directionally higher but small, not statistically reliable, and confounded by different serving and sandbox setup.
A clean pass-rate claim requires a same-config Fireworks base run and a same-config Fireworks SFT run on the same 60 tasks, preferably with repeated trials.

Loading

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)

Caveats

This is a LoRA adapter. Load it on top of Qwen/Qwen3.5-9B.
The completed baseline is self-hosted official full Qwen3.5-9B, not Fireworks-hosted qwen3p5-9b.
The strongest lift came from auth-revoke and a small number of gcal tasks; gdoc/gdrive/gmail/multi-invite remain weak and should be analyzed before a second epoch.
The env-0-mobile Dockerfiles referenced ghcr.io/benchflow-ai/env-0-base:latest, which was unavailable during the run. The experiment used the public mirror ghcr.io/oliver-dowhiz/env-0-base:latest.

Downloads last month: 114

Model tree for benchflow/benchflow-qwen35-9b

Base model

Qwen/Qwen3.5-9B-Base

Finetuned

Qwen/Qwen3.5-9B

Adapter

(377)

this model

benchflow
/

benchflow-qwen35-9b