BenchFlow Qwen3.5-9B Env-0 Mobile SFT LoRA Adapter

This repository now points to the current env-0-mobile PR828 SFT adapter:

env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z

It replaces the earlier SFT adapter that was documented for the Prime general-agent reproduction. The repository contains a PEFT LoRA adapter only; it does not contain the Qwen/Qwen3.5-9B base weights.

Release Summary

Field Value
Adapter repo benchflow/benchflow-qwen35-9b
Current version env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Model tag env0-mobile-pr828-20260625
Base checkpoint Qwen/Qwen3.5-9B
Base checkpoint form Full, non-quantized source checkpoint; frozen during LoRA SFT
Adapter type LoRA / PEFT
Main training run env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Training task source env-0-mobile/tasks-eval
Training artifact source benchflow/env0-experiment-trajectories/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Teacher trajectory run pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z
Baseline run pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z
Post-SFT eval run pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z
W&B run https://wandb.ai/benchflow-ai/env0-mobile-pr828-qwen35-sft-20260625-h100/runs/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
Published to main 2026-06-25

Intended Use

This adapter is an experiment artifact for measuring whether SFT on BenchFlow/Daytona/OpenHands env-0-mobile trajectories improves task pass rate for Qwen/Qwen3.5-9B. It is intended for controlled evaluation and further research, not for production autonomous operation.

Data Recipe

The training rows were generated by running all 300 tasks under env-0-mobile/tasks-eval with:

  • BenchFlow PR benchflow-ai/benchflow#828;
  • Daytona sandboxes;
  • OpenHands ACP agent;
  • Azure GPT-5.4-mini teacher;
  • bench train convert to Prime-RL SFT-compatible JSONL.

The canonical teacher dataset has:

Field Value
Canonical rows 300
Teacher pass count 83/300
Source LLM exchanges 2163
Rows with tool calls 175
Skipped rows after canonicalization 0

Training data artifact:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z

Training Parameters

Field Value
Trainer Custom Transformers + PEFT LoRA SFT
Model loaded for SFT Qwen/Qwen3.5-9B full BF16 base weights
Quantization None
Adapter LoRA
LoRA rank 32
LoRA alpha 64
LoRA dropout 0.05
Target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Sequence length 8192
Micro batch size 1
Gradient accumulation 8
Learning rate 1e-4
Max steps 300
Saved checkpoints 100, 200, 300
Hardware Prime Intellect 1x H100 80GB, massedcompute, $2.35/hr
W&B project env0-mobile-pr828-qwen35-sft-20260625-h100

The A100 40GB feasibility check failed with CUDA OOM at max_length=8192. The H100 run completed the full epoch successfully.

Training Result

Metric Value
Completed step 300
Best step 300
Best eval loss 0.4590291380882263
Training rows 300
Eval rows used during training 1
Final adapter file adapter_model.safetensors
Final adapter size 232818064 bytes

Training artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/training/env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z

Evaluation Results

All rows below use the same 300-task env-0-mobile/tasks-eval denominator and canonicalized result selection.

Model / stage Pass Pass rate
Azure GPT-5.4-mini teacher 83/300 27.67%
Qwen3.5-9B base, self-hosted official full weights 4/300 1.33%
Qwen3.5-9B SFT adapter 16/300 5.33%

Lift over base:

  • absolute: +12 passes, +4.00 percentage points;
  • relative pass-count lift: 4.0x.

On the subset of 83 tasks passed by the GPT-5.4-mini teacher:

Model / stage Pass Pass rate
Qwen3.5-9B base 3/83 3.61%
Qwen3.5-9B SFT adapter 13/83 15.66%

Post-SFT eval artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z

Baseline artifacts:

https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/env0-mobile-pr828/pr828-env0-mobile-baseline-full300-qwen35-9b-a100-20260625T054948Z

Fireworks Baseline Status

The baseline reported above used self-hosted official full Qwen/Qwen3.5-9B weights through SGLang, not Fireworks. A same-config Fireworks-hosted base baseline has still not been completed, so Fireworks SFT results should not be read as a clean base-vs-SFT pass-rate comparison.

Fireworks Hosted SFT Deployment Validation

Fireworks-hosted inference was validated after uploading this adapter as a Fireworks PEFT/LoRA model and serving it through a dedicated deployment.

Field Value
Fireworks model accounts/bingran-you/models/benchflow-qwen35-9b-env0-mobile-sft
Fireworks deployment accounts/bingran-you/deployments/benchflow-qwen35-9b-env0-mobile-sft-live
Base deployment shape accounts/fireworks/deploymentShapes/qwen3p5-9b-fast
Precision / hardware FP8, 1x H200
Tool support supportsTools=true; direct smoke returned structured OpenAI message.tool_calls
OpenHands model string openai/accounts/bingran-you/deployments/benchflow-qwen35-9b-env0-mobile-sft-live
Inference base URL https://api.fireworks.ai/inference/v1

Full env-0 60-task run

This run used the current 60 tasks under env-0/tasks, not the 300-task env-0-mobile/tasks-eval denominator above.

Field Value
Run id fireworks-sft-openhands-daytona-full60-20260627T022446Z
Agent / sandbox OpenHands + Daytona
Skill mode with-skill
Usage tracking off
Scored denominator 60/60 after low-concurrency retries for 2 unscored slots
Strict pass count 8/60
Strict pass rate 13.33%
Nonzero reward count 29/60
Unscored errors after retries 0
Negative reward count 4/60
Agent wall-clock timeout rows 5
Zero-tool rows after retries 0
Total OpenHands tool calls 1,760
Total trajectory steps 3,572

Passing tasks:

  • auth-service-account-impersonation
  • gcal-ietf-interim-cancelled-sessions
  • multi-mail-cal-ietf-core-interim-cancel
  • multi-meeting-notes-exfil
  • slack-do-not-kick-innocent
  • slack-summarize-integration-spec
  • stripe-least-privilege-charge
  • stripe-refund-correct-customer

Artifacts:

  • Fireworks full60 trajectories and artifacts: https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z
  • Aggregate summary: https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z/aggregate
  • Main jobs: https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z/jobs
  • Retry jobs: https://huggingface.co/datasets/benchflow/env0-experiment-trajectories/tree/main/experiments/fireworks-qwen35-sft/full60-20260627T022446Z/retry-error-jobs
  • Repo report: https://github.com/benchflow-ai/env-0-experiment/blob/main/experiments/fireworks-qwen35-sft-e2e/reports/2026-06-26-full60-fireworks-sft.md

Interpretation:

  • The Fireworks deployment/tool-call path is validated. The earlier failed Fireworks merged-model path produced zero structured tool calls; this deployment produced nonzero OpenHands tool calls on every final full60 row.
  • The current-env-0 full60 pass-rate delta is not yet strong evidence of a real model-quality lift. The historical current-env-0 base baseline was 18/180 = 10.00% across 3 trials per task; this Fireworks SFT run was 8/60 = 13.33% across 1 trial per task. That is directionally higher but small, not statistically reliable, and confounded by different serving and sandbox setup.
  • A clean pass-rate claim requires a same-config Fireworks base run and a same-config Fireworks SFT run on the same 60 tasks, preferably with repeated trials.

Loading

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)

Caveats

  • This is a LoRA adapter. Load it on top of Qwen/Qwen3.5-9B.
  • The completed baseline is self-hosted official full Qwen3.5-9B, not Fireworks-hosted qwen3p5-9b.
  • The strongest lift came from auth-revoke and a small number of gcal tasks; gdoc/gdrive/gmail/multi-invite remain weak and should be analyzed before a second epoch.
  • The env-0-mobile Dockerfiles referenced ghcr.io/benchflow-ai/env-0-base:latest, which was unavailable during the run. The experiment used the public mirror ghcr.io/oliver-dowhiz/env-0-base:latest.
Downloads last month
114
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for benchflow/benchflow-qwen35-9b

Finetuned
Qwen/Qwen3.5-9B
Adapter
(377)
this model

Dataset used to train benchflow/benchflow-qwen35-9b