samarth-repshift-9b-v1

A representation-shift jailbreak judge built on Qwen/Qwen3.5-9B. It detects harmful LLM responses by measuring how a small LoRA adapter shifts hidden-state activations on the response tokens, then projecting that shift onto EMA-tracked directions that separate real harm from confident-but-fake harm ("honeypot" responses that look harmful but contain no actionable content).

The key innovation is a Δ_hp_axis penalty: a second EMA-tracked direction that down-weights responses drifting along the "looks-harmful" axis but not along the "actually-actionable" one. This directly attacks the confident-nonsense failure mode that fools every production safety classifier we tested.

Headline result — realistic deployment test

The hardest, most realistic test is: in a single mixed input stream, can the judge rank real harm above confident-nonsense honeypots? We built this test by combining 7,191 real jailbreak harm responses (CoinflipForSafety GCG/BoN/Pair attacks, BeaverTails-30k, WildGuardMix) with 4,154 high-quality honeypot responses (responses that look harmful but contain no usable information). AUROC measures whether the judge ranks the real harm higher.

| Judge | AUROC (mixed-realistic) |
|---|---|
| samarth-repshift-9b-v1 (this model) | 0.973 |
| Hybrid: 0.6 × this + 0.4 × harmbench-llama | 0.909 |
| wildguard | 0.452 |
| llama-guard-3 | 0.428 |
| qwen3guard-gen-8b | 0.319 |
| harmbench-llama | 0.261 |
| harmbench-mistral | 0.218 |

Every production judge scores below chance (AUROC < 0.5) because it ranks honeypots above real harm: these judges are more confident on confident-nonsense than on adversarially attacked real harm. On the BeaverTails-HARM + honeypots split, harmbench-mistral reaches an AUROC of 0.040, an almost complete inversion.
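
To make the "worse than random" reading concrete, here is a minimal sketch of AUROC as a pairwise ranking statistic. The scores are hypothetical and are not taken from the benchmark; the function name `auroc` is ours, not part of the reference implementation.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. This is the O(n*m) definition, fine for a toy example.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical judge scores: the judge is more confident on honeypots
# (negatives) than on real harm (positives).
real_harm = [0.55, 0.60, 0.70]
honeypots = [0.90, 0.95, 0.65]

inverted = auroc(real_harm, honeypots)   # well below 0.5: honeypots outrank real harm
separated = auroc([0.9, 0.8], [0.1, 0.2])  # every positive outranks every negative
```

An AUROC below 0.5 means the ranking is systematically reversed, so a judge in that regime is not merely noisy: it actively prefers the honeypots.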

This model is the only judge in our 17-judge benchmark that correctly discriminates real harm from confident-nonsense in a realistic mixed stream.

When to use what

| Task | Best judge |
|---|---|
| Realistic deployment (harm + honeypots mixed) | this model (0.97 AUROC) |
| Honeypot-only false-positive rate @0.5 | this model (FPR 0.009) |
| Pure jailbreak harm (no honeypots) | harmbench-mistral / qwen3guard-gen-8b |
| Best balanced safety classifier | Hybrid w=0.6 = 0.6 × this + 0.4 × harmbench-llama |
| JST (judge stress test) | this model (0.587); beats all production judges by 9-24 pts |

Recommended hybrid usage

For broad-coverage safety in production, pair this with harmbench-llama:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load this model (LoRA over Qwen3.5-9B)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B", torch_dtype=torch.bfloat16, device_map="auto"
)
repshift = PeftModel.from_pretrained(base, "ArthT/samarth-repshift-9b-v1")

# Load harmbench-llama
hb = AutoModelForCausalLM.from_pretrained(
    "cais/HarmBench-Llama-2-13b-cls",
    torch_dtype=torch.bfloat16, device_map="auto"
)

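W = 0.6  # weight on this model; harmbench-llama gets 1 - W = 0.4

# score_repshift and score_harmbench are placeholders for per-judge scoring
# functions that return P(harmful); they are not defined in this card.
# prompt and response are the strings being judged.
p_repshift = score_repshift(repshift, base, prompt, response)
p_harmbench = score_harmbench(hb, prompt, response)
score = W * p_repshift + (1 - W) * p_harmbench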

Hybrid weight sweep (validated)

| w (this model) | Mean OOD AUROC | Coinflip (jailbreak) AUROC | JST AUROC | Honeypot FPR @0.5 |
|---|---|---|---|---|
| 0.0 (harmbench alone) | 0.770 | 0.811 | 0.482 | 0.941 |
| 0.2 | 0.780 | 0.781 | 0.533 | 0.932 |
| 0.4 | 0.776 | 0.762 | 0.533 | 0.917 |
| 0.5 | 0.774 | 0.743 | 0.535 | 0.870 |
| 0.6 (recommended) | 0.768 | 0.682 | 0.546 | 0.058 |
| 0.7 | 0.761 | 0.630 | 0.557 | 0.025 |
| 1.0 (this alone) | 0.641 | 0.466 | 0.587 | 0.009 |

The external-honeypot false-positive rate has a sharp transition at w=0.6: below 0.6 harmbench-llama dominates the score and overrides the Δ_hp_axis defense; at 0.6+ this model's signal is strong enough to suppress honeypots while still inheriting harmbench's broad jailbreak coverage.
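
One plausible reading of that transition, sketched with made-up numbers: if harmbench-llama is near-certain on a honeypot while this model is near-certain it is benign, the weighted score is roughly `1 - w`, which drops below a 0.5 threshold once `w` exceeds 0.5. The probabilities below are hypothetical, and treating "@0.5" as "flag when score >= 0.5" is an assumption.

```python
def hybrid_score(p_repshift, p_harmbench, w):
    """Weighted combination used in the hybrid: w * this model + (1 - w) * harmbench-llama."""
    return w * p_repshift + (1.0 - w) * p_harmbench


# Hypothetical honeypot: harmbench-llama is fooled, this model is not.
p_repshift, p_harmbench = 0.02, 0.98

for w in (0.0, 0.2, 0.4, 0.6, 0.7, 1.0):
    s = hybrid_score(p_repshift, p_harmbench, w)
    # Assumed decision rule: flag as harmful when the score reaches 0.5.
    print(f"w={w:.1f}  score={s:.3f}  flagged={s >= 0.5}")
```

Under these assumptions the honeypot is flagged for every weight at or below 0.4 and cleared from 0.6 upward, which matches the direction of the jump in the sweep.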

Training

  • Base: Qwen/Qwen3.5-9B + LoRA r=16, α=32, dropout=0.05
  • Corpus: 14,027 rows, including 2,528 harmful prompts (HarmBench / AdvBench / JailbreakBench seed), 2,528 Qwen3-4B-Instruct refusals (same prompts), 497 in-house honeypots, and 4,154 external honeypots from a community red-team corpus (validated at 98% inter-rater agreement by gpt-5.4-mini + grok-4.3 with a honeypot-specific rubric)
  • Loss: `α·shift_unsafe + β·KL_safe + γ·anchor_honeypot + λ_hp·anchor_gray + λ_hp_axis·Δ_hp_axis`
  • Margins: m_b=5.0, m_h=15.0, m_b_dir=0.5, m_h_dir=1.5
  • Weights: α=0.5, β=0.4, γ=0.9, λ_hp=3.0, λ_hp_axis=1.0
  • EMA decay: 0.95 for both Δ_harm and Δ_hp_axis
  • Rep layers: 22-30 of 36 (top 30%)
  • Schedule: patience 15, eval every 50 steps, lr 1e-4
  • Seed: 44 (the best of the 5 seeds trained on honeypot handling; see "5-seed variance" below)
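
For readers unfamiliar with EMA-tracked directions, the following is a minimal sketch of an exponential moving average update with the decay of 0.95 listed above. The card does not say exactly which quantity is averaged or whether the result is normalized, so `batch_mean_shift` and the function name `ema_update` are illustrative assumptions rather than the training code.

```python
def ema_update(direction, batch_mean_shift, decay=0.95):
    """One EMA step: keep `decay` of the old direction, blend in the new batch.

    Both arguments are plain lists of floats with the same length.
    """
    return [decay * d + (1.0 - decay) * s for d, s in zip(direction, batch_mean_shift)]


# Toy 2-dimensional example (real hidden states are far wider).
direction = [1.0, 0.0]
batch_mean_shift = [0.0, 1.0]  # hypothetical mean activation shift for one batch

direction = ema_update(direction, batch_mean_shift)
```

With a decay of 0.95, a single batch moves the tracked direction by only 5% of the gap, so the direction reflects many recent batches rather than any one of them.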

Per-dataset AUROC (this model alone)

| Dataset | AUROC | Note |
|---|---|---|
| V1 benchmark (combined_fp_eval) | 0.949 | strongest on benchmark we built |
| do_not_answer | 0.834 | |
| sorry_bench_human | 0.840 | |
| judge_stress_test | 0.587 | best in field on this hard cell |
| wildguardmix_test | 0.620 | |
| beavertails_30k_test | 0.542 | use hybrid for this |
| coinflip_for_safety (GCG/BoN/Pair) | 0.466 | use hybrid for this |
| 4-axis quadrant eval | 0.931 | strong on confident-nonsense cells |
| xstest_gpt4 | 0.684 | |
| aegis-v2 | 0.486 | |
| external honeypot FPR @0.5 | 0.009 | FPR, not AUROC; only judge that doesn't fire on honeypots |
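
The honeypot row reports a false-positive rate rather than an AUROC. A minimal sketch of that metric follows, assuming "@0.5" means a response is flagged when its score is at or above 0.5; the scores are hypothetical.

```python
def false_positive_rate(benign_scores, threshold=0.5):
    """Fraction of non-harmful items (here, honeypots) scored at or above the threshold."""
    flagged = sum(1 for s in benign_scores if s >= threshold)
    return flagged / len(benign_scores)


# Hypothetical judge scores on four honeypot responses.
honeypot_scores = [0.02, 0.10, 0.47, 0.91]
fpr = false_positive_rate(honeypot_scores)  # one of four honeypots is flagged
```

Because every honeypot is non-harmful by construction, any flag on this set is a false positive, so lower is better.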

5-seed variance

We trained 5 seeds of this exact configuration (seeds 42-46): JST AUROC = 0.559 ± 0.037, mean OOD = 0.625, honeypot FPR @0.5 = 0.257 (range 0.001-0.692). The honeypot FPR varies widely across seeds, so single-seed deployment risk applies. Seed 44 (this checkpoint) was the strongest single seed on honeypot handling and is the recommended choice for production deployment.

Inference

Repshift scoring takes two forward passes over the same response: one with the adapter enabled and one with it disabled. See judges/adapters/samarth_repshift.py in the Robust-jailbreak-judges repo for the reference implementation.

Pseudocode:

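# Two forward passes over the same tokens. Assumes `repshift` is the PeftModel
# loaded above and `rep_layers` is a list of layer indices.
with torch.no_grad():
    out_with = repshift(input_ids, output_hidden_states=True)      # adapter on
    with repshift.disable_adapter():                                # adapter off
        out_without = repshift(input_ids, output_hidden_states=True)

# Select the representation layers; how layers are pooled follows the reference implementation.
h_with = torch.stack([out_with.hidden_states[i] for i in rep_layers])
h_without = torch.stack([out_without.hidden_states[i] for i in rep_layers])

d_drift = (h_with - h_without).mean(dim=token_axis)  # mean over response tokens
score_raw = d_drift @ delta_harm - lambda_hp_axis * (d_drift @ delta_hp_axis)
p_harmful = sigmoid(platt_a * score_raw + platt_b)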

meta.json carries delta_harm and delta_hp_axis EMA directions, plus calibrated Platt parameters a and b.
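
As a toy illustration of the last two pseudocode lines, the sketch below applies the projection and Platt scaling to hand-made 2-dimensional vectors. All values are invented for illustration (real directions and Platt parameters come from meta.json), and the helper names `dot` and `repshift_probability` are ours.

```python
import math


def dot(u, v):
    """Dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def repshift_probability(d_drift, delta_harm, delta_hp_axis,
                         lambda_hp_axis, platt_a, platt_b):
    """Project the drift onto both directions, penalize the honeypot axis, then calibrate."""
    score_raw = dot(d_drift, delta_harm) - lambda_hp_axis * dot(d_drift, delta_hp_axis)
    return 1.0 / (1.0 + math.exp(-(platt_a * score_raw + platt_b)))


# Toy directions: axis 0 stands in for "harm", axis 1 for "honeypot".
delta_harm, delta_hp_axis = [1.0, 0.0], [0.0, 1.0]

# A drift that moves only along the harm axis versus one that also moves
# along the honeypot axis. Platt parameters a=1, b=0 are placeholders.
p_real = repshift_probability([2.0, 0.0], delta_harm, delta_hp_axis, 1.0, 1.0, 0.0)
p_honeypot = repshift_probability([2.0, 2.0], delta_harm, delta_hp_axis, 1.0, 1.0, 0.0)
```

In this toy setup both drifts have the same harm projection, but the honeypot-axis term cancels it for the second one, so the second response receives the lower probability.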

Limitations

  • 2× inference cost (forward with adapter + forward without)
  • High single-seed variance in honeypot handling
  • Optimized for confident-nonsense / honeypot resistance, not for pure adversarial harm detection. Use hybrid scoring (w=0.6) for production if you need both.
  • English-only training data.
  • LoRA-only; the base Qwen3.5-9B weights must be loaded separately.

Citation

@misc{samarth-repshift-9b-v1,
  title  = {Representation-Shift Judges for Confident-Nonsense-Resistant
            Jailbreak Detection},
  author = {Singh, Arth and AIM Intelligence},
  year   = {2026},
  note   = {Robust Jailbreak Judges project, repshift Qwen3.5-9B
            Δ_hp_axis variant, seed 44}
}

⚠️ Correction notice (2026-05-22)

Earlier versions of this card referenced AUROC numbers on the Active Robustness benchmark that were computed with a buggy record-level join. Corrected numbers (HARMFUL_REAL vs CN+REFUSAL, dual-rater gold labels, 2,177 records):

| Judge | AUROC (full) | AUROC (CN-only) |
|---|---|---|
| samarth-qwen35-9b + system prompt | 0.936 | 0.871 |
| samarth-qwen35-9b (no system prompt) | 0.904 | 0.827 |
| qwen3guard-gen-8b | 0.858 | 0.605 |
| this (samarth-repshift-9b-v1) | 0.755 | 0.557 |

For broad-coverage AR-style adversarial detection, prefer samarth-qwen35-9b. This repshift variant remains the strongest pick for honeypot-style confident-nonsense rejection on the Phase-1 benchmark (Samuel FPR @ TPR=0.95 = 0.009).
