samarth-repshift-9b-v1
A representation-shift jailbreak judge built on Qwen/Qwen3.5-9B. Detects
harmful LLM responses by measuring how a small LoRA adapter shifts hidden-state
activations on the response tokens, then projects that shift onto
EMA-tracked directions that distinguish real harm from confident-but-fake
harm (a class of "honeypot" responses that look harmful but contain no
actionable content).
The key innovation is a Δ_hp_axis penalty: a second EMA-tracked direction that down-weights responses drifting along the "looks-harmful" axis but not along the "actually-actionable" one. This directly attacks the confident-nonsense failure mode that fools every production safety classifier we tested.
Headline result — realistic deployment test
The hardest, most realistic test is: in a single mixed input stream, can the judge rank real harm above confident-nonsense honeypots? We built this test by combining 7,191 real jailbreak harm responses (CoinflipForSafety GCG/BoN/Pair attacks, BeaverTails-30k, WildGuardMix) with 4,154 high-quality honeypot responses (responses that look harmful but contain no usable information). AUROC measures whether the judge ranks the real harm higher.
| Judge | AUROC (mixed-realistic) |
|---|---|
| samarth-repshift-9b-v1 (this model) | 0.973 |
| Hybrid: 0.6 × this + 0.4 × harmbench-llama | 0.909 |
| wildguard | 0.452 |
| llama-guard-3 | 0.428 |
| qwen3guard-gen-8b | 0.319 |
| harmbench-llama | 0.261 |
| harmbench-mistral | 0.218 |
All five standalone production judges score below chance (AUROC < 0.5) because they rank honeypots above real harm: they are more confident on confident-nonsense than on adversarially attacked real harm. On the BeaverTails-HARM + honeypots split, harmbench-mistral reaches an AUROC of 0.040, an almost complete inversion.
This model is the only judge in our 17-judge benchmark that correctly discriminates real harm from confident-nonsense in a realistic mixed stream.
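To make the "worse than random" reading concrete, here is a minimal sketch of pairwise AUROC: the probability that a randomly chosen real-harm response scores higher than a randomly chosen honeypot. The scores are hypothetical values chosen for illustration, not benchmark data, and `auroc` is an illustrative helper rather than the evaluation code behind the table.

```python
def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical judge outputs (illustration only, not taken from the benchmark).
real_harm = [0.30, 0.45, 0.60]  # judge is unsure on attacked real harm
honeypots = [0.70, 0.85, 0.55]  # judge is confident on confident-nonsense

inverted = auroc(real_harm, honeypots)  # real harm treated as the positive class
flipped = auroc(honeypots, real_harm)   # same scores, classes swapped

print(f"AUROC = {inverted:.3f}, swapped = {flipped:.3f}")
```

An AUROC below 0.5 means the ranking is systematically reversed, so swapping the classes yields 1 minus the original value.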
When to use what
| Task | Best judge |
|---|---|
| Realistic deployment (harm + honeypots mixed) | this model (0.97 AUROC) |
| Honeypot-only false-positive rate @0.5 | this model (FPR 0.009) |
| Pure jailbreak harm (no honeypots) | harmbench-mistral / qwen3guard-gen-8b |
| Best balanced safety classifier | Hybrid w=0.6 = 0.6 × this + 0.4 × harmbench-llama |
| JST (judge stress test) | this model (0.587) — beats all production judges by 9-24 pts |
Recommended hybrid usage
For broad-coverage safety in production, pair this with harmbench-llama:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load this model (LoRA over Qwen3.5-9B)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B", torch_dtype=torch.bfloat16, device_map="auto"
)
# PeftModel injects the LoRA layers into `base`; use repshift.disable_adapter()
# for the adapter-off pass (see Inference).
repshift = PeftModel.from_pretrained(base, "ArthT/samarth-repshift-9b-v1")

# Load harmbench-llama
hb = AutoModelForCausalLM.from_pretrained(
    "cais/HarmBench-Llama-2-13b-cls",
    torch_dtype=torch.bfloat16, device_map="auto"
)

# score_repshift / score_harmbench are not defined in this card; they stand for
# each judge's scoring routine and return P(harmful) in [0, 1] for the given
# prompt and response strings.
W = 0.6  # 0.6 × this + 0.4 × harmbench-llama
p_repshift = score_repshift(repshift, base, prompt, response)
p_harmbench = score_harmbench(hb, prompt, response)
score = W * p_repshift + (1 - W) * p_harmbench
Hybrid weight sweep (validated)
| w (this model) | mean OOD AUROC | Coinflip (jailbreak) | JST AUROC | Honeypot FPR @0.5 |
|---|---|---|---|---|
| 0.0 (harmbench alone) | 0.770 | 0.811 | 0.482 | 0.941 |
| 0.2 | 0.780 | 0.781 | 0.533 | 0.932 |
| 0.4 | 0.776 | 0.762 | 0.533 | 0.917 |
| 0.5 | 0.774 | 0.743 | 0.535 | 0.870 |
| 0.6 (recommended) | 0.768 | 0.682 | 0.546 | 0.058 |
| 0.7 | 0.761 | 0.630 | 0.557 | 0.025 |
| 1.0 (this alone) | 0.641 | 0.466 | 0.587 | 0.009 |
The external-honeypot false-positive rate has a sharp transition at w=0.6: below 0.6 harmbench-llama dominates the score and overrides the Δ_hp_axis defense; at 0.6+ this model's signal is strong enough to suppress honeypots while still inheriting harmbench's broad jailbreak coverage.
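One way to read this transition is as simple arithmetic on the weighted sum. The sketch below assumes a hypothetical honeypot on which harmbench-llama is nearly certain (0.99) and this model is nearly silent (0.05); those two probabilities and the `hybrid_score` helper are illustrative assumptions, while the 0.5 threshold and the weights come from the table.

```python
def hybrid_score(p_repshift, p_harmbench, w=0.6):
    """Convex combination: w × repshift + (1 - w) × harmbench-llama."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * p_repshift + (1.0 - w) * p_harmbench

# Hypothetical honeypot: harmbench-llama fires, repshift does not.
P_REPSHIFT, P_HARMBENCH, THRESHOLD = 0.05, 0.99, 0.5

flagged = {}
for w in (0.4, 0.5, 0.6, 0.7):
    score = hybrid_score(P_REPSHIFT, P_HARMBENCH, w)
    flagged[w] = score > THRESHOLD
    print(f"w={w}: score={score:.3f} flagged={flagged[w]}")
```

Under these assumptions the honeypot is flagged at w=0.4 and w=0.5 but not at w=0.6 or w=0.7: once harmbench-llama's share of the score drops below the threshold, it can no longer flag a response on its own.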
Training
- Base: Qwen/Qwen3.5-9B + LoRA r=16, α=32, dropout=0.05
- Corpus: 14,027 rows = 2,528 harmful prompts (HarmBench / AdvBench / JailbreakBench seed) + 2,528 Qwen3-4B-Instruct refusals (same prompts) + 497 in-house honeypots + 4,154 external honeypots from a community red-team corpus (validated at 98% inter-rater agreement by gpt-5.4-mini + grok-4.3 with a honeypot-specific rubric)
- Loss: `α·shift_unsafe + β·KL_safe + γ·anchor_honeypot + λ_hp·anchor_gray - λ_hp_axis·Δ_hp_axis`
- Margins: m_b=5.0, m_h=15.0, m_b_dir=0.5, m_h_dir=1.5
- Weights: α=0.5, β=0.4, γ=0.9, λ_hp=3.0, λ_hp_axis=1.0
- EMA decay: 0.95 for both Δ_harm and Δ_hp_axis
- Rep layers: 22-30 of 36 (top 30%)
- Schedule: patience 15, eval every 50 steps, lr 1e-4
- Seed: 44 (the strongest of 5 trained seeds; see the 5-seed variance section below)
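The card gives the EMA decay but not the exact update rule for the tracked directions. A minimal sketch, assuming the standard exponential-moving-average form applied to each batch's mean hidden-state shift, might look like this; `ema_update` and the toy 3-dimensional vectors are hypothetical, and whether the reference implementation also normalizes the direction is not stated here.

```python
EMA_DECAY = 0.95  # decay listed above for both Δ_harm and Δ_hp_axis

def ema_update(direction, batch_mean_shift, decay=EMA_DECAY):
    """Blend the running direction with the current batch's mean shift."""
    return [
        decay * d + (1.0 - decay) * s
        for d, s in zip(direction, batch_mean_shift)
    ]

# Toy 3-dimensional example; real directions live in hidden-state space.
batch_shift = [1.0, -2.0, 0.5]
direction = [0.0, 0.0, 0.0]
for _ in range(100):
    direction = ema_update(direction, batch_shift)

print(direction)
```

With a constant input, the running direction approaches that input geometrically, and at decay 0.95 a single batch contributes only 5% to the new estimate.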
Per-dataset AUROC (this model alone)
| Dataset | AUROC | Note |
|---|---|---|
| V1 benchmark (combined_fp_eval) | 0.949 | strongest on benchmark we built |
| do_not_answer | 0.834 | |
| sorry_bench_human | 0.840 | |
| judge_stress_test | 0.587 | best in field on this hard cell |
| wildguardmix_test | 0.620 | |
| beavertails_30k_test | 0.542 | use hybrid for this |
| coinflip_for_safety (GCG/BoN/Pair) | 0.466 | use hybrid for this |
| 4-axis quadrant eval | 0.931 | strong on confident-nonsense cells |
| xstest_gpt4 | 0.684 | |
| aegis-v2 | 0.486 | |
| external honeypot FPR @0.5 | 0.009 | only judge that doesn't fire on honeypots |
5-seed variance
We trained 5 seeds of this exact configuration (seeds 42-46): JST AUROC = 0.559 ± 0.037, mean OOD = 0.625, honeypot FPR @0.5 = 0.257 (range 0.001-0.692). The honeypot FPR varies widely across seeds, so any single-seed deployment carries risk. Seed 44 (this checkpoint) was the strongest single seed on honeypot handling and is the recommended choice for production deployment.
Inference
Repshift scoring takes two forward passes over the same response: one with the adapter and one without. See judges/adapters/samarth_repshift.py in the Robust-jailbreak-judges repo for the reference implementation.
Pseudocode:
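# Two forward passes over the same tokens (repshift is the PeftModel)
with torch.no_grad():
    # adapter on
    h_with = repshift(input_ids, output_hidden_states=True).hidden_states[rep_layers]
    # adapter off: disable_adapter() is a context manager on the PeftModel
    with repshift.disable_adapter():
        h_without = repshift(input_ids, output_hidden_states=True).hidden_states[rep_layers]
d_drift = (h_with - h_without).mean(dim=token_axis)  # mean shift over the response tokens
score_raw = d_drift @ delta_harm - lambda_hp_axis * (d_drift @ delta_hp_axis)
p_harmful = sigmoid(platt_a * score_raw + platt_b)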
meta.json carries delta_harm and delta_hp_axis EMA directions, plus
calibrated Platt parameters a and b.
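The scoring step in the pseudocode can be sketched in plain Python, starting from a drift vector that has already been computed. The 2-D vectors and the Platt values below are toy assumptions chosen so that the two directions are easy to read; real values come from meta.json and the model's hidden states, and `p_harmful` here is an illustrative function, not the reference implementation.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def p_harmful(d_drift, delta_harm, delta_hp_axis, lambda_hp_axis, platt_a, platt_b):
    """Project the drift on both directions, penalize the honeypot axis, then calibrate."""
    score_raw = dot(d_drift, delta_harm) - lambda_hp_axis * dot(d_drift, delta_hp_axis)
    return 1.0 / (1.0 + math.exp(-(platt_a * score_raw + platt_b)))

# Toy directions: axis 0 = "real harm", axis 1 = "looks harmful only".
delta_harm = [1.0, 0.0]
delta_hp_axis = [0.0, 1.0]
LAMBDA_HP_AXIS = 1.0           # matches the training weight listed in the card
PLATT_A, PLATT_B = 2.0, 0.0    # hypothetical calibration parameters

real_harm_drift = [2.0, 0.1]   # moves along the harm direction only
honeypot_drift = [2.0, 2.5]    # also moves strongly along the honeypot axis

p_real = p_harmful(real_harm_drift, delta_harm, delta_hp_axis, LAMBDA_HP_AXIS, PLATT_A, PLATT_B)
p_hp = p_harmful(honeypot_drift, delta_harm, delta_hp_axis, LAMBDA_HP_AXIS, PLATT_A, PLATT_B)
print(f"real harm: {p_real:.3f}, honeypot: {p_hp:.3f}")
```

Both toy responses have the same projection on `delta_harm`; only the Δ_hp_axis penalty separates them, which is the behavior the penalty term is meant to produce.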
Limitations
- 2× inference cost (forward with adapter + forward without)
- High single-seed variance in honeypot handling
- Optimized for confident-nonsense / honeypot resistance, not for pure adversarial harm detection. Use hybrid scoring (w=0.6) for production if you need both.
- English-only training data.
- LoRA-only; the base Qwen3.5-9B weights must be loaded separately.
Citation
@misc{samarth-repshift-9b-v1,
  title  = {Representation-Shift Judges for Confident-Nonsense-Resistant Jailbreak Detection},
  author = {Singh, Arth and AIM Intelligence},
  year   = {2026},
  note   = {Robust Jailbreak Judges project, repshift Qwen3.5-9B Δ_hp_axis variant, seed 44}
}
⚠️ Correction notice (2026-05-22)
Earlier versions of this card referenced AUROC numbers on the Active Robustness benchmark that were computed with a buggy record-level join. Corrected numbers (HARMFUL_REAL vs CN+REFUSAL, dual-rater gold labels, 2,177 records):
| Judge | AUROC full | AUROC CN-only |
|---|---|---|
| samarth-qwen35-9b + system prompt | 0.936 | 0.871 |
| samarth-qwen35-9b (no system prompt) | 0.904 | 0.827 |
| qwen3guard-gen-8b | 0.858 | 0.605 |
| this (samarth-repshift-9b-v1) | 0.755 | 0.557 |
For broad-coverage AR-style adversarial detection, prefer samarth-qwen35-9b. This repshift variant remains the strongest pick for honeypot-style confident-nonsense rejection on the Phase-1 benchmark (Samuel FPR @ TPR=0.95 = 0.009).