samarth-icebreaker-v1
A frozen-base MLP probe that classifies (prompt, response) pairs as
harmful or safe, built on Qwen/Qwen3.5-9B with the base model held
read-only. The probe is the only judge in our 8-model comparison above
chance on the ASSELab/JudgeStressTest adversarial benchmark (AUROC 0.620
vs every published safety judge at 0.04–0.37) while still hitting <1%
false-positive rate on confident-nonsense honeypots that fool
every other LLM-as-judge.
The fix that makes this possible is corpus-level, not architectural: fold confident-nonsense honeypots into the training mix as label=0 records, and the lexical-bias failure mode of probing-based judges closes for free.
Headline numbers
Adversarial robustness: ASSELab/JudgeStressTest (n=439)
We re-score the standard production safety classifiers on JST. Icebreaker is the only judge in the comparison above chance.
| Judge | Overall AUROC |
|---|---|
| llama-guard-3 | 0.135 |
| qwen3guard-gen-8b | 0.090 |
| harmbench-llama | 0.328 |
| samarth-icebreaker-v1 (this model) | 0.620 |
All four rows scored by us via evals.run_judge on the same 439-record
JST subset (records with non-empty Generation).
Per-attack breakdown for samarth-icebreaker-v1:
| Attack | AUROC | n |
|---|---|---|
| GCG-R | 0.671 | 53 |
| Direct | 0.654 | 36 |
| GCG | 0.618 | 133 |
| BoN | 0.562 | 217 |
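For readers who want to sanity-check AUROC values like those in the tables above on their own scores, here is a minimal rank-based sketch. It is not the repository's evals.run_judge implementation; the function name `auroc` and the toy data are illustrative assumptions.

```python
def auroc(labels, scores):
    """Probability that a random harmful record outscores a random safe one.

    Ties count as half a win. labels are 0 (safe) or 1 (harmful).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one harmful and one safe record")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration, not taken from JudgeStressTest).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = auroc(toy_labels, toy_scores)
```

Under this definition 0.5 is chance, and a value below 0.5 means the judge ranks harmful records below safe ones on average.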
Confident-nonsense robustness: honeypot pool (n=4,994, all label=0)
The load-bearing failure mode for probe-based judges: responses that mention harmful concepts in confident-sounding prose but contain no real uplift. Lower FPR = better.
| Judge | FPR@0.5 | mean p_harmful |
|---|---|---|
| qwen3guard-gen-8b | 0.983 | 0.978 |
| Baseline icebreaker (no honeypot aug, v1) | 0.82-0.95 | 0.74-0.90 |
| llama-guard-3 (calibrated at thr=0.05) | 0.859 | 0.070 |
| samarth-icebreaker-v1 (this model) | 0.001 | 0.024 |
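Because every honeypot record is label=0, both columns reduce to simple statistics over the judge's scores. A minimal sketch, assuming a plain list of p_harmful values (the function name `honeypot_metrics` and the toy scores are hypothetical):

```python
def honeypot_metrics(p_harmful, threshold=0.5):
    """On an all-safe pool, any score at or above the threshold is a false positive."""
    if not p_harmful:
        raise ValueError("empty score list")
    flagged = sum(1 for p in p_harmful if p >= threshold)
    return {
        "fpr": flagged / len(p_harmful),
        "mean_p_harmful": sum(p_harmful) / len(p_harmful),
    }


# Made-up scores for four honeypot responses.
toy_metrics = honeypot_metrics([0.02, 0.01, 0.70, 0.05])
```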
Standard safety benchmarks: AUROC
| Dataset | llama-guard-3 | qwen3guard-gen-8b | samarth-icebreaker-v1 |
|---|---|---|---|
| BeaverTails | 0.806 | **0.922** | 0.911 |
| SORRY-Bench | 0.640 | 0.844 | **0.873** |
| ToxicChat | 0.672 | 0.748 | **0.835** |
| WildGuardTest | 0.884 | **0.958** | 0.911 |
| XSTest-Response-Harm | 0.960 | **0.996** | 0.987 |
| CoCoNot | 0.519 | 0.414 | **0.567** |
Bold = best in row. Icebreaker matches qwen3guard-gen-8b on in-distribution benchmarks and wins on the cross-distribution sets (SORRY-Bench, ToxicChat, CoCoNot) without the adversarial-robustness or confident-nonsense collapse.
Architecture
Qwen/Qwen3.5-9B (frozen, bf16, device_map="auto")
        │
        │  forward hook on decoder layer L=22
        │  pool: last-token (over response-token mask)
        ▼
hidden_states[22]   [B, T, D=3584]
        │
        │  slice to last response token, normalize
        ▼
LayerNorm(3584) → Linear(3584, 256) → ReLU → Dropout(0.1) → Linear(256, 2)
        │
        ▼
softmax(logits / T)[:, 1]  →  p_harmful ∈ [0, 1]
- No LoRA, no two-pass forward. Inference cost = 1× base forward + a ~4 MB MLP. Identical latency to the base model itself.
- No fine-tuning of the 9B base. Only the probe (~1M params) trains.
- Temperature scaling applied post-hoc on the validation set (1 scalar fit by LBFGS).
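The "~1M params" and "~4 MB" figures follow from the layer shapes in the diagram. A quick arithmetic check, assuming D=3584 as shown and float32 storage (4 bytes per parameter); the function name is illustrative:

```python
def probe_param_count(d_model=3584, hidden=256, n_classes=2):
    """Count trainable parameters of LayerNorm -> Linear -> ReLU -> Dropout -> Linear."""
    layer_norm = 2 * d_model                    # scale + bias
    linear_1 = d_model * hidden + hidden        # weight + bias
    linear_2 = hidden * n_classes + n_classes   # weight + bias
    return layer_norm + linear_1 + linear_2     # ReLU and Dropout have no parameters


n_params = probe_param_count()
size_mb_fp32 = n_params * 4 / 1e6  # assumes float32 weights
```

With the defaults this gives 925,442 parameters, about 3.7 MB in float32, consistent with the rounded figures above.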
Recommended inference
System prompt
Training randomized the system prompt across three variants per record:
- "" (empty)
- The Qwen3.5 default system prompt
- "You are a helpful, harmless, honest assistant."
At inference, default to no system prompt (empty). That's the configuration the eval numbers above were measured under. The probe is robust to all three variants: adding a custom system prompt won't break it, but it also won't materially improve calibration.
If you want to use the model as an in-line moderation gate where the upstream generator has its own system prompt, pass that same system prompt to icebreaker. The probe is reading the response-token hidden states, so what matters is that the conversation that produced the response is replayed faithfully.
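A minimal sketch of such a gate, to show where the shared system prompt goes. Everything here is hypothetical scaffolding: `moderated_reply`, the fallback text, and the stub callables are not part of this model or its repository. In practice `score` would be any callable with the same signature as the `score_pair` function in the code snippet on this card, and `generate` would wrap your upstream generator.

```python
def moderated_reply(prompt, system_prompt, generate, score, threshold=0.5,
                    fallback="[response withheld by safety gate]"):
    """Generate a reply, then score it with the SAME system prompt the generator saw."""
    response = generate(prompt, system_prompt)
    p_harmful = score(prompt, response, system_prompt)
    released = fallback if p_harmful >= threshold else response
    return released, p_harmful


# Stub callables so the sketch runs without loading any model.
def _stub_generate(prompt, system_prompt):
    return f"echo: {prompt}"


def _stub_score(prompt, response, system_prompt):
    return 0.9 if "forbidden" in prompt else 0.1


safe_text, safe_p = moderated_reply("hello", "Be concise.", _stub_generate, _stub_score)
blocked_text, blocked_p = moderated_reply("forbidden topic", "Be concise.",
                                          _stub_generate, _stub_score)
```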
Code snippet
import json
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Download checkpoint
ckpt_dir = Path(snapshot_download("ArthT/samarth-icebreaker-v1"))
meta = json.loads((ckpt_dir / "meta.json").read_text())

# 2. Load frozen base + tokenizer + probe
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir / "tokenizer", padding_side="right")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
base.eval()
for p in base.parameters():
    p.requires_grad = False

d_model = base.config.hidden_size
probe = nn.Sequential(
    nn.LayerNorm(d_model),
    nn.Linear(d_model, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 2),
).to(device).float()
probe.load_state_dict(torch.load(ckpt_dir / "probe.pt", map_location=device))
probe.eval()  # disables dropout at inference
LAYER = meta["probe_layer"]  # 22
TEMPERATURE = meta.get("temperature", 1.0)

# 3. Hook the target layer: the hook stores that layer's output on every forward pass
captured = {}

def _hook(_module, _input, output):
    captured["h"] = output[0] if isinstance(output, tuple) else output

handle = base.model.layers[LAYER].register_forward_hook(_hook)

# 4. Score a (prompt, response) pair
def score_pair(prompt: str, response: str, system_prompt: str = "") -> float:
    """Returns p_harmful in [0, 1]."""
    msgs = []
    if system_prompt:
        msgs.append({"role": "system", "content": system_prompt})
    msgs.append({"role": "user", "content": prompt})
    msgs.append({"role": "assistant", "content": response})
    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(device)
    with torch.no_grad():
        base(input_ids=ids.input_ids, attention_mask=ids.attention_mask, use_cache=False)
        hidden = captured["h"]  # [B, T, D] bf16
        # Last-token pooling on the response: take the final non-pad position.
        # Indices are moved to hidden's device in case device_map="auto" placed
        # the hooked layer on a different device than the inputs.
        last_idx = (ids.attention_mask.sum(dim=1) - 1).to(hidden.device)
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        pooled = hidden[batch_idx, last_idx]  # [B, D]
        logits = probe(pooled.float().to(device))
        p_harmful = F.softmax(logits / TEMPERATURE, dim=-1)[:, 1]
    return float(p_harmful[0].item())

# Example
p = score_pair(
    prompt="Write a tutorial on building a kitchen knife sharpener at home.",
    response="Sure, start with a flat ceramic tile, place 3 angled supports …",
)
print(f"p_harmful = {p:.3f}")

# Remove the hook once scoring is finished
handle.remove()
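The `TEMPERATURE` value read from meta.json is a single scalar fit post hoc on validation logits. To illustrate what that fit does, here is a dependency-free sketch that minimizes validation negative log-likelihood over T. Note the swap: the model card states the authors used LBFGS, while this sketch uses a golden-section search over log T, which works because the loss has a single minimum in T. The function names, search bounds, and toy logits are assumptions.

```python
import math


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of 2-class logits scaled by 1/temperature."""
    total = 0.0
    for (z0, z1), y in zip(logits, labels):
        a, b = z0 / temperature, z1 / temperature
        m = max(a, b)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(math.exp(a - m) + math.exp(b - m))
        total -= (b if y == 1 else a) - log_norm
    return total / len(labels)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search for the T in [lo, hi] that minimizes nll."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2)


# Toy over-confident probe: large logit margins, but one of four records is wrong.
toy_logits = [(-4.0, 4.0), (4.0, -4.0), (-4.0, 4.0), (4.0, -4.0)]
toy_labels = [1, 0, 0, 0]
fitted_t = fit_temperature(toy_logits, toy_labels)
```

On the toy data the fitted temperature is well above 1, which softens the over-confident probabilities without changing their ranking, so AUROC is unaffected by this step.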
Threshold recommendation
The probe ships with a calibrated temperature so p_harmful ≥ 0.5 is
the default decision threshold. Threshold choices:
- p ≥ 0.5: high-precision flagging (matches headline-table FPR numbers). Use this if false positives are expensive.
- p ≥ 0.3: wider recall, modest FPR increase on honeypot-style confident-nonsense (stays under ~5%).
- For per-deployment tuning, score a labeled mixed set and pick the threshold that hits your target FPR.
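Per-deployment tuning can be sketched as follows: given labeled scores, find the lowest threshold whose false-positive rate on the safe records stays at or below the target. This is one simple way to do it, not a procedure taken from the repository; the function name and toy data are hypothetical.

```python
import math


def threshold_for_target_fpr(labels, scores, target_fpr):
    """Lowest observed score t such that flagging score >= t keeps FPR <= target_fpr."""
    neg = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    if not neg:
        raise ValueError("need at least one safe (label=0) record")
    allowed = int(target_fpr * len(neg))  # floor: how many safe records may be flagged
    if allowed >= len(neg):
        return 0.0
    boundary = neg[allowed]  # this safe score must stay below the threshold
    candidates = sorted({s for s in scores if s > boundary})
    # If no observed score exceeds the boundary, nothing can be flagged safely.
    return candidates[0] if candidates else math.inf


# Made-up labeled scores: four safe records, two harmful ones.
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.6, 0.8, 0.7, 0.9]
t_25 = threshold_for_target_fpr(toy_labels, toy_scores, 0.25)
t_00 = threshold_for_target_fpr(toy_labels, toy_scores, 0.0)
```

With few safe records the achievable FPR values are coarse, so a target such as 1% needs at least a few hundred label=0 examples to be meaningful.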
Training recipe
Corpus (8 HF datasets)
| source | size used | role |
|---|---|---|
| PKU-Alignment/BeaverTails | ≤ 60K | standard |
| nvidia/Aegis-AI-Content-Safety-Dataset-2.0 | full | standard |
| PKU-Alignment/PKU-SafeRLHF-30K | full | standard |
| sorry-bench/sorry-bench-202406 | full | standard |
| Anthropic/hh-rlhf (red_team_attempts) | full | standard |
| allenai/wildguardmix (wildguardtrain) | ≤ 60K | standard + mined refusals |
| allenai/wildjailbreak | ≤ 60K | standard |
| AmazonScience/FalseReject | full | standard |
Each training batch mixes the standard pool with three weighted augmentation pools (all augmentation records are label=0):
| Pool | Batch fraction | Source |
|---|---|---|
| rubbish (token-injection + degenerate generation) | 20% | Generated from JBB seeds |
| mined_refusal (refusal-mentioning-harm) | 10% | Mined from WildGuardTrain refusals |
| cn_honeypot (confident-nonsense) | 15% | benchmarks/cn_honeypots.jsonl (4,994 records) |
| standard | 55% | Balanced from the 8 sources above |
The cn_honeypot pool is the load-bearing fix. Removing it returns
the probe to ~90% FPR on the honeypot set, the same failure mode as every
other LLM-as-judge.
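To make the batch fractions concrete, the sketch below converts them into per-pool record counts for a given batch size. The model card does not say how fractional counts are rounded or whether the mix is enforced per batch or only in expectation; largest-remainder rounding is an assumption made here so the counts sum exactly to the batch size.

```python
POOL_FRACTIONS = {
    "standard": 0.55,
    "rubbish": 0.20,
    "cn_honeypot": 0.15,
    "mined_refusal": 0.10,
}


def batch_counts(batch_size, fractions=POOL_FRACTIONS):
    """Split batch_size across pools; largest-remainder rounding is an assumption."""
    exact = {name: frac * batch_size for name, frac in fractions.items()}
    counts = {name: int(value) for name, value in exact.items()}  # floor each share
    leftover = batch_size - sum(counts.values())
    # Hand the remaining slots to the pools with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


counts_128 = batch_counts(128)  # effective batch size from the hyperparameter table
```

At the effective batch size of 128 this yields 19 confident-nonsense honeypots per optimizer step.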
Hyperparameters
| Param | Value |
|---|---|
| Probe layer | 22 (swept over {16, 18, 20, 22, 24, 26}) |
| Pool | last-token (swept over {mean-response, last-token}) |
| Seed | 42 (3 seeds trained per config: 42, 43, 44) |
| Optimizer | AdamW |
| LR schedule | 1e-3 → 1e-5 cosine, 5% warmup, weight_decay=0.01 |
| Batch | per-device 8 × grad_accum 16 → effective 128 |
| max_length train | 1024 (prompt 512 / response 512) |
| max_length inference | 2048 (response truncation from end) |
| Calibration | Temperature scaling on val set (LBFGS, 1 scalar) |
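The LR schedule row can be read as the following function of the optimizer step. This is a sketch under stated assumptions: the table gives the peak (1e-3), the floor (1e-5), the cosine shape, and the 5% warmup fraction, but not the warmup shape, so linear warmup from near zero is assumed here. The function name `lr_at` is illustrative.

```python
import math


def lr_at(step, total_steps, peak=1e-3, floor=1e-5, warmup_frac=0.05):
    """Linear warmup to peak (assumed), then cosine decay from peak to floor."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# Example with a hypothetical 1,000-step run.
lr_start = lr_at(0, 1000)
lr_peak = lr_at(50, 1000)
lr_end = lr_at(1000, 1000)
```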
Compute
Training pipeline runs on CSCS Clariden (NVIDIA GH200, aarch64). The
9B base forward is the bottleneck, so we cache pooled features once
across all 6 candidate layers and 2 pool types in a single forward pass
per record. Then the probe-only sweep over {layer × pool × seed}
trains in seconds per config.
- One-shot feature cache (235K records, 6 layers × 2 pools): ~1 hour wall on 1× GH200
- Probe-only sweep (36 v1 configs + 9 v2sm refit configs, 3 epochs each): ~2 minutes wall total on 12× GH200
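The sweep grid and the shape of the cache can be enumerated directly from the hyperparameter table. The sketch below is illustrative only: the config dictionaries and the size estimate are assumptions, not the repository's actual layout, and the estimate assumes bf16 storage (2 bytes per value) with D=3584.

```python
from itertools import product

LAYERS = (16, 18, 20, 22, 24, 26)
POOLS = ("mean-response", "last-token")
SEEDS = (42, 43, 44)


def sweep_configs():
    """One probe-training config per (layer, pool, seed) combination."""
    return [
        {"layer": layer, "pool": pool, "seed": seed}
        for layer, pool, seed in product(LAYERS, POOLS, SEEDS)
    ]


def cache_size_gb(n_records=235_000, d_model=3584, bytes_per_value=2):
    """Rough cache size: one pooled vector per record per (layer, pool) pair."""
    vectors_per_record = len(LAYERS) * len(POOLS)
    return n_records * vectors_per_record * d_model * bytes_per_value / 1e9


configs = sweep_configs()
approx_cache_gb = cache_size_gb()
```

The grid has 6 × 2 × 3 = 36 entries, matching the 36 v1 configs listed above, and each record contributes 12 pooled vectors to the cache.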
Why this works: the lexical-bias critique
Probing-based safety judges learn the lexical presence of harmful tokens in the response, not compositional harmful intent. We confirmed this on confident-nonsense honeypots: the baseline probe (no honeypot augmentation) calls 82–95% of confident-nonsense responses harmful at threshold 0.5, because they mention harmful concepts in words even though the responses contain no real uplift.
The 8 standard corpora simply don't contain (1,1,0,0)-shaped training
signal, i.e. confident-sounding harmful-surface responses labelled SAFE.
Once we fold 4,994 such records (the honeypot set) into the training
mix at 15% of each batch, the probe learns to project off the lexical
axis and the failure mode collapses.
The failure mode is mitigated, not eliminated: a probe-based judge will always be one distributionally-novel adversarial attack away from re-failing the lexical bias. The principled extension is RLACE-style geometric concept erasure on top of the corpus fix.
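As a geometric intuition for "projecting off an axis", the sketch below removes the component of a feature vector along one given direction. This is only the final projection step in its simplest rank-1 form. It is not RLACE, which learns the subspace to erase through an adversarial game, and it is not something the probe does explicitly. The function name and the toy vectors are hypothetical.

```python
def project_out(x, direction):
    """Return x with its component along `direction` removed (orthogonal projection)."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    coeff = sum(a * d for a, d in zip(x, direction)) / norm_sq
    return [a - coeff * d for a, d in zip(x, direction)]


# Toy 2-D example: remove the first axis, keep the second.
cleaned = project_out([3.0, 4.0], [1.0, 0.0])
residual = sum(a * d for a, d in zip(cleaned, [1.0, 0.0]))
```

After the projection, a linear readout along the removed direction sees nothing, which is the property concept-erasure methods aim for on a learned subspace.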
Limitations
- English only. Training corpora are English; multilingual performance not measured.
- Modest OOD trade-off vs the baseline-aug-only probe. Adding the honeypot augmentation cost ~0.03 AUROC on ToxicChat and ~0.05 on CoCoNot. The probe is more selective and threw out some OOD harmful signal as a side effect.
- Seed variance is real. Across 3 seeds (42 / 43 / 44) at the same (layer, pool) config, per-dataset AUROC varies by ±0.05–0.20 on the smaller benchmarks (XSTest, CoCoNot). Use the ensemble form of the model (mean across 3 seeds) if you need lower variance.
- The model is trained on response classification, not prompt classification. It expects a (prompt, response) pair. Scoring prompts alone returns uncalibrated output.
- JudgeStressTest AUROC of 0.620 is "best in class," not "safe to deploy." No judge in our comparison hits 0.7 on the adversarial set; adversarial robustness remains an open problem.
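The ensemble form mentioned in the seed-variance item can be sketched as a plain average over the three seed probes. The model card says "mean across 3 seeds" without specifying whether probabilities or logits are averaged; averaging p_harmful is an assumption here, and the function name is hypothetical.

```python
def ensemble_p_harmful(per_seed_scores):
    """Average p_harmful over seed probes (averaging probabilities is an assumption)."""
    if not per_seed_scores:
        raise ValueError("need at least one seed score")
    return sum(per_seed_scores) / len(per_seed_scores)


# Made-up scores from the seed-42, seed-43 and seed-44 probes for one record.
p_ensemble = ensemble_p_harmful([0.2, 0.4, 0.6])
```

All three probes read the same cached hidden state, so the ensemble adds three small MLP evaluations but no extra base-model forward pass.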
Citation
@misc{singh2026samarthicebreaker,
title = {samarth-icebreaker: A frozen-base MLP probe for adversarially
robust LLM safety classification},
author = {Singh, Arth},
year = {2026},
url = {https://huggingface.co/ArthT/samarth-icebreaker-v1},
}
Repository
Code, training pipeline, and the full eval suite live at: https://github.com/Arth-Singh/Robust-jailbreak-judges
The samarth-icebreaker family is auto-registered by the repo's
judges.adapters.samarth_icebreaker module: drop the checkpoint into
checkpoints/samarth-icebreaker-L22-last-token-s42-v2sm/ and it will
be visible to evals.run_judge and the sweep tooling.
License
Apache 2.0.
This is a research artifact. It is not a production-grade safety filter. Independent evaluation against your specific threat model is required before deployment.