MERIT-XS v2 (Research Preview)
MERIT-XS is a compact moderation encoder developed by Meridian Safety.
v2 improvements over v1:
- Multi-head taxonomy (8 production heads) replacing the single binary toxicity head
- Per-head thresholds calibrated with an F1 sweep on the dev set
- Unicode/leetspeak evasion normaliser included
- Updated encoder with TPU-optimised attention (chunked local attention, no unfold)
Included files
| File | Description |
|---|---|
| merit_xs_multitask_v2.pt | Multi-head moderation bundle (encoder + 10 heads) |
| infer_merit_xs.py | CLI inference entrypoint |
| load_merit_xs.py | Python loader (multi-head API) |
| evasion_normaliser.py | Unicode homoglyph / leetspeak normaliser |
| metrics_summary.json | Per-head dev/test AUROC, F1, calibrated thresholds |
| merit/ | Local model package |
| assets/tokenizers/merit/ | DeBERTa v3 tokenizer (128k vocab) |
Setup
```shell
pip install -r requirements.txt
```
CLI usage
```shell
# One or more inline messages (repeat --text for each)
python infer_merit_xs.py \
  --checkpoint merit_xs_multitask_v2.pt \
  --text "you are awful" \
  --text "send me a picture"

# Batch from file
python infer_merit_xs.py \
  --checkpoint merit_xs_multitask_v2.pt \
  --input-file messages.txt \
  --output-file results.jsonl
```
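A batch run leaves its results in results.jsonl. The sketch below shows one way to summarise such a file. It assumes each line is a single JSON object following the Output schema in this README; that layout is an assumption, not something the package confirms. `count_flags` is a hypothetical helper and is not part of the MERIT-XS package.

```python
import json
from collections import Counter


def count_flags(lines):
    """Count how often each head appears in "flags" across JSONL lines.

    `lines` is any iterable of strings, for example an open file handle.
    Assumes one JSON object per line with a "flags" list (an assumption
    about the CLI output format).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts.update(record.get("flags", []))
    return counts
```

Typical use would be `with open("results.jsonl", encoding="utf-8") as fh: print(count_flags(fh))`.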
Python usage
```python
from load_merit_xs import load_merit_xs

model = load_merit_xs()  # auto-detects merit_xs_multitask_v2.pt
results = model.predict([
    "you are awful",
    "send me a pic, just us",
    "what time does school end",
])
for r in results:
    print(r)
```
Output schema
```python
{
    "text": str,
    "scores": {
        "toxicity": float,           # threshold 0.35
        "harassment_insult": float,  # threshold 0.40
        "threat_violence": float,    # threshold 0.30
        "identity_hate": float,      # threshold 0.35
        "sexual_explicit": float,    # threshold 0.65
        "grooming": float,           # threshold 0.60
        "prompt_injection": float,   # threshold 0.70
        "overall_risk": float,       # threshold 0.40
    },
    "flags": list[str],      # heads that exceeded their threshold
    "flagged": bool,         # any head exceeded threshold
    "evasion_score": float,  # 0.0 = clean text, 1.0 = heavy obfuscation
}
```
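The relationship between `scores`, `flags`, and `flagged` can be sketched as follows, which is also handy for re-deriving flags under stricter or looser thresholds in an evaluation. This is an illustration only: `DEFAULT_THRESHOLDS` and `apply_thresholds` are hypothetical names, the threshold values are copied from the schema above, and the strict `>` comparison is an assumption based on the word "exceeded".

```python
# Threshold values copied from the Output schema in this README.
DEFAULT_THRESHOLDS = {
    "toxicity": 0.35,
    "harassment_insult": 0.40,
    "threat_violence": 0.30,
    "identity_hate": 0.35,
    "sexual_explicit": 0.65,
    "grooming": 0.60,
    "prompt_injection": 0.70,
    "overall_risk": 0.40,
}


def apply_thresholds(scores, thresholds=DEFAULT_THRESHOLDS):
    """Derive "flags" and "flagged" from a per-head score dict.

    Assumes a head is flagged when its score is strictly greater than
    its threshold; heads without a threshold are ignored.
    """
    flags = [
        head
        for head, score in scores.items()
        if head in thresholds and score > thresholds[head]
    ]
    return {"flags": flags, "flagged": bool(flags)}
```

Passing a modified copy of the dict, for example `{**DEFAULT_THRESHOLDS, "grooming": 0.50}`, re-flags the same scores without re-running the model.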
Performance (dev set, calibrated thresholds)
| Head | AUROC | F1 | Threshold |
|---|---|---|---|
| toxicity | 0.964 | 0.838 | 0.35 |
| harassment_insult | 0.979 | 0.897 | 0.40 |
| threat_violence | 0.938 | 0.682 | 0.30 |
| identity_hate | 0.971 | 0.855 | 0.35 |
| sexual_explicit | 0.996 | 0.950 | 0.65 |
| grooming | 0.976 | 0.750 | 0.60 |
| prompt_injection | 0.986 | 0.890 | 0.70 |
| overall_risk | 0.964 | 0.846 | 0.40 |
| macro average | 0.972 | 0.838 | - |
The self_harm and extremism heads are present in the bundle but excluded from the default output because of training-data distribution issues.
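For readers unfamiliar with threshold calibration, the sketch below shows what a per-head F1 sweep can look like: try a grid of candidate thresholds on dev-set scores and keep the one with the highest F1. It is a generic illustration under stated assumptions (0.05 grid, strict `>` comparison, ties broken toward the lower threshold), not Meridian Safety's actual calibration procedure. `f1_at` and `sweep_threshold` are hypothetical helpers.

```python
def f1_at(scores, labels, threshold):
    """F1 for binary labels (0/1) when predicting positive if score > threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s > threshold and y == 1)
    fp = sum(1 for s, y in pairs if s > threshold and y == 0)
    fn = sum(1 for s, y in pairs if s <= threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_threshold(scores, labels, step=0.05):
    """Return the grid threshold with the best F1 (lowest threshold on ties)."""
    n_steps = int(round(1 / step))
    candidates = [round(i * step, 2) for i in range(1, n_steps)]
    return max(candidates, key=lambda t: (f1_at(scores, labels, t), -t))
```

Running `sweep_threshold` once per head on that head's dev scores and labels would yield one threshold per head, analogous to the Threshold column above.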
Evasion normaliser
The included evasion_normaliser.py maps 600+ Unicode homoglyphs, leetspeak substitutions, diacritics, and shattered words to ASCII before scoring. load_merit_xs.py applies it by default.
```python
from load_merit_xs import load_merit_xs

model = load_merit_xs()
# The evasion normaliser is on by default; the flag is passed explicitly here for clarity
results = model.predict(["h3ll0 k1d w4nn4 t4lk"], apply_evasion_normaliser=True)
print(results[0]["evasion_score"])  # > 0 indicates obfuscation detected
```
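To make the idea concrete, here is a toy normaliser in the same spirit. It is not the code in evasion_normaliser.py: the mapping table is a tiny hand-picked sample, the score formula (edited characters divided by non-space characters) is an assumption, and shattered-word rejoining is not covered. `SUBSTITUTIONS` and `normalise` are hypothetical names.

```python
import unicodedata

# Tiny illustrative table; the shipped normaliser is described as covering 600+ mappings.
SUBSTITUTIONS = {
    "0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t",
    "@": "a", "$": "s",
    "\u0430": "a", "\u0435": "e", "\u043e": "o",  # Cyrillic look-alikes
}


def normalise(text):
    """Return (normalised_text, evasion_score) for a single message.

    Caveat: this naive version also rewrites legitimate digits
    (for example "3 pm"), which a real normaliser would need to handle.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    out, changed, total = [], 0, 0
    for ch in decomposed:
        if unicodedata.combining(ch):
            changed += 1  # a stripped diacritic counts as one edit
            continue
        mapped = SUBSTITUTIONS.get(ch, ch)
        if not ch.isspace():
            total += 1
            changed += mapped != ch
        out.append(mapped)
    score = min(1.0, changed / total) if total else 0.0
    return "".join(out), score
```

On the README's example string this toy version recovers "hello kid wanna talk" with a score strictly between 0 and 1, while plain ASCII text without digits passes through unchanged with a score of 0.0.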
License
MERIT Research Preview License (MRPL v1.0): research, evaluation, and benchmarking use permitted. Commercial deployment and hosted/public API use require separate permission from Meridian Safety.
Current limitations
- Message-level only: does not model conversation trajectory (use MERIT-S for multi-turn)
- Not a production safety system: do not use as sole enforcement mechanism
- English-primary: multilingual coverage is partial
Research note
This package is intended for research, benchmarking, representation-transfer experiments, and moderation evaluation. It is not intended for production moderation, safety-critical enforcement, or fully automated policy decisions.