MERIT-XS v2 (Research Preview)

MERIT-XS is a compact moderation encoder developed by Meridian Safety.

Upgrades in v2 over v1:

Multi-head taxonomy (8 production heads) replacing the single binary toxicity head
Per-head decision thresholds calibrated by an F1 sweep on the dev set
Unicode/leetspeak evasion normaliser included
Updated encoder with TPU-optimised attention (chunked local attention, no unfold)

Included files

File	Description
`merit_xs_multitask_v2.pt`	Multi-head moderation bundle (encoder + 10 heads)
`infer_merit_xs.py`	CLI inference entrypoint
`load_merit_xs.py`	Python loader — multi-head API
`evasion_normaliser.py`	Unicode homoglyph / leetspeak normaliser
`metrics_summary.json`	Per-head dev/test AUROC, F1, calibrated thresholds
`merit/`	Local model package
`assets/tokenizers/merit/`	DeBERTa v3 tokenizer (128k vocab)

Setup

pip install -r requirements.txt

CLI usage

# One or more messages (repeat --text for each)
python infer_merit_xs.py \
  --checkpoint merit_xs_multitask_v2.pt \
  --text "you are awful" \
  --text "send me a picture"

# Batch from file
python infer_merit_xs.py \
  --checkpoint merit_xs_multitask_v2.pt \
  --input-file messages.txt \
  --output-file results.jsonl
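
To inspect a batch run, the output file can be summarised with a few lines of Python. This is a minimal sketch, assuming `results.jsonl` contains one JSON object per line that follows the output schema documented below (the `.jsonl` extension suggests this, but the README does not state it). The helper name `summarise_results` is hypothetical and not part of the package.

```python
import json
from collections import Counter
from pathlib import Path


def summarise_results(path):
    """Count flagged messages and per-head flag frequency in a JSONL file.

    Assumption: each non-empty line is a JSON object with a boolean
    "flagged" field and a "flags" list of head names.
    """
    total = 0
    flagged = 0
    per_head = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            if record.get("flagged"):
                flagged += 1
                per_head.update(record.get("flags", []))
    return {"total": total, "flagged": flagged, "per_head": dict(per_head)}
```

Calling `summarise_results("results.jsonl")` after the batch command above would return the total message count, the number flagged, and how often each head fired.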

Python usage

from load_merit_xs import load_merit_xs

model = load_merit_xs()  # auto-detects merit_xs_multitask_v2.pt
results = model.predict([
    "you are awful",
    "send me a pic, just us",
    "what time does school end",
])
for r in results:
    print(r)

Output schema

{
    "text": str,
    "scores": {
        "toxicity": float,           # threshold 0.35
        "harassment_insult": float,  # threshold 0.40
        "threat_violence": float,    # threshold 0.30
        "identity_hate": float,      # threshold 0.35
        "sexual_explicit": float,    # threshold 0.65
        "grooming": float,           # threshold 0.60
        "prompt_injection": float,   # threshold 0.70
        "overall_risk": float,       # threshold 0.40
    },
    "flags": list[str],   # heads that exceeded their threshold
    "flagged": bool,      # any head exceeded threshold
    "evasion_score": float,  # 0.0 = clean text, 1.0 = heavy obfuscation
}
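
The relationship between `scores`, `flags`, and `flagged` can be sketched as follows. This is an illustration, not the loader's actual code: the helper `derive_flags` is hypothetical, the example scores are made up, and it assumes "exceeded" means strictly greater than the threshold.

```python
# Thresholds copied from the output schema above.
THRESHOLDS = {
    "toxicity": 0.35,
    "harassment_insult": 0.40,
    "threat_violence": 0.30,
    "identity_hate": 0.35,
    "sexual_explicit": 0.65,
    "grooming": 0.60,
    "prompt_injection": 0.70,
    "overall_risk": 0.40,
}


def derive_flags(scores, thresholds=THRESHOLDS):
    """Return (flags, flagged) for one message's per-head scores.

    Assumption: a head is flagged when its score is strictly greater
    than its threshold. Heads without a known threshold are ignored.
    """
    flags = [
        head
        for head, score in scores.items()
        if head in thresholds and score > thresholds[head]
    ]
    return flags, bool(flags)


# Made-up scores for illustration; these are not real model output.
example_scores = {
    "toxicity": 0.81, "harassment_insult": 0.77, "threat_violence": 0.05,
    "identity_hate": 0.02, "sexual_explicit": 0.01, "grooming": 0.01,
    "prompt_injection": 0.03, "overall_risk": 0.62,
}
flags, flagged = derive_flags(example_scores)
print(flags, flagged)
```

Because the raw scores are returned alongside the flags, a stricter or looser `thresholds` dict can be applied afterwards without re-running the model.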

Performance (dev set, calibrated thresholds)

Head	AUROC	F1	Threshold
toxicity	0.964	0.838	0.35
harassment_insult	0.979	0.897	0.40
threat_violence	0.938	0.682	0.30
identity_hate	0.971	0.855	0.35
sexual_explicit	0.996	0.950	0.65
grooming	0.976	0.750	0.60
prompt_injection	0.986	0.890	0.70
overall_risk	0.964	0.846	0.40
macro average	0.972	0.838	—

The `self_harm` and `extremism` heads are present in the bundle but excluded from the default output because of training data distribution issues. Together with the 8 heads above, they account for the 10 heads in the checkpoint.
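
The thresholds in the table come from an F1 sweep on the dev set. The sketch below shows one way such a sweep could work for a single head. It is a sketch under stated assumptions rather than the procedure actually used: the grid step of 0.05, the strict `>` comparison, the tie-breaking rule, and the toy data are all assumptions, and the function names are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for binary labels when predicting positive where score > threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score > threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_threshold(scores, labels, step=0.05):
    """Return (best_threshold, best_f1) over the grid step, 2*step, ... below 1.0."""
    n_steps = round(1.0 / step)
    best_threshold, best_f1 = None, -1.0
    for i in range(1, n_steps):
        threshold = round(i * step, 10)  # avoid float drift such as 0.35000000000000003
        f1 = f1_at_threshold(scores, labels, threshold)
        if f1 > best_f1:  # strict comparison, so ties keep the lowest threshold
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Toy dev data for one head, made up for illustration.
dev_scores = [0.10, 0.20, 0.35, 0.45, 0.55, 0.80, 0.90]
dev_labels = [0, 0, 0, 1, 0, 1, 1]
print(sweep_threshold(dev_scores, dev_labels))
```

Running the sweep independently per head would yield one threshold per head, which matches the shape of the Threshold column above.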

Evasion normaliser

The included `evasion_normaliser.py` maps 600+ Unicode homoglyphs, leetspeak substitutions, diacritics, and shattered words to ASCII before scoring. It is applied by default in `load_merit_xs.py`.

from load_merit_xs import load_merit_xs

model = load_merit_xs()
# The evasion normaliser is on by default; the flag is passed explicitly here for clarity.
results = model.predict(["h3ll0 k1d w4nn4 t4lk"], apply_evasion_normaliser=True)
print(results[0]["evasion_score"])  # > 0 indicates obfuscation detected
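
To illustrate the general idea behind this kind of normalisation, here is a toy sketch using only the standard library. It is not the code in `evasion_normaliser.py`: the eight-entry leetspeak table, the function names, and the scoring rule (fraction of non-space characters that were changed) are all assumptions made for illustration, and it does not handle shattered words or cross-script homoglyphs.

```python
import unicodedata

# Tiny illustrative table; the shipped normaliser is described as covering 600+ mappings.
LEET = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}


def normalise_char(ch):
    """Fold compatibility forms, strip diacritics, then apply the leetspeak table."""
    decomposed = unicodedata.normalize("NFKD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(LEET.get(c, c) for c in base)


def normalise_with_score(text):
    """Return (normalised_text, score), where score is in [0.0, 1.0].

    Assumption: score = changed non-space characters / all non-space characters.
    Note that this toy version also rewrites legitimate digits.
    """
    out = []
    changed = 0
    visible = 0
    for ch in text:
        if ch.isspace():
            out.append(ch)
            continue
        visible += 1
        mapped = normalise_char(ch)
        if mapped != ch:
            changed += 1
        out.append(mapped)
    score = changed / visible if visible else 0.0
    return "".join(out), score


clean, score = normalise_with_score("h3ll0 k1d w4nn4 t4lk")
print(clean, score)
```

A character-level ratio like this stays at 0.0 for plain ASCII text and grows with the share of substituted characters, which is one simple way to obtain a score with the range described in the output schema.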

License

MERIT Research Preview License (MRPL v1.0) — research, evaluation, and benchmarking use permitted. Commercial deployment and hosted/public API use require separate permission from Meridian Safety.

Current limitations

Message-level only — does not model conversation trajectory (use MERIT-S for multi-turn)
Not a production safety system — do not use as sole enforcement mechanism
English-primary — multilingual coverage is partial

Research note

This package is intended for research, benchmarking, representation-transfer experiments, and moderation evaluation. It is not intended for production moderation, safety-critical enforcement, or fully automated policy decisions.
