Privacy Filter Multi-Task 🔒📄

A single model for simultaneous PII Detection (NER) and Document Classification (10 categories).

Adapted from openai/privacy-filter — a 1.4B Sparse MoE transformer with only ~50M active parameters per token.

Architecture

```
Input → BPE Tokenizer (o200k_base, 200K vocab)
  ↓
8-layer Sparse MoE Transformer
  • 128 experts, top-4 routing (~50M active params/token)
  • Banded sliding-window attention (window=128)
  • GQA: 14 query heads, 2 KV heads, head_dim=64
  • Hidden size: 640
  ↓                          ↓
NER Head (640→33)        Doc Head (mean-pool → 640→10)
  ↓                          ↓
BIOES PII tags            10-class document category
```
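
For orientation, here is a minimal sketch of how the two heads sit on top of the backbone's 640-dim token states (an illustrative module, not the exact implementation in this repo):

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Sketch of the two task heads over 640-dim backbone hidden states."""
    def __init__(self, hidden=640, ner_labels=33, doc_classes=10):
        super().__init__()
        self.ner_head = nn.Linear(hidden, ner_labels)   # per-token BIOES logits
        self.doc_head = nn.Linear(hidden, doc_classes)  # document-level logits

    def forward(self, hidden_states, attention_mask):
        # hidden_states: [batch, seq, 640], attention_mask: [batch, seq]
        ner_logits = self.ner_head(hidden_states)                  # [batch, seq, 33]
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1)  # mean-pool
        doc_logits = self.doc_head(pooled)                         # [batch, 10]
        return ner_logits, doc_logits
```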

Results

PII Detection (NER)

| Metric | Value |
|---|---|
| F1 (strict span-level) | 0.493 |
| Precision | 0.697 |
| Recall | 0.381 |
| Token Accuracy | 0.944 |

8 entity types: private_person · private_email · private_phone · private_address · private_date · private_url · account_number · secret
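
Under the BIOES scheme each entity type expands into four positional tags plus a shared O tag, which is where the NER head's 33 outputs come from (8 × 4 + 1 = 33). A small sketch; the exact tag strings live in config.json and may be formatted differently:

```python
# 8 entity types x 4 BIOES prefixes + "O" = 33 labels (hypothetical tag naming).
ENTITY_TYPES = [
    "private_person", "private_email", "private_phone", "private_address",
    "private_date", "private_url", "account_number", "secret",
]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I", "E", "S")]
assert len(LABELS) == 33  # matches the NER head: Linear(640 -> 33)
```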

Document Classification (10 classes)

| Split | Accuracy |
|---|---|
| Val | 0.470 |
| Test | 0.478 |

Per-class test accuracy:

| Category | Accuracy |
|---|---|
| Computers & Internet | 0.688 |
| Family & Relationships | 0.615 |
| Science & Mathematics | 0.556 |
| Health | 0.524 |
| Sports | 0.523 |
| Politics & Government | 0.493 |
| Entertainment & Music | 0.444 |
| Society & Culture | 0.363 |
| Education & Reference | 0.310 |
| Business & Finance | 0.263 |

🚀 Production Inference Guide

All numbers below are measured on real hardware with both task heads (NER + doc classification) executing on every call. The benchmark runs a single forward pass that produces PII entity tags and the document category simultaneously.
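
A minimal sketch of such a measurement loop (assumed harness on a CUDA device, not the exact benchmark script, with `model`, `doc_head`, and `tokenizer` loaded as in the Usage section below): run the forward pass plus both heads, synchronize, and take mean/p95/p99 over repeated iterations after warmup.

```python
import time
import numpy as np
import torch

@torch.no_grad()
def bench(model, doc_head, tokenizer, batch_size=1, seq_len=128, iters=50, warmup=5):
    # Fixed-length dummy batch so results are comparable across configurations.
    texts = ["lorem ipsum " * seq_len] * batch_size
    inputs = tokenizer(texts, return_tensors="pt", padding="max_length",
                       truncation=True, max_length=seq_len).to(model.device)
    times_ms = []
    for _ in range(iters + warmup):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        out = model(**inputs, output_hidden_states=True)
        _ = out.logits.argmax(-1)                       # NER tags
        hidden = out.hidden_states[-1]
        mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
        _ = doc_head(pooled).argmax(-1)                 # document category
        torch.cuda.synchronize()
        times_ms.append((time.perf_counter() - t0) * 1000)
    t = np.array(times_ms[warmup:])                     # drop warmup iterations
    return t.mean(), np.percentile(t, 95), np.percentile(t, 99)
```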

Resource Requirements

| Resource | Value |
|---|---|
| Model weights (bf16) | 2.8 GB GPU VRAM / RAM |
| Model weights (fp32) | 5.6 GB RAM |
| ONNX variants available upstream | fp16, int8, q4 (see openai/privacy-filter) |
| Min GPU VRAM (bs=1, seq≤512) | 2.9 GB |
| Min GPU VRAM (bs=64, seq=512) | 6.2 GB |
| Fits on | T4 (16 GB), L4 (24 GB), A10G (24 GB), A100, any ≥8 GB GPU |

GPU — Single-Document Latency (NVIDIA A10G, bf16)

Time from raw text to both NER tags + document category:

| Sequence Length | Latency (mean) | Latency (p95) | Latency (p99) |
|---|---|---|---|
| 64 tokens | 113 ms | 117 ms | 122 ms |
| 128 tokens | 106 ms | 110 ms | 115 ms |
| 256 tokens | 106 ms | 111 ms | 113 ms |
| 512 tokens | 106 ms | 113 ms | 116 ms |

Latency is dominated by a fixed ~105 ms kernel-launch overhead from the Sparse MoE routing — it barely changes with sequence length up to 512 tokens.

GPU — Batched Throughput (NVIDIA A10G, bf16)

| Batch Size | Seq 64 | Seq 128 | Seq 256 | Seq 512 |
|---|---|---|---|---|
| 1 | 8.9 docs/s | 9.4 docs/s | 9.4 docs/s | 9.4 docs/s |
| 4 | 36 docs/s | 37 docs/s | 37 docs/s | 32 docs/s |
| 8 | 73 docs/s | 73 docs/s | 69 docs/s | 53 docs/s |
| 16 | 139 docs/s | 138 docs/s | 114 docs/s | 73 docs/s |
| 32 | 265 docs/s | 238 docs/s | 165 docs/s | 89 docs/s |
| 64 | 460 docs/s | 348 docs/s | 207 docs/s | 101 docs/s |

GPU — Batched Latency Detail (NVIDIA A10G, bf16)

Full latency table:

| Batch | Seq Len | Batch Latency (ms) | Per-Doc (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|---|
| 1 | 64 | 113 | 112.7 | 117 | 122 |
| 4 | 64 | 111 | 27.8 | 116 | 118 |
| 8 | 64 | 110 | 13.8 | 114 | 126 |
| 16 | 64 | 115 | 7.2 | 121 | 125 |
| 32 | 64 | 121 | 3.8 | 127 | 135 |
| 64 | 64 | 139 | 2.2 | 144 | 144 |
| 1 | 128 | 106 | 105.9 | 110 | 115 |
| 4 | 128 | 107 | 26.9 | 112 | 115 |
| 8 | 128 | 110 | 13.7 | 115 | 116 |
| 16 | 128 | 116 | 7.3 | 121 | 128 |
| 32 | 128 | 134 | 4.2 | 139 | 143 |
| 64 | 128 | 184 | 2.9 | 189 | 191 |
| 1 | 256 | 106 | 106.1 | 111 | 113 |
| 4 | 256 | 109 | 27.2 | 114 | 115 |
| 8 | 256 | 117 | 14.6 | 123 | 126 |
| 16 | 256 | 140 | 8.8 | 145 | 147 |
| 32 | 256 | 194 | 6.1 | 199 | 202 |
| 64 | 256 | 309 | 4.8 | 314 | 315 |
| 1 | 512 | 106 | 106.5 | 113 | 116 |
| 4 | 512 | 125 | 31.2 | 129 | 130 |
| 8 | 512 | 152 | 19.0 | 158 | 165 |
| 16 | 512 | 219 | 13.7 | 223 | 225 |
| 32 | 512 | 358 | 11.2 | 361 | 364 |
| 64 | 512 | 636 | 9.9 | 639 | 641 |

GPU — Peak VRAM Usage (bf16)

| Batch Size | Seq 128 | Seq 256 | Seq 512 |
|---|---|---|---|
| 1 | 2,817 MB | 2,824 MB | 2,862 MB |
| 8 | 2,857 MB | 2,936 MB | 3,237 MB |
| 32 | 3,000 MB | 3,309 MB | 4,522 MB |
| 64 | 3,189 MB | 3,809 MB | 6,236 MB |

The model is extremely memory-efficient. Even at batch=64, seq=512 it uses only 6.2 GB, which fits comfortably on a T4 (16 GB). This is because the Sparse MoE only activates 4 of 128 experts per token.

CPU — Latency & Throughput (AMD EPYC 7R32, 8 cores, fp32)

| Batch | Seq 64 | Seq 128 | Seq 256 | Seq 512 |
|---|---|---|---|---|
| 1 | 152 ms (6.6/s) | 193 ms (5.2/s) | 302 ms (3.3/s) | 569 ms (1.8/s) |
| 4 | 278 ms (14.4/s) | 468 ms (8.6/s) | 935 ms (4.3/s) | 2,464 ms (1.6/s) |
| 8 | 467 ms (17.1/s) | 862 ms (9.3/s) | 1,728 ms (4.6/s) | 4,745 ms (1.7/s) |
| 16 | 837 ms (19.1/s) | 1,624 ms (9.9/s) | 3,814 ms (4.2/s) | 9,143 ms (1.7/s) |

On CPU the model runs at ~152 ms/doc for short texts (seq=64, bs=1) — suitable for low-volume or batch-offline pipelines.
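
When reproducing the CPU numbers, pin the thread count to the physical cores. A minimal fp32 CPU setup (the thread count here is an assumption matching the 8-core machine above, not a documented benchmark setting):

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

torch.set_num_threads(8)  # match physical core count; tune for your machine

tokenizer = AutoTokenizer.from_pretrained("binga/privacy-filter-multitask")
cpu_model = AutoModelForTokenClassification.from_pretrained(
    "binga/privacy-filter-multitask", dtype=torch.float32  # fp32 on CPU, as benchmarked
)
cpu_model.eval()
```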

Daily Throughput Projections

Sustained throughput for a single device, running 24/7 at the optimal batch size:

| Sequence Length | GPU (A10G, bf16) | CPU (8-core, fp32) |
|---|---|---|
| 64 tokens | 39.8M docs/day (460/s, bs=64) | 1.7M docs/day (19/s, bs=16) |
| 128 tokens | 30.1M docs/day (348/s, bs=64) | 855K docs/day (10/s, bs=16) |
| 256 tokens | 17.9M docs/day (207/s, bs=64) | 397K docs/day (4.6/s, bs=8) |
| 512 tokens | 8.7M docs/day (101/s, bs=64) | 156K docs/day (1.8/s, bs=1) |
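
The projections are just the sustained rate multiplied by 86,400 seconds per day; small differences (e.g. 39.7M vs 39.8M) come from rounding of the measured rate.

```python
def docs_per_day(docs_per_second: float) -> float:
    return docs_per_second * 86_400  # seconds in a day, assuming 24/7 operation

print(f"{docs_per_day(460) / 1e6:.1f}M docs/day")  # ~39.7M at 460 docs/s
```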

Multi-GPU Scaling Estimates

| Config | seq=128 | seq=256 | seq=512 |
|---|---|---|---|
| 1× A10G (24 GB, ~$1/hr) | 30M/day | 18M/day | 8.7M/day |
| 1× A100 (80 GB, ~$3/hr) | ~70M/day¹ | ~42M/day¹ | ~20M/day¹ |
| 4× A10G data-parallel | 120M/day | 72M/day | 35M/day |
| 8× A10G data-parallel | 240M/day | 143M/day | 70M/day |

¹ A100 estimates are linearly extrapolated from A10G numbers using A100's ~2.3× higher memory bandwidth and larger batch capacity. Actual numbers will vary — benchmark on your target hardware.
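
The 4×/8× rows assume plain data parallelism: one full replica per GPU with the document stream sharded across devices. A rough, illustrative sketch only (one worker thread per device; a production deployment would more likely run one process per GPU behind a queue, and round-robin sharding reorders results relative to the input):

```python
from concurrent.futures import ThreadPoolExecutor
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

REPO = "binga/privacy-filter-multitask"

def worker(device, docs, batch_size=64):
    tok = AutoTokenizer.from_pretrained(REPO)
    mdl = AutoModelForTokenClassification.from_pretrained(
        REPO, dtype=torch.bfloat16).to(device).eval()
    ner_preds = []
    with torch.no_grad():
        for i in range(0, len(docs), batch_size):
            batch = tok(docs[i:i + batch_size], return_tensors="pt", padding=True,
                        truncation=True, max_length=256).to(device)
            # Only NER tags are collected here to keep the sketch short; the doc
            # head would be applied to the pooled hidden states as shown in Usage.
            ner_preds.extend(mdl(**batch).logits.argmax(-1).tolist())
    return ner_preds

def run_data_parallel(docs):
    n = torch.cuda.device_count()
    shards = [docs[i::n] for i in range(n)]  # round-robin split across GPUs
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(worker, [f"cuda:{i}" for i in range(n)], shards)
    return [p for shard in results for p in shard]
```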

Serving Recommendations

| Deployment Scenario | Recommended Config | Expected Perf |
|---|---|---|
| Real-time API (SLA <200ms) | 1× GPU, bs=1, seq≤512 | ~106 ms p50, ~113 ms p95 |
| Near-real-time (SLA <500ms) | 1× GPU, bs=8–16, seq≤512 | 53–73 docs/s, p95 <225 ms |
| High-throughput batch | 1× GPU, bs=64, seq=256 | 207 docs/s, 17.9M/day |
| Max throughput batch | 1× GPU, bs=64, seq=64² | 460 docs/s, 39.8M/day |
| CPU offline / dev | CPU, bs=1, seq≤256 | 3–7 docs/s |

² At seq=64 most documents will be truncated. Use seq=128–256 for production balance.

Key observations:

  • The model has a fixed ~105 ms overhead per forward pass regardless of sequence length (MoE routing + expert dispatch). Batching amortizes this cost across documents — the per-doc cost drops from 106 ms (bs=1) to under 10 ms (bs=64).
  • Memory is not the bottleneck — even at bs=64/seq=512 the model uses only 6.2 GB. You can run this on a T4 (16 GB) with room to spare.
  • Optimal batch size for throughput: bs=64 for all sequence lengths on A10G.
  • Optimal batch size for latency-constrained workloads: bs=8–16 gives a good per-doc latency (13–19 ms) while keeping batch latency under 225 ms.

Training Strategy

Two-phase training approach:

  1. Phase 1 — Multi-task fine-tuning: Unfroze only the last 4 MoE layers + both task heads. Trained on 20K NER examples (ai4privacy) + 20K doc examples (Yahoo Answers) with a multi-task loss (NER×1.0 + Doc×0.5). 2 epochs, LR=2e-5.

  2. Phase 2 — Doc head retraining (head-only): Froze the entire backbone + NER head. Pre-computed 640-dim pooled features for 100K Yahoo Answers examples, then trained a fresh Linear(640→10) classifier for 10 epochs, LR=1e-3, cosine decay (a minimal sketch follows this list). This approach:

    • Preserves NER performance exactly (backbone untouched)
    • Is extremely fast (~seconds per epoch on cached features)
    • Achieves 47.8% test accuracy (up from 24.8% in phase 1)
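
A minimal sketch of the phase-2 head-only retraining with the stated hyperparameters (10 epochs, LR=1e-3, cosine decay); the optimizer choice, batch size, and feature-caching step are assumptions, not documented details:

```python
import torch
import torch.nn as nn

def train_doc_head(features, labels, epochs=10, lr=1e-3, batch_size=256, device="cuda"):
    """Train a fresh Linear(640 -> 10) classifier on cached mean-pooled features.

    features: [N, 640] float tensor (pre-computed with the frozen backbone)
    labels:   [N] long tensor of document category ids
    """
    head = nn.Linear(640, 10).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)          # optimizer assumed
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loss_fn = nn.CrossEntropyLoss()
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(features, labels),
        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(head(x.to(device).float()), y.to(device))
            loss.backward()
            opt.step()
        sched.step()                                            # cosine decay per epoch
    return head
```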

Usage

```python
import torch
import torch.nn as nn
from transformers import AutoModelForTokenClassification, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("binga/privacy-filter-multitask")
model = AutoModelForTokenClassification.from_pretrained(
    "binga/privacy-filter-multitask", dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Load document classification head
doc_head = nn.Linear(640, 10)
doc_head.load_state_dict(torch.load(
    hf_hub_download("binga/privacy-filter-multitask", "doc_head.pt"),
    weights_only=True, map_location=model.device
))
doc_head = doc_head.to(dtype=torch.bfloat16, device=model.device)
doc_head.eval()

# Inference
text = "John Smith (SSN: 123-45-6789) emailed john@corp.com about Q3 earnings."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# === PII Detection ===
print("PII entities:")
for tok, pred in zip(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
    outputs.logits.argmax(-1)[0]
):
    label = model.config.id2label[pred.item()]
    if label != "O":
        print(f"  {tok} → {label}")

# === Document Classification ===
# 10 document categories, index-aligned with the classifier output
categories = [
    "Society & Culture", "Science & Mathematics", "Health", "Education & Reference",
    "Computers & Internet", "Sports", "Business & Finance",
    "Entertainment & Music", "Family & Relationships", "Politics & Government",
]
hidden = outputs.hidden_states[-1]
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
probs = torch.softmax(doc_head(pooled)[0].float(), dim=-1)
top = probs.argmax().item()
print(f"\nCategory: {categories[top]} ({probs[top]:.1%})")
```
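
The loop above prints raw per-token tags. For downstream redaction you usually want character-level spans instead; a hedged post-processing sketch that reuses `model`, `tokenizer`, and `text` from above and assumes tag strings of the form `B-/I-/E-/S-<type>` (check `model.config.id2label` for the actual format):

```python
import torch

def extract_spans(text, tokenizer, model):
    """Merge BIOES token predictions into (entity_type, start, end, text) spans."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    enc = enc.to(model.device)
    with torch.no_grad():
        preds = model(**enc).logits.argmax(-1)[0].tolist()

    spans, current = [], None
    for (start, end), pred in zip(offsets, preds):
        label = model.config.id2label[pred]
        if start == end or label == "O":       # special tokens or non-entity
            current = None if label == "O" else current
            continue
        prefix, etype = label.split("-", 1)
        if prefix == "S":                      # single-token entity
            spans.append([etype, start, end])
            current = None
        elif prefix == "B" or current is None or current[0] != etype:
            current = [etype, start, end]      # start a new span
            spans.append(current)
        else:                                  # I / E extend the open span
            current[2] = end
            if prefix == "E":
                current = None
    return [(etype, s, e, text[s:e]) for etype, s, e in spans]

print(extract_spans(text, tokenizer, model))
```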

Batched Inference (Production)

```python
# Process a batch of documents — both tasks in a single forward pass
texts = ["doc1...", "doc2...", "doc3...", ...]
inputs = tokenizer(texts, return_tensors="pt", padding=True,
                   truncation=True, max_length=256).to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# NER predictions for all docs: [batch, seq_len]
ner_preds = outputs.logits.argmax(dim=-1)

# Doc class for all docs: [batch]
hidden = outputs.hidden_states[-1]
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
doc_preds = doc_head(pooled).argmax(dim=-1)
```

Example Outputs

| Input | PII Detected | Category (confidence) |
|---|---|---|
| "My name is John Smith... email john@example.com" | ✅ John Smith, john@example.com, 123 Main St | Computers & Internet (56%) |
| "Liverpool FC defeated Manchester City 3-1" | ❌ None | Sports (98%) |
| "Federal Reserve announced a rate cut" | ❌ None | Politics (52%) |
| "health benefits of meditation and yoga" | ❌ None | Health (38%) |
| "Patient Jane Doe (SSN: 123-45-6789)" | ✅ Jane Doe, 123-45-6789, jane.doe@hospital.com | Education (41%) |
| "learn programming? I want to learn Python" | ❌ None | Education (53%) |
| "legal to record phone calls in California?" | ❌ None | Politics (64%) |

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 2.6 GB | Backbone + NER head (1.4B MoE params) |
| doc_head.pt | 26 KB | Document classification head (640→10) |
| config.json | 3 KB | Model architecture config |
| tokenizer.json | 27 MB | BPE tokenizer (o200k_base) |
| multitask_config.json | 349 B | Multi-task metadata |