Privacy Filter Multi-Task 🔒📄

A single model for simultaneous PII Detection (NER) and Document Classification (10 categories).

Adapted from openai/privacy-filter — a 1.4B Sparse MoE transformer with only ~50M active parameters per token.

Architecture

```
Input → BPE Tokenizer (o200k_base, 200K vocab)
  ↓
8-layer Sparse MoE Transformer
  • 128 experts, top-4 routing (~50M active params/token)
  • Banded sliding-window attention (window=128)
  • GQA: 14 query heads, 2 KV heads, head_dim=64
  • Hidden size: 640
  ↓                          ↓
NER Head (640→33)        Doc Head (mean-pool → 640→10)
  ↓                          ↓
BIOES PII tags            10-class document category
```
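
For orientation, here is a minimal sketch of how the two heads sit on top of the backbone's 640-dim token states (an illustrative module, not the exact implementation in this repo):

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Sketch of the two task heads over 640-dim backbone hidden states."""
    def __init__(self, hidden=640, ner_labels=33, doc_classes=10):
        super().__init__()
        self.ner_head = nn.Linear(hidden, ner_labels)   # per-token BIOES logits
        self.doc_head = nn.Linear(hidden, doc_classes)  # document-level logits

    def forward(self, hidden_states, attention_mask):
        # hidden_states: [batch, seq, 640], attention_mask: [batch, seq]
        ner_logits = self.ner_head(hidden_states)                  # [batch, seq, 33]
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1)  # mean-pool
        doc_logits = self.doc_head(pooled)                         # [batch, 10]
        return ner_logits, doc_logits
```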

Results

PII Detection (NER)

| Metric | Value |
|---|---|
| F1 (strict span-level) | 0.493 |
| Precision | 0.697 |
| Recall | 0.381 |
| Token Accuracy | 0.944 |

8 entity types: private_person · private_email · private_phone · private_address · private_date · private_url · account_number · secret
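
Under the BIOES scheme each entity type expands into four positional tags plus a shared O tag, which is where the NER head's 33 outputs come from (8 × 4 + 1 = 33). A small sketch; the exact tag strings live in config.json and may be formatted differently:

```python
# 8 entity types x 4 BIOES prefixes + "O" = 33 labels (hypothetical tag naming).
ENTITY_TYPES = [
    "private_person", "private_email", "private_phone", "private_address",
    "private_date", "private_url", "account_number", "secret",
]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I", "E", "S")]
assert len(LABELS) == 33  # matches the NER head: Linear(640 -> 33)
```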

Document Classification (10 classes)

| Split | Accuracy |
|---|---|
| Val | 0.470 |
| Test | 0.478 |

Per-class test accuracy:

| Category | Accuracy |
|---|---|
| Computers & Internet | 0.688 |
| Family & Relationships | 0.615 |
| Science & Mathematics | 0.556 |
| Health | 0.524 |
| Sports | 0.523 |
| Politics & Government | 0.493 |
| Entertainment & Music | 0.444 |
| Society & Culture | 0.363 |
| Education & Reference | 0.310 |
| Business & Finance | 0.263 |

🚀 Production Inference Guide

All numbers below are measured on real hardware with both task heads (NER + doc classification) executing on every call. The benchmark runs a single forward pass that produces PII entity tags and the document category simultaneously.
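
A minimal sketch of such a measurement loop (assumed harness on a CUDA device, not the exact benchmark script, with `model`, `doc_head`, and `tokenizer` loaded as in the Usage section below): run the forward pass plus both heads, synchronize, and take mean/p95/p99 over repeated iterations after warmup.

```python
import time
import numpy as np
import torch

@torch.no_grad()
def bench(model, doc_head, tokenizer, batch_size=1, seq_len=128, iters=50, warmup=5):
    # Fixed-length dummy batch so results are comparable across configurations.
    texts = ["lorem ipsum " * seq_len] * batch_size
    inputs = tokenizer(texts, return_tensors="pt", padding="max_length",
                       truncation=True, max_length=seq_len).to(model.device)
    times_ms = []
    for _ in range(iters + warmup):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        out = model(**inputs, output_hidden_states=True)
        _ = out.logits.argmax(-1)                       # NER tags
        hidden = out.hidden_states[-1]
        mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
        _ = doc_head(pooled).argmax(-1)                 # document category
        torch.cuda.synchronize()
        times_ms.append((time.perf_counter() - t0) * 1000)
    t = np.array(times_ms[warmup:])                     # drop warmup iterations
    return t.mean(), np.percentile(t, 95), np.percentile(t, 99)
```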

Resource Requirements

| Resource | Value |
|---|---|
| Model weights (bf16) | 2.8 GB GPU VRAM / RAM |
| Model weights (fp32) | 5.6 GB RAM |
| ONNX variants available upstream | fp16, int8, q4 (see openai/privacy-filter) |
| Min GPU VRAM (bs=1, seq≤512) | 2.9 GB |
| Min GPU VRAM (bs=64, seq=512) | 6.2 GB |
| Fits on | T4 (16 GB), L4 (24 GB), A10G (24 GB), A100, any ≥8 GB GPU |

GPU — Single-Document Latency (NVIDIA A10G, bf16)

Time from raw text to both NER tags + document category:

| Sequence Length | Latency (mean) | Latency (p95) | Latency (p99) |
|---|---|---|---|
| 64 tokens | 113 ms | 117 ms | 122 ms |
| 128 tokens | 106 ms | 110 ms | 115 ms |
| 256 tokens | 106 ms | 111 ms | 113 ms |
| 512 tokens | 106 ms | 113 ms | 116 ms |

Latency is dominated by a fixed ~105 ms kernel-launch overhead from the Sparse MoE routing — it barely changes with sequence length up to 512 tokens.

GPU — Batched Throughput (NVIDIA A10G, bf16)

| Batch Size | Seq 64 | Seq 128 | Seq 256 | Seq 512 |
|---|---|---|---|---|
| 1 | 8.9 docs/s | 9.4 docs/s | 9.4 docs/s | 9.4 docs/s |
| 4 | 36 docs/s | 37 docs/s | 37 docs/s | 32 docs/s |
| 8 | 73 docs/s | 73 docs/s | 69 docs/s | 53 docs/s |
| 16 | 139 docs/s | 138 docs/s | 114 docs/s | 73 docs/s |
| 32 | 265 docs/s | 238 docs/s | 165 docs/s | 89 docs/s |
| 64 | 460 docs/s | 348 docs/s | 207 docs/s | 101 docs/s |

GPU — Batched Latency Detail (NVIDIA A10G, bf16)

Full latency table:

| Batch | Seq Len | Batch Latency (ms) | Per-Doc (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|---|
| 1 | 64 | 113 | 112.7 | 117 | 122 |
| 4 | 64 | 111 | 27.8 | 116 | 118 |
| 8 | 64 | 110 | 13.8 | 114 | 126 |
| 16 | 64 | 115 | 7.2 | 121 | 125 |
| 32 | 64 | 121 | 3.8 | 127 | 135 |
| 64 | 64 | 139 | 2.2 | 144 | 144 |
| 1 | 128 | 106 | 105.9 | 110 | 115 |
| 4 | 128 | 107 | 26.9 | 112 | 115 |
| 8 | 128 | 110 | 13.7 | 115 | 116 |
| 16 | 128 | 116 | 7.3 | 121 | 128 |
| 32 | 128 | 134 | 4.2 | 139 | 143 |
| 64 | 128 | 184 | 2.9 | 189 | 191 |
| 1 | 256 | 106 | 106.1 | 111 | 113 |
| 4 | 256 | 109 | 27.2 | 114 | 115 |
| 8 | 256 | 117 | 14.6 | 123 | 126 |
| 16 | 256 | 140 | 8.8 | 145 | 147 |
| 32 | 256 | 194 | 6.1 | 199 | 202 |
| 64 | 256 | 309 | 4.8 | 314 | 315 |
| 1 | 512 | 106 | 106.5 | 113 | 116 |
| 4 | 512 | 125 | 31.2 | 129 | 130 |
| 8 | 512 | 152 | 19.0 | 158 | 165 |
| 16 | 512 | 219 | 13.7 | 223 | 225 |
| 32 | 512 | 358 | 11.2 | 361 | 364 |
| 64 | 512 | 636 | 9.9 | 639 | 641 |

GPU — Peak VRAM Usage (bf16)

| Batch Size | Seq 128 | Seq 256 | Seq 512 |
|---|---|---|---|
| 1 | 2,817 MB | 2,824 MB | 2,862 MB |
| 8 | 2,857 MB | 2,936 MB | 3,237 MB |
| 32 | 3,000 MB | 3,309 MB | 4,522 MB |
| 64 | 3,189 MB | 3,809 MB | 6,236 MB |

The model is extremely memory-efficient. Even at batch=64, seq=512 it uses only 6.2 GB, which fits comfortably on a T4 (16 GB). This is because the Sparse MoE only activates 4 of 128 experts per token.

CPU — Latency & Throughput (AMD EPYC 7R32, 8 cores, fp32)

| Batch | Seq 64 | Seq 128 | Seq 256 | Seq 512 |
|---|---|---|---|---|
| 1 | 152 ms (6.6/s) | 193 ms (5.2/s) | 302 ms (3.3/s) | 569 ms (1.8/s) |
| 4 | 278 ms (14.4/s) | 468 ms (8.6/s) | 935 ms (4.3/s) | 2,464 ms (1.6/s) |
| 8 | 467 ms (17.1/s) | 862 ms (9.3/s) | 1,728 ms (4.6/s) | 4,745 ms (1.7/s) |
| 16 | 837 ms (19.1/s) | 1,624 ms (9.9/s) | 3,814 ms (4.2/s) | 9,143 ms (1.7/s) |

On CPU the model runs at ~152 ms/doc for short texts (seq=64, bs=1) — suitable for low-volume or batch-offline pipelines.
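
When reproducing the CPU numbers, pin the thread count to the physical cores. A minimal fp32 CPU setup (the thread count here is an assumption matching the 8-core machine above, not a documented benchmark setting):

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

torch.set_num_threads(8)  # match physical core count; tune for your machine

tokenizer = AutoTokenizer.from_pretrained("binga/privacy-filter-multitask")
cpu_model = AutoModelForTokenClassification.from_pretrained(
    "binga/privacy-filter-multitask", dtype=torch.float32  # fp32 on CPU, as benchmarked
)
cpu_model.eval()
```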

Daily Throughput Projections

Sustained throughput for a single device, running 24/7 at the optimal batch size:

| Sequence Length | GPU (A10G, bf16) | CPU (8-core, fp32) |
|---|---|---|
| 64 tokens | 39.8M docs/day (460/s, bs=64) | 1.7M docs/day (19/s, bs=16) |
| 128 tokens | 30.1M docs/day (348/s, bs=64) | 855K docs/day (10/s, bs=16) |
| 256 tokens | 17.9M docs/day (207/s, bs=64) | 397K docs/day (4.6/s, bs=8) |
| 512 tokens | 8.7M docs/day (101/s, bs=64) | 156K docs/day (1.8/s, bs=1) |
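
The projections are just the sustained rate multiplied by 86,400 seconds per day; small differences (e.g. 39.7M vs 39.8M) come from rounding of the measured rate.

```python
def docs_per_day(docs_per_second: float) -> float:
    return docs_per_second * 86_400  # seconds in a day, assuming 24/7 operation

print(f"{docs_per_day(460) / 1e6:.1f}M docs/day")  # ~39.7M at 460 docs/s
```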

Multi-GPU Scaling Estimates

| Config | seq=128 | seq=256 | seq=512 |
|---|---|---|---|
| 1× A10G (24 GB, ~$1/hr) | 30M/day | 18M/day | 8.7M/day |
| 1× A100 (80 GB, ~$3/hr) | ~70M/day¹ | ~42M/day¹ | ~20M/day¹ |
| 4× A10G data-parallel | 120M/day | 72M/day | 35M/day |
| 8× A10G data-parallel | 240M/day | 143M/day | 70M/day |

¹ A100 estimates are linearly extrapolated from A10G numbers using A100's ~2.3× higher memory bandwidth and larger batch capacity. Actual numbers will vary — benchmark on your target hardware.
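
The 4×/8× rows assume plain data parallelism: one full replica per GPU with the document stream sharded across devices. A rough, illustrative sketch only (one worker thread per device; a production deployment would more likely run one process per GPU behind a queue, and round-robin sharding reorders results relative to the input):

```python
from concurrent.futures import ThreadPoolExecutor
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

REPO = "binga/privacy-filter-multitask"

def worker(device, docs, batch_size=64):
    tok = AutoTokenizer.from_pretrained(REPO)
    mdl = AutoModelForTokenClassification.from_pretrained(
        REPO, dtype=torch.bfloat16).to(device).eval()
    ner_preds = []
    with torch.no_grad():
        for i in range(0, len(docs), batch_size):
            batch = tok(docs[i:i + batch_size], return_tensors="pt", padding=True,
                        truncation=True, max_length=256).to(device)
            # Only NER tags are collected here to keep the sketch short; the doc
            # head would be applied to the pooled hidden states as shown in Usage.
            ner_preds.extend(mdl(**batch).logits.argmax(-1).tolist())
    return ner_preds

def run_data_parallel(docs):
    n = torch.cuda.device_count()
    shards = [docs[i::n] for i in range(n)]  # round-robin split across GPUs
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(worker, [f"cuda:{i}" for i in range(n)], shards)
    return [p for shard in results for p in shard]
```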

Serving Recommendations

| Deployment Scenario | Recommended Config | Expected Perf |
|---|---|---|
| Real-time API (SLA <200ms) | 1× GPU, bs=1, seq≤512 | ~106 ms p50, ~113 ms p95 |
| Near-real-time (SLA <500ms) | 1× GPU, bs=8–16, seq≤512 | 53–73 docs/s, p95 <225 ms |
| High-throughput batch | 1× GPU, bs=64, seq=256 | 207 docs/s, 17.9M/day |
| Max throughput batch | 1× GPU, bs=64, seq=64² | 460 docs/s, 39.8M/day |
| CPU offline / dev | CPU, bs=1, seq≤256 | 3–7 docs/s |

² At seq=64 most documents will be truncated. Use seq=128–256 for production balance.

Key observations:

  • The model has a fixed ~105 ms overhead per forward pass regardless of sequence length (MoE routing + expert dispatch). Batching amortizes this cost across documents — the per-doc cost drops from 106 ms (bs=1) to under 10 ms (bs=64).
  • Memory is not the bottleneck — even at bs=64/seq=512 the model uses only 6.2 GB. You can run this on a T4 (16 GB) with room to spare.
  • Optimal batch size for throughput: bs=64 for all sequence lengths on A10G.
  • Optimal batch size for latency-constrained workloads: bs=8–16 gives a good per-doc latency (13–19 ms) while keeping batch latency under 225 ms.

Training Strategy

Two-phase training approach:

  1. Phase 1 — Multi-task fine-tuning: Unfroze only the last 4 MoE layers + both task heads. Trained on 20K NER examples (ai4privacy) + 20K doc examples (Yahoo Answers) with a multi-task loss (NER×1.0 + Doc×0.5). 2 epochs, LR=2e-5.

  2. Phase 2 — Doc head retraining (head-only): Froze the entire backbone + NER head. Pre-computed 640-dim pooled features for 100K Yahoo Answers examples, then trained a fresh Linear(640→10) classifier for 10 epochs, LR=1e-3, cosine decay (a minimal sketch follows this list). This approach:

    • Preserves NER performance exactly (backbone untouched)
    • Is extremely fast (~seconds per epoch on cached features)
    • Achieves 47.8% test accuracy (up from 24.8% in phase 1)
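
A minimal sketch of the phase-2 head-only retraining with the stated hyperparameters (10 epochs, LR=1e-3, cosine decay); the optimizer choice, batch size, and feature-caching step are assumptions, not documented details:

```python
import torch
import torch.nn as nn

def train_doc_head(features, labels, epochs=10, lr=1e-3, batch_size=256, device="cuda"):
    """Train a fresh Linear(640 -> 10) classifier on cached mean-pooled features.

    features: [N, 640] float tensor (pre-computed with the frozen backbone)
    labels:   [N] long tensor of document category ids
    """
    head = nn.Linear(640, 10).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)          # optimizer assumed
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loss_fn = nn.CrossEntropyLoss()
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(features, labels),
        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(head(x.to(device).float()), y.to(device))
            loss.backward()
            opt.step()
        sched.step()                                            # cosine decay per epoch
    return head
```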

Usage

```python
import torch
import torch.nn as nn
from transformers import AutoModelForTokenClassification, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("binga/privacy-filter-multitask")
model = AutoModelForTokenClassification.from_pretrained(
    "binga/privacy-filter-multitask", dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Load document classification head
doc_head = nn.Linear(640, 10)
doc_head.load_state_dict(torch.load(
    hf_hub_download("binga/privacy-filter-multitask", "doc_head.pt"),
    weights_only=True, map_location=model.device
))
doc_head = doc_head.to(dtype=torch.bfloat16, device=model.device)
doc_head.eval()

# Inference
text = "John Smith (SSN: 123-45-6789) emailed john@corp.com about Q3 earnings."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# === PII Detection ===
print("PII entities:")
for tok, pred in zip(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
    outputs.logits.argmax(-1)[0]
):
    label = model.config.id2label[pred.item()]
    if label != "O":
        print(f"  {tok} → {label}")

# === Document Classification ===
# 10 document categories, index-aligned with the classifier output
categories = [
    "Society & Culture", "Science & Mathematics", "Health", "Education & Reference",
    "Computers & Internet", "Sports", "Business & Finance",
    "Entertainment & Music", "Family & Relationships", "Politics & Government",
]
hidden = outputs.hidden_states[-1]
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
probs = torch.softmax(doc_head(pooled)[0].float(), dim=-1)
top = probs.argmax().item()
print(f"\nCategory: {categories[top]} ({probs[top]:.1%})")
```
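
The loop above prints raw per-token tags. For downstream redaction you usually want character-level spans instead; a hedged post-processing sketch that reuses `model`, `tokenizer`, and `text` from above and assumes tag strings of the form `B-/I-/E-/S-<type>` (check `model.config.id2label` for the actual format):

```python
import torch

def extract_spans(text, tokenizer, model):
    """Merge BIOES token predictions into (entity_type, start, end, text) spans."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    enc = enc.to(model.device)
    with torch.no_grad():
        preds = model(**enc).logits.argmax(-1)[0].tolist()

    spans, current = [], None
    for (start, end), pred in zip(offsets, preds):
        label = model.config.id2label[pred]
        if start == end or label == "O":       # special tokens or non-entity
            current = None if label == "O" else current
            continue
        prefix, etype = label.split("-", 1)
        if prefix == "S":                      # single-token entity
            spans.append([etype, start, end])
            current = None
        elif prefix == "B" or current is None or current[0] != etype:
            current = [etype, start, end]      # start a new span
            spans.append(current)
        else:                                  # I / E extend the open span
            current[2] = end
            if prefix == "E":
                current = None
    return [(etype, s, e, text[s:e]) for etype, s, e in spans]

print(extract_spans(text, tokenizer, model))
```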

Batched Inference (Production)

```python
# Process a batch of documents — both tasks in a single forward pass
texts = ["doc1...", "doc2...", "doc3...", ...]
inputs = tokenizer(texts, return_tensors="pt", padding=True,
                   truncation=True, max_length=256).to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# NER predictions for all docs: [batch, seq_len]
ner_preds = outputs.logits.argmax(dim=-1)

# Doc class for all docs: [batch]
hidden = outputs.hidden_states[-1]
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
doc_preds = doc_head(pooled).argmax(dim=-1)
```

Example Outputs

| Input | PII Detected | Category (confidence) |
|---|---|---|
| "My name is John Smith... email john@example.com" | ✅ John Smith, john@example.com, 123 Main St | Computers & Internet (56%) |
| "Liverpool FC defeated Manchester City 3-1" | ❌ None | Sports (98%) |
| "Federal Reserve announced a rate cut" | ❌ None | Politics (52%) |
| "health benefits of meditation and yoga" | ❌ None | Health (38%) |
| "Patient Jane Doe (SSN: 123-45-6789)" | ✅ Jane Doe, 123-45-6789, jane.doe@hospital.com | Education (41%) |
| "learn programming? I want to learn Python" | ❌ None | Education (53%) |
| "legal to record phone calls in California?" | ❌ None | Politics (64%) |

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 2.6 GB | Backbone + NER head (1.4B MoE params) |
| doc_head.pt | 26 KB | Document classification head (640→10) |
| config.json | 3 KB | Model architecture config |
| tokenizer.json | 27 MB | BPE tokenizer (o200k_base) |
| multitask_config.json | 349 B | Multi-task metadata |