TensorGreed/SensiGuard-PII

Token Classification

Generated from Trainer

TensorGreed committed 6 days ago

Commit b85dca5 · verified · 1 Parent(s): ccf904d

Update README.md

Files changed (1)

README.md +5 -1

@@ -59,7 +59,11 @@ print(nlp(text))
 ## Training and evaluation data
-More information needed
+- Sources: A mix of synthetic data and public or weakly labeled PII corpora. Synthetic data was generated with pattern templates and optional LLM augmentation (vLLM/OpenAI-compatible) to cover names, emails, phone numbers, SSNs, PCI data (card number/expiry/CVV/last4), bank account/routing numbers, IP addresses, credentials, and healthcare identifiers. Public components include Nemotron-PII, AI4Privacy PII, Mendeley financial PII, and optional weak labeling over Enron-style text. Labels were normalized into a common schema; unsupported labels were dropped.
+- Splits: If no validation file is provided, the training JSONL is auto-split 90/10 (train/val) with `train_test_split(test_size=0.1, seed=42)`.
+- Class balancing: Inverse-frequency class weights were applied to mitigate the dominant O class.
+- Notes: PCI coverage includes spaced/dashed card formats and expiries; regex/Luhn hard negatives were used to reduce false positives. Evaluation metrics are token-level precision/recall/F1 (seqeval) on the held-out validation split.
+- Limitations: Mostly English; domain and format shifts may impact performance. Test on your own data and adjust thresholds/label mappings as needed.
 ## Training procedure
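
The "Class balancing" note in the added README text mentions inverse-frequency class weights but does not give the formula. As a minimal sketch, the snippet below assumes the common "balanced" normalization, `total / (n_classes * count)`; the function name `inverse_frequency_weights` and the example labels are hypothetical and may differ from what the training script actually does.

```python
from collections import Counter


def inverse_frequency_weights(label_sequences):
    """Map each label to a weight inversely proportional to its frequency.

    Assumption: weight = total / (n_classes * count). The README only says
    "inverse-frequency", so the exact normalization here is a guess.
    """
    # Count every token label across all sequences.
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Hypothetical toy data: the O class dominates, as in most PII corpora.
example = [
    ["O", "O", "O", "B-EMAIL"],
    ["O", "O", "O", "B-SSN"],
]
weights = inverse_frequency_weights(example)
```

With weights like these, rare entity labels contribute more to the loss than the dominant `O` class; they would typically be passed to a weighted cross-entropy loss in label-id order.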
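
The "Notes" item in the added README text refers to regex/Luhn hard negatives for card numbers. The sketch below illustrates one way such a check could look: a standard Luhn checksum plus a card-like pattern that accepts spaced or dashed digits. The names `luhn_valid`, `is_hard_negative`, and the `CARD_LIKE` pattern are hypothetical and are not taken from the project's code.

```python
import re

# Assumed pattern: 13 to 19 digits, optionally separated by single spaces or dashes.
CARD_LIKE = re.compile(r"\d(?:[ -]?\d){12,18}")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_hard_negative(candidate: str) -> bool:
    """Card-shaped string that fails Luhn: looks like a card number but is not one."""
    return CARD_LIKE.fullmatch(candidate) is not None and not luhn_valid(candidate)
```

Strings that match the card shape but fail the checksum are useful negatives because they force the model to rely on context and not on digit layout alone.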