Update README.md
README.md
CHANGED
@@ -59,7 +59,11 @@ print(nlp(text))
## Training and evaluation data
- Sources: A mix of synthetic data and public or weakly labeled PII corpora. Synthetic data was generated from pattern templates, with optional LLM augmentation (vLLM/OpenAI-compatible), to cover names, emails, phone numbers, SSNs, PCI data (card number/expiry/CVV/last4), bank account and routing numbers, IP addresses, credentials, and healthcare identifiers. Public components include Nemotron-PII, AI4Privacy PII, Mendeley financial PII, and optional weak labeling over Enron-style text. Labels were normalized into a common schema; unsupported labels were dropped.
- Splits: If no validation file is provided, the training JSONL is auto-split 90/10 (train/val) with `train_test_split(test_size=0.1, seed=42)`.
- Class balancing: Inverse-frequency class weights were applied to mitigate the dominant O class.
- Notes: PCI coverage includes spaced/dashed card formats and expiries; regex/Luhn hard negatives were used to reduce false positives. Evaluation metrics are token-level precision/recall/F1 (seqeval) on the held-out validation split.
- Limitations: Mostly English; domain and format shifts may impact performance. Test on your own data and adjust thresholds/label mappings as needed.
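
The class-balancing bullet above does not state the exact weighting formula, so the following is only a minimal sketch of one common inverse-frequency scheme, `weight = total / (num_classes * count)`. The function name `inverse_frequency_weights` and the toy BIO tags are hypothetical and are not taken from the training code.

```python
from collections import Counter


def inverse_frequency_weights(label_sequences):
    """Return {label: weight}, where weight = total / (num_classes * count).

    Assumption: this "balanced" heuristic stands in for whatever
    inverse-frequency formula the actual training script uses.
    """
    # Count every token label across all sequences.
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * count) for label, count in counts.items()}


# Toy data with hypothetical tags: "O" dominates, as in most PII corpora.
toy_labels = [
    ["O", "O", "O", "B-EMAIL"],
    ["O", "O", "B-SSN", "O"],
]
weights = inverse_frequency_weights(toy_labels)
print(weights)
```

Rare entity labels end up with larger weights than `O`, which is the intended effect when the weights are passed to a weighted loss.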
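
The note on regex/Luhn hard negatives can be illustrated roughly as follows: card-like digit runs (including spaced or dashed formats) are found with a regex, and those that fail the Luhn checksum are treated as hard negatives. This is a sketch under assumptions; the pattern `CARD_LIKE` and the helper `split_card_candidates` are hypothetical names, and the actual data pipeline may differ.

```python
import re

# Hypothetical pattern: 13 to 19 digits, each optionally followed by a space or dash.
CARD_LIKE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Check the Luhn checksum, ignoring any non-digit separators."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def split_card_candidates(text: str):
    """Split card-like matches into Luhn-valid candidates and hard negatives."""
    positives, hard_negatives = [], []
    for match in CARD_LIKE.finditer(text):
        target = positives if luhn_valid(match.group()) else hard_negatives
        target.append(match.group())
    return positives, hard_negatives


sample = "card 4111 1111 1111 1111 and order 4111-1111-1111-1112 today"
positives, hard_negatives = split_card_candidates(sample)
print(positives, hard_negatives)
```

Strings in `hard_negatives` look like card numbers but fail the checksum, so labeling them as `O` during training is one way to discourage false positives on order IDs and similar digit runs.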
## Training procedure