msdakot/fintech-privacy-filter-v0
fintech-privacy-filter-v0 is a fine-tuned version of openai/privacy-filter specialized for fintech PII detection across 65 entity categories — extending the base model with 10 new financial identifiers (IBAN, LEI, ISIN, CUSIP, VAT numbers, etc.) and multilingual coverage across 7 European languages.
Key Specifications
- Base model: `openai/privacy-filter`, a 1.4B-parameter MoE (50M active per token) with a BIOES token-classification head
- Task: Token classification for PII detection (BIOES scheme)
- Training data: 55,121 examples from Gretel, Nemotron (finance-domain), and synthetic fintech supplement
- Fine-tuning recipe: `opf train`, full fine-tune, AdamW, lr=1e-4, 3 epochs, bf16, weight_decay=0.0
- Labels: 65 entity categories → 261 BIOES classes
- Languages: English, Spanish, Swedish, German, Italian, Dutch, French
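The 261 figure follows from the BIOES scheme: each category gets four tags (begin, inside, end, single) and all categories share one outside tag. A minimal sketch of that expansion, assuming the conventional `B-`/`I-`/`E-`/`S-` prefixes and using placeholder category names rather than the model's actual label strings:

```python
def bioes_classes(categories):
    """Expand entity categories into BIOES tag classes plus the outside tag."""
    tags = ["O"]  # one shared "outside" class
    for name in categories:
        # B = begin, I = inside, E = end, S = single-token span
        tags.extend(f"{prefix}-{name}" for prefix in ("B", "I", "E", "S"))
    return tags


# Placeholder names stand in for the 65 real categories (assumption).
categories = [f"category_{i}" for i in range(65)]
classes = bioes_classes(categories)
print(len(classes))  # 4 * 65 + 1 = 261
```

The exact tag strings and their ordering in the checkpoint's config may differ; only the count is taken from the model card.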
Quick Start
With OpenAI's opf CLI
pip install 'opf @ git+https://github.com/openai/privacy-filter.git'
opf redact \
  --checkpoint msdakot/fintech-privacy-filter-v0 \
  --text "Wire €50,000 to IBAN DE89370400440532013000, LEI 529900T8BM49AURSDO55."
With Transformers
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="msdakot/fintech-privacy-filter-v0",
    aggregation_strategy="simple",
    trust_remote_code=True,
)

text = "Wire €50,000 to IBAN DE89370400440532013000, LEI 529900T8BM49AURSDO55, ISIN US0378331005."
for r in pipe(text):
    print(f"{r['word']!r:45s} → {r['entity_group']} (score={r['score']:.3f})")
Note: This model uses the `o200k_base` tiktoken tokenizer (same as GPT-4o). Pass `trust_remote_code=True` for both the tokenizer and model.
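To turn the Transformers output into redacted text, the detected spans can be replaced by placeholders using their character offsets. The sketch below assumes the pipeline returns the usual aggregated keys (`entity_group`, `start`, `end`); the `redact` helper and the hand-written `entities` list are illustrative stand-ins, not part of the model or of `opf`.

```python
def redact(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    out = text
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        out = out[:ent["start"]] + placeholder + out[ent["end"]:]
    return out


text = "Wire €50,000 to IBAN DE89370400440532013000."
iban = "DE89370400440532013000"
start = text.index(iban)

# Hand-written stand-in for pipeline output (not real model predictions).
entities = [{"entity_group": "iban", "start": start, "end": start + len(iban)}]

redacted = redact(text, entities)
print(redacted)  # Wire €50,000 to IBAN [IBAN].
```

With a loaded pipeline, `redact(text, pipe(text))` would be the equivalent call, provided the returned spans do not overlap.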
Performance
Evaluated with opf eval (Viterbi decoding) on 6,891 held-out test examples:
| Metric | Value |
|---|---|
| Span F1 (overall) | 0.800 |
| Span Precision | 0.848 |
| Span Recall | 0.758 |
| Token accuracy | 0.977 |
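As a quick consistency check, span F1 is the harmonic mean of span precision and span recall, so the overall figure can be recomputed from the other two rows. Because the reported inputs are already rounded, the result is only expected to agree to about three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.848, 0.758)
print(f"{f1:.3f}")  # 0.800
```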
Fintech Label Performance (New Labels)
| Label | Span F1 |
|---|---|
| `cusip` | 1.000 |
| `isin` | 1.000 |
| `lei` | 1.000 |
| `vat_number` | 1.000 |
| `bban` | 0.912 |
| `iban` | 0.870 |
| `swift_mt_ref` | n/a |
| `policy_number` | n/a |
| `loan_number` | n/a |
| `crypto_address` | n/a |
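Four fintech labels (`swift_mt_ref`, `policy_number`, `loan_number`, `crypto_address`) have no test examples in v0, so no span F1 is reported for them; coverage is planned for v1.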
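The following sketch shows one common way to compute exact-match span F1 per label from BIOES tag sequences, and why a label with no gold spans has no meaningful score. It is an illustration under stated assumptions: the `bioes_to_spans` and `per_label_f1` helpers are hypothetical, and `opf eval` may match or aggregate spans differently.

```python
def bioes_to_spans(tags):
    """Convert a BIOES tag sequence into a set of (label, start, end) spans, end inclusive."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.add((label, i, i))
            start, current = None, None
        elif prefix == "B":
            start, current = i, label
        elif prefix == "E" and current == label and start is not None:
            spans.add((label, start, i))
            start, current = None, None
        elif prefix == "I" and current == label:
            continue  # still inside the open span
        else:
            # "O" or an inconsistent tag discards any open span
            start, current = None, None
    return spans


def per_label_f1(gold, pred, labels):
    """Exact-match span F1 per label; None when the label has no gold spans."""
    scores = {}
    for label in labels:
        g = {s for s in gold if s[0] == label}
        p = {s for s in pred if s[0] == label}
        if not g:
            scores[label] = None  # nothing to evaluate against
            continue
        tp = len(g & p)
        if tp == 0:
            scores[label] = 0.0
            continue
        precision, recall = tp / len(p), tp / len(g)
        scores[label] = 2 * precision * recall / (precision + recall)
    return scores


# Toy sequences (not real model output): the lei span is one token too long.
gold = bioes_to_spans(["S-iban", "O", "B-lei", "E-lei", "O"])
pred = bioes_to_spans(["S-iban", "O", "B-lei", "I-lei", "E-lei"])
scores = per_label_f1(gold, pred, ["iban", "lei", "loan_number"])
print(scores)  # {'iban': 1.0, 'lei': 0.0, 'loan_number': None}
```

Exact matching is strict: a span that is off by a single token counts as both a false positive and a false negative.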
Label Space (65 Categories)
New Fintech Labels (10)
| Label | Description |
|---|---|
| `iban` | International Bank Account Number (ISO 13616) |
| `bban` | Basic Bank Account Number (domestic IBAN component) |
| `lei` | Legal Entity Identifier (ISO 17442) |
| `isin` | International Securities Identification Number (ISO 6166) |
| `cusip` | North American securities identifier |
| `swift_mt_ref` | SWIFT MT message reference numbers |
| `policy_number` | Insurance policy references |
| `vat_number` | EU VAT identification numbers |
| `loan_number` | Mortgage and loan account references |
| `crypto_address` | Bitcoin / Ethereum / Solana wallet addresses |
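Several of these identifiers carry a checksum, which makes it possible to sanity-check a detected span independently of the model. As an illustration, here is a minimal sketch of the ISO 13616 mod-97 check for IBANs; the `iban_is_valid` helper is hypothetical and deliberately skips country-specific length and format rules.

```python
def iban_is_valid(iban):
    """ISO 13616 mod-97 check (no country-specific length validation)."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # Map characters to numbers: 0-9 stay as they are, A=10 ... Z=35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(iban_is_valid("DE89 3704 0044 0532 0130 00"))  # True
print(iban_is_valid("DE89370400440532013001"))       # False
```

A failed checksum on a span tagged `iban` can flag either a model error or a typo in the source text, so it is better treated as a review signal than as grounds to skip redaction.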
General PII Labels (55)
Identity: first_name, last_name, user_name, age, gender, race_ethnicity, sexuality, religious_belief, political_view, marital_status, nationality, education_level, occupation, employment_status, language, blood_type, biometric_identifier
Contact: email, phone_number, fax_number, url
Address: street_address, city, county, state, country, postcode, coordinate
Dates: date, date_of_birth, date_time, time
Government IDs: ssn, national_id, tax_id
Financial: account_number, bank_routing_number, swift_bic, credit_debit_card, cvv, pin, password
Healthcare: medical_record_number, health_plan_beneficiary_number
Enterprise: customer_id, employee_id, unique_id, certificate_license_number, company_name
Vehicle: license_plate, vehicle_identifier
Digital: ipv4, ipv6, mac_address, device_identifier, api_key, http_cookie
Training
| Parameter | Value |
|---|---|
| `learning_rate` | 1e-4 |
| `epochs` | 3 |
| `batch_size` | 2 |
| `grad_accumulation_steps` | 4 (effective batch = 8) |
| `dtype` | bf16 |
| `weight_decay` | 0.0 |
| Hardware | Google Colab T4, High-RAM runtime |
| Runtime | 5.9 hours |
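The effective batch size and a rough optimizer-step count follow from the table and the stated training-set size. This is back-of-the-envelope arithmetic that assumes all 55,121 training examples are seen once per epoch and that a final partial batch still triggers a step; `opf train` may schedule steps differently.

```python
import math

train_examples = 55_121       # from "Training data" in Key Specifications
batch_size = 2
grad_accumulation_steps = 4
epochs = 3

effective_batch = batch_size * grad_accumulation_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 8 6891 20673
```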
Data Sources
| Dataset | License | Records used |
|---|---|---|
| `gretelai/synthetic_pii_finance_multilingual` | Apache 2.0 | 50,346 |
| `nvidia/Nemotron-PII` | CC-BY-4.0 | 24,866 (finance-domain filtered) |
| Synthetic supplement | Apache 2.0 | 390 |
Harmonized dataset: msdakot/fintech-privacy-pii
Limitations
- Synthetic training data only: real financial documents may have different surface forms.
- Name label simplification: Gretel's full-name spans are mapped to `first_name`; `last_name` recall is lower as a result (known v0 limitation).
- Language coverage: 7 European languages only; not tested on others.
- Not a legal compliance substitute: use alongside deterministic regex filters and human review.
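A deterministic filter of the kind mentioned in the last point can be as simple as a regular expression plus a checksum. The sketch below finds ISIN candidates by shape and keeps only those that pass the Luhn check over the letter-expanded digits; the pattern and the helper names are illustrative assumptions, not something shipped with this model.

```python
import re

# Two letters, nine alphanumerics, one check digit (ISIN shape).
ISIN_RE = re.compile(r"\b[A-Z]{2}[A-Z0-9]{9}[0-9]\b")


def isin_checksum_ok(isin):
    """Luhn check over the ISIN with letters expanded to numbers (A=10 ... Z=35)."""
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_isins(text):
    """Return ISIN-shaped substrings whose checksum is valid."""
    return [m.group() for m in ISIN_RE.finditer(text) if isin_checksum_ok(m.group())]


text = "Wire €50,000 to IBAN DE89370400440532013000, LEI 529900T8BM49AURSDO55, ISIN US0378331005."
print(find_isins(text))  # ['US0378331005']
```

Taking the union of such rule-based hits with the model's spans trades a few extra false positives for recall that does not depend on the training distribution.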
Credits
- OpenAI: Privacy Filter architecture, modeling code, and `opf` CLI
- Gretel AI: `synthetic_pii_finance_multilingual` dataset
- NVIDIA: Nemotron-PII dataset
- OpenMed: fine-tuning recipe and methodology
License
Apache 2.0. Training data: Apache 2.0 (Gretel), CC-BY-4.0 (Nemotron).
Citation
@misc{msdakot_fintech_privacy_filter_2026,
author = {msdakot},
title = {{fintech-privacy-filter-v0}: Fintech-specialized PII detection with 65 entity categories},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/msdakot/fintech-privacy-filter-v0}}
}