msdakot/fintech-privacy-filter-v0

fintech-privacy-filter-v0 is a fine-tuned version of openai/privacy-filter specialized for fintech PII detection across 65 entity categories — extending the base model with 10 new financial identifiers (IBAN, LEI, ISIN, CUSIP, VAT numbers, etc.) and multilingual coverage across 7 European languages.

Key Specifications

  • Base model: openai/privacy-filter — 1.4B-parameter MoE (50M active per token), BIOES token-classification head
  • Task: Token classification for PII detection (BIOES scheme)
  • Training data: 55,121 examples from Gretel, Nemotron (finance-domain), and synthetic fintech supplement
  • Fine-tuning recipe: opf train — full fine-tune, AdamW, lr=1e-4, 3 epochs, bf16, weight_decay=0.0
  • Labels: 65 entity categories → 261 BIOES classes
  • Languages: English, Spanish, Swedish, German, Italian, Dutch, French
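The 65 → 261 figure follows from the BIOES scheme: each category gets four positional tags (Begin, Inside, End, Single), and all categories share one Outside tag, so 65 × 4 + 1 = 261. A minimal sketch of that expansion, assuming a `B-iban`-style tag naming that is illustrative and not read from the model's config; `bioes_tags` is a hypothetical helper:

```python
# Illustrative only: the "B-<label>" naming pattern is an assumption,
# not taken from the model config.
def bioes_tags(categories):
    """Expand entity categories into BIOES tags plus one shared 'O' tag."""
    tags = ["O"]
    for name in categories:
        tags.extend(f"{prefix}-{name}" for prefix in ("B", "I", "E", "S"))
    return tags

# Placeholder category names; only the count (65) comes from the model card.
categories = [f"category_{i}" for i in range(65)]
tags = bioes_tags(categories)
print(len(tags))  # 65 * 4 + 1 = 261
```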

Quick Start

With OpenAI's opf CLI

pip install 'opf @ git+https://github.com/openai/privacy-filter.git'

opf redact \
  --checkpoint msdakot/fintech-privacy-filter-v0 \
  --text "Wire €50,000 to IBAN DE89370400440532013000, LEI 529900T8BM49AURSDO55."

With Transformers

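from transformers import pipeline

# trust_remote_code=True applies to both the tokenizer and the model (see note below).
pipe = pipeline(
    "token-classification",
    model="msdakot/fintech-privacy-filter-v0",
    aggregation_strategy="simple",  # merge token-level tags into entity spans
    trust_remote_code=True,
)

text = "Wire €50,000 to IBAN DE89370400440532013000, LEI 529900T8BM49AURSDO55, ISIN US0378331005."
# Print each detected span with its label and confidence score.
for r in pipe(text):
    print(f"{r['word']!r:45s} → {r['entity_group']}  (score={r['score']:.3f})")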

Note: This model uses the o200k_base tiktoken tokenizer (same as GPT-4o). Pass trust_remote_code=True for both the tokenizer and model.
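Detected spans can be turned into redacted text by replacing each one with a placeholder. The sketch below assumes each result carries `start` and `end` character offsets alongside `entity_group`, as the Transformers token-classification pipeline normally returns when an aggregation strategy is set; whether this model's remote-code tokenizer supplies offsets has not been verified here. The `redact` helper is hypothetical, and the sample span is hand-written for illustration, not real model output.

```python
def redact(text, spans):
    """Replace each span with a [LABEL] placeholder.

    Spans are applied right to left so earlier offsets stay valid.
    """
    out = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['entity_group'].upper()}]"
        out = out[: span["start"]] + placeholder + out[span["end"]:]
    return out

text = "Wire funds to IBAN DE89370400440532013000 today."
# Hand-written example span, not actual model output.
start = text.index("DE89")
spans = [{"entity_group": "iban", "start": start, "end": start + 22}]
print(redact(text, spans))  # Wire funds to IBAN [IBAN] today.
```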


Performance

Evaluated with opf eval (Viterbi decoding) on 6,891 held-out test examples:

Metric               Value
Span F1 (overall)    0.800
Span Precision       0.848
Span Recall          0.758
Token accuracy       0.977
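
As a consistency check, span F1 is the harmonic mean of span precision and recall. Assuming the overall figures are computed over the same pooled set of spans (the card does not state the averaging method), the reported precision and recall reproduce the reported F1:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.848, 0.758)
print(round(f1, 3))  # 0.8
```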

Fintech Label Performance (New Labels)

Label             Span F1
cusip             1.000
isin              1.000
lei               1.000
vat_number        1.000
bban              0.912
iban              0.870
swift_mt_ref      n/a
policy_number     n/a
loan_number       n/a
crypto_address    n/a

Four fintech labels (swift_mt_ref, policy_number, loan_number, crypto_address) have no test examples in v0, so no F1 is reported for them; coverage is planned for v1.
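
A label with no gold spans in the test split has no defined recall, which is why some rows carry no score. A minimal sketch of exact-match span scoring that makes this explicit, assuming spans are compared as `(start, end, label)` tuples; `opf eval` may use different matching rules, and `span_f1_by_label` is a hypothetical helper:

```python
from collections import defaultdict

def span_f1_by_label(gold, pred, labels):
    """Exact-match span F1 per label; None when a label has no gold spans."""
    gold_by, pred_by = defaultdict(set), defaultdict(set)
    for start, end, label in gold:
        gold_by[label].add((start, end))
    for start, end, label in pred:
        pred_by[label].add((start, end))

    scores = {}
    for label in labels:
        g, p = gold_by[label], pred_by[label]
        if not g:
            scores[label] = None  # no test examples, so no score is reported
            continue
        tp = len(g & p)  # spans whose boundaries match exactly
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g)
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores

# Toy data for illustration only.
gold = [(0, 22, "iban"), (30, 50, "lei")]
pred = [(0, 22, "iban"), (31, 50, "lei")]
scores = span_f1_by_label(gold, pred, ["iban", "lei", "loan_number"])
print(scores)
```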


Label Space (65 Categories)

New Fintech Labels (10)

Label             Description
iban              International Bank Account Number (ISO 13616)
bban              Basic Bank Account Number (domestic IBAN component)
lei               Legal Entity Identifier (ISO 17442)
isin              International Securities Identification Number (ISO 6166)
cusip             North American securities identifier
swift_mt_ref      SWIFT MT message reference numbers
policy_number     Insurance policy references
vat_number        EU VAT identification numbers
loan_number       Mortgage and loan account references
crypto_address    Bitcoin / Ethereum / Solana wallet addresses
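
Several of these identifiers carry check digits defined by their standards, so a detected span can be verified deterministically. A minimal sketch of two such checks: the ISO 13616 mod-97 test for IBANs and the Luhn test used for the ISIN check digit. The function names are hypothetical, and the IBAN check does not validate country-specific lengths.

```python
def _to_digits(s):
    """Map letters to two-digit numbers (A=10 ... Z=35); digits pass through."""
    return "".join(str(int(ch, 36)) for ch in s)

def iban_checksum_ok(iban):
    """Move the first four chars to the end, expand letters, test mod 97 == 1."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    return int(_to_digits(s[4:] + s[:4])) % 97 == 1

def isin_checksum_ok(isin):
    """Expand letters to digits, then apply the Luhn check to the whole string."""
    s = isin.upper()
    if len(s) != 12 or not (s.isascii() and s.isalnum()):
        return False
    total = 0
    for i, ch in enumerate(reversed(_to_digits(s))):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

print(iban_checksum_ok("DE89370400440532013000"))  # True
print(isin_checksum_ok("US0378331005"))            # True
```

A failed checksum on a span labeled `iban` or `isin` is a useful signal that the detection may be a false positive or that its boundaries are off.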

General PII Labels (55)

Identity: first_name, last_name, user_name, age, gender, race_ethnicity, sexuality, religious_belief, political_view, marital_status, nationality, education_level, occupation, employment_status, language, blood_type, biometric_identifier

Contact: email, phone_number, fax_number, url

Address: street_address, city, county, state, country, postcode, coordinate

Dates: date, date_of_birth, date_time, time

Government IDs: ssn, national_id, tax_id

Financial: account_number, bank_routing_number, swift_bic, credit_debit_card, cvv, pin, password

Healthcare: medical_record_number, health_plan_beneficiary_number

Enterprise: customer_id, employee_id, unique_id, certificate_license_number, company_name

Vehicle: license_plate, vehicle_identifier

Digital: ipv4, ipv6, mac_address, device_identifier, api_key, http_cookie


Training

Parameter                  Value
learning_rate              1e-4
epochs                     3
batch_size                 2
grad_accumulation_steps    4 (effective batch = 8)
dtype                      bf16
weight_decay               0.0
Hardware                   Google Colab T4, High-RAM runtime
Runtime                    5.9 hours
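
The effective batch size is the per-step batch multiplied by the gradient accumulation steps. The sketch below also estimates the number of optimizer updates, assuming all 55,121 training examples are seen once per epoch and that a partial final batch still triggers an update; neither assumption is confirmed by the card, so the step counts are rough estimates only.

```python
import math

batch_size = 2
grad_accumulation_steps = 4
train_examples = 55_121  # from Key Specifications
epochs = 3

effective_batch = batch_size * grad_accumulation_steps
# Assumption: a partial final batch still counts as one optimizer update.
updates_per_epoch = math.ceil(train_examples / effective_batch)
total_updates = updates_per_epoch * epochs

print(effective_batch, updates_per_epoch, total_updates)
```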

Data Sources

Dataset                                        License       Records used
gretelai/synthetic_pii_finance_multilingual    Apache 2.0    50,346
nvidia/Nemotron-PII                            CC-BY-4.0     24,866 (finance-domain filtered)
Synthetic supplement                           Apache 2.0    390

Harmonized dataset: msdakot/fintech-privacy-pii


Limitations

  • Synthetic training data only — real financial documents may have different surface forms.
  • Name label simplification — Gretel's full-name spans are mapped to first_name; last_name recall is lower as a result (known v0 limitation).
  • Language coverage — 7 European languages only; not tested on others.
  • Not a legal compliance substitute — use alongside deterministic regex filters and human review.
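
One way to act on the last point is to run a deterministic pattern pass next to the model and take the union of both sets of spans. A minimal sketch, assuming spans are `(start, end, label)` tuples; the IBAN regex is a loose, hypothetical candidate pattern rather than a full validator, and `merge_spans` is not part of `opf`:

```python
import re

# Loose candidate pattern: country code, two check digits, 11-30 alphanumerics.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

def regex_spans(text):
    """Find IBAN-shaped substrings as (start, end, label) tuples."""
    return [(m.start(), m.end(), "iban") for m in IBAN_RE.finditer(text)]

def merge_spans(model_spans, rule_spans):
    """Union of both lists; a rule span is dropped if it overlaps a model span."""
    merged = list(model_spans)
    for r_start, r_end, r_label in rule_spans:
        overlaps = any(
            r_start < m_end and m_start < r_end
            for m_start, m_end, _ in model_spans
        )
        if not overlaps:
            merged.append((r_start, r_end, r_label))
    return sorted(merged)

text = "Pay Jane Doe via DE89370400440532013000."
# Hand-written stand-in for model output: it found a name but missed the IBAN.
model_spans = [(4, 8, "first_name")]
print(merge_spans(model_spans, regex_spans(text)))
# [(4, 8, 'first_name'), (17, 39, 'iban')]
```

Giving model spans priority on overlap keeps the model's labels and boundaries where it did fire, and lets the rules act only as a recall safety net.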

Credits

  • OpenAI — Privacy Filter architecture, modeling code, and opf CLI
  • Gretel AI — synthetic_pii_finance_multilingual dataset
  • NVIDIA — Nemotron-PII dataset
  • OpenMed — fine-tuning recipe and methodology

License

Apache 2.0. Training data: Apache 2.0 (Gretel), CC-BY-4.0 (Nemotron).


Citation

@misc{msdakot_fintech_privacy_filter_2026,
  author       = {msdakot},
  title        = {{fintech-privacy-filter-v0}: Fintech-specialized PII detection with 65 entity categories},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/msdakot/fintech-privacy-filter-v0}}
}