Tiny Specialized Encoder Models Beat Popular LLMs at PII Entity Extraction
TLDR
- Achieves a strong F1-score of 96.27% with just 68M parameters.
- Beats popular LLMs like GPT-4o-mini (78.69%) and DeepSeek-v4-Flash (84.89%).
- Can detect 50+ PII entity types in both structured and unstructured texts
- Can handle text from various domains like Healthcare, Finance, Legal, Cybersecurity etc.
- Can run on CPU with high throughput and low latency.
- MIT licensed and ready for real-world use.
Tiny Specialized Encoder Models for PII Entity Extraction
| Model | # Parameters | F1-Score |
|---|---|---|
| Ettin-17M-Nemotron-PII | 17M | 94.21 |
| Ettin-32M-Nemotron-PII | 32M | 95.73 |
| Ettin-68M-Nemotron-PII | 68M | 96.27 |
Usage
# First install Hugging Face transformers library
!pip install transformers
# Initialize and run the PII detection pipeline to extract PII entities
from transformers import pipeline
## Initialize the PII detection pipeline
ner = pipeline("ner", model="kalyan-ks/ettin-68m-nemotron-pii", aggregation_strategy="simple")
input_text = "Kalyan KS is from India. His email id is kalyan.ks@yahoo.com"
## Run the PII detection to extract PII entities
pii_entities = ner(input_text)
## Process the extracted PII entities
def format_pii_entities(entities, original_text):
    # Return an empty list when the pipeline found no entities
    if not entities:
        return []

    merged_entities = []
    # Sort entities by their start offset in the original text
    entities = sorted(entities, key=lambda x: x['start'])
    current_entity = {
        'start': entities[0]['start'],
        'end': entities[0]['end'],
        'label': entities[0]['entity_group'],
        'text': entities[0]['word']
    }
    for next_ent in entities[1:]:
        is_same_label = next_ent['entity_group'] == current_entity['label']
        is_adjacent = next_ent['start'] <= current_entity['end'] + 1
        if is_same_label and is_adjacent:
            # Merge adjacent fragments of the same entity type into one span
            current_entity['end'] = max(current_entity['end'], next_ent['end'])
            current_entity['text'] = original_text[current_entity['start']:current_entity['end']]
        else:
            merged_entities.append(clean_entity(current_entity))
            current_entity = {
                'start': next_ent['start'],
                'end': next_ent['end'],
                'label': next_ent['entity_group'],
                'text': next_ent['word']
            }
    merged_entities.append(clean_entity(current_entity))
    return merged_entities

def clean_entity(ent):
    # Strip surrounding whitespace and adjust the character offsets accordingly
    raw_text = ent['text']
    stripped_text = raw_text.strip()
    leading_spaces = len(raw_text) - len(raw_text.lstrip())
    return {
        'start': ent['start'] + leading_spaces,
        'end': ent['start'] + leading_spaces + len(stripped_text),
        'text': stripped_text,
        'label': ent['label']
    }
# Display the extracted PII entities
formatted_entities = format_pii_entities(pii_entities, input_text)
print(formatted_entities)
# Output
[{'start': 0, 'end': 9, 'text': 'Kalyan KS', 'label': 'first_name'}, {'start': 18, 'end': 23, 'text': 'India', 'label': 'country'}, {'start': 41, 'end': 60, 'text': 'kalyan.ks@yahoo.com', 'label': 'email'}]
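A common downstream step is masking the detected spans. Here is a minimal sketch (the `redact` helper and the mask format are illustrative, not part of the transformers API) that replaces each entity using the character offsets returned above:

```python
# Illustrative post-processing: mask detected PII spans in the original text.
def redact(text, entities, mask="[{label}]"):
    # Replace spans right-to-left so earlier character offsets stay valid
    for ent in sorted(entities, key=lambda e: e['start'], reverse=True):
        placeholder = mask.format(label=ent['label'].upper())
        text = text[:ent['start']] + placeholder + text[ent['end']:]
    return text

# Entities in the format produced by format_pii_entities above
entities = [
    {'start': 0, 'end': 9, 'text': 'Kalyan KS', 'label': 'first_name'},
    {'start': 18, 'end': 23, 'text': 'India', 'label': 'country'},
    {'start': 41, 'end': 60, 'text': 'kalyan.ks@yahoo.com', 'label': 'email'},
]
text = "Kalyan KS is from India. His email id is kalyan.ks@yahoo.com"
print(redact(text, entities))
# [FIRST_NAME] is from [COUNTRY]. His email id is [EMAIL]
```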
Supported PII Entity Types
These tiny specialized encoder models can detect the following 55 PII entity types.
PII entity types with description
| Entity | Description |
|---|---|
| account_number | Account Number |
| age | Age |
| api_key | API Key |
| bank_routing_number | Bank Routing Number |
| biometric_identifier | Biometric Identifier |
| blood_type | Blood Type |
| certificate_license_number | Certificate or License Number |
| city | City |
| company_name | Company Name |
| coordinate | Geographic Coordinate |
| country | Country |
| county | County |
| credit_debit_card | Credit or Debit Card Number |
| customer_id | Customer ID |
| cvv | Card Verification Value (CVV) |
| date | Date |
| date_of_birth | Date of Birth |
| date_time | Date and Time |
| device_identifier | Device Identifier |
| education_level | Education Level |
| email | Email Address |
| employee_id | Employee ID |
| employment_status | Employment Status |
| fax_number | Fax Number |
| first_name | First Name |
| gender | Gender |
| health_plan_beneficiary_number | Health Plan Beneficiary Number |
| http_cookie | HTTP Cookie |
| ipv4 | IPv4 Address |
| ipv6 | IPv6 Address |
| language | Language |
| last_name | Last Name |
| license_plate | Vehicle License Plate |
| mac_address | MAC Address |
| medical_record_number | Medical Record Number |
| national_id | National Identification Number |
| occupation | Occupation |
| password | Password |
| phone_number | Phone Number |
| pin | Personal Identification Number (PIN) |
| political_view | Political View |
| postcode | Postcode / Zip Code |
| race_ethnicity | Race or Ethnicity |
| religious_belief | Religious Belief |
| sexuality | Sexuality / Sexual Orientation |
| ssn | Social Security Number |
| state | State |
| street_address | Street Address |
| swift_bic | SWIFT / BIC Code |
| tax_id | Tax Identification Number |
| time | Time |
| unique_id | Unique Identifier |
| url | URL / Web Address |
| user_name | Username |
| vehicle_identifier | Vehicle Identification Number (VIN) |
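In practice you often act on only a subset of these 55 labels. A minimal sketch of filtering pipeline output down to a high-sensitivity subset (the chosen label set here is an example policy, not something defined by the models):

```python
# Illustrative subset of the supported labels; adjust to your own policy.
HIGH_SENSITIVITY = {"ssn", "credit_debit_card", "cvv", "password", "api_key", "national_id"}

def filter_sensitive(entities, allowed=HIGH_SENSITIVITY):
    """Keep only entities whose label is in the allowed set."""
    return [e for e in entities if e["label"] in allowed]

entities = [
    {"text": "kalyan.ks@yahoo.com", "label": "email"},
    {"text": "123-45-6789", "label": "ssn"},
]
print(filter_sensitive(entities))
# [{'text': '123-45-6789', 'label': 'ssn'}]
```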
Motivation
Issues with prompt-based LLMs for PII Entity Extraction
- LLMs like GPT-4o-mini and DeepSeek-v4-Flash are general-purpose models that are not specifically trained for PII entity extraction, so their prompt-based performance on this task is limited.
- High API costs at large scale.
- High latency due to large model sizes.
Issues with fine-tuned open-source LLMs for PII Entity Extraction
- High F1-scores, but training and serving require high-end GPUs.
- High latency and serving costs.
PII entity extraction is fundamentally a token-level classification task. Existing research [2] shows that encoder models outperform decoder models on classification tasks. Based on this, we focused on training tiny specialized models for PII entity extraction on top of Ettin encoder models.
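To make the token-level framing concrete, here is a hand-made illustration of how per-token BIO tags are collapsed back into entity spans (the tokens and tags below are illustrative examples, not actual model output):

```python
# Token-level PII classification assigns a BIO tag to every token.
tokens = ["Kalyan", "KS", "is", "from", "India"]
tags = ["B-first_name", "I-first_name", "O", "O", "B-country"]

def bio_to_spans(tokens, tags):
    """Collapse per-token BIO tags into (label, text) entity spans."""
    spans = []
    current = None  # (label, list of tokens in the current span)
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(tok)
        else:  # an "O" tag (or inconsistent I- tag) ends the current span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

print(bio_to_spans(tokens, tags))
# [('first_name', 'Kalyan KS'), ('country', 'India')]
```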
Why Ettin Encoder Models?
Ettin encoder models are SOTA transformer models with the following key features:
- Long context window (8K tokens).
- Well suited for classification tasks including token classification.
- Built using modern transformer design (RoPE, GLU activations and prenorm layers).
- SOTA encoder models that beat ModernBERT.
A brief summary of the tiny Ettin encoder models (17M-68M):
| Model | Parameters | Layers | Context | Key Aspect |
|---|---|---|---|---|
| Ettin-Encoder-17M | 17M | 7 | 8192 tokens | Lightweight Inference |
| Ettin-Encoder-32M | 32M | 10 | 8192 tokens | Speed-performance balance |
| Ettin-Encoder-68M | 68M | 19 | 8192 tokens | Strong performance |
Evaluation Results
The evaluation is done using a 10k-sample test set drawn from the Nemotron PII test set. We compared our tiny specialized encoder models against two popular LLMs: GPT-4o-mini (a proprietary LLM) and DeepSeek-v4-Flash (an open-source LLM).
Here are the evaluation results:
| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| GPT-4o-mini | 95.21 | 67.05 | 78.69 |
| DeepSeek-v4-Flash | 95.60 | 76.33 | 84.89 |
| Ettin-17M-Nemotron-PII | 94.48 | 93.93 | 94.21 |
| Ettin-32M-Nemotron-PII | 95.96 | 95.49 | 95.73 |
| Ettin-68M-Nemotron-PII | 96.35 | 96.19 | 96.27 |
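The F1-scores in the table are the harmonic mean of precision and recall; a quick check reproduces the reported numbers from the other two columns:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the reported F1-scores from the precision/recall columns
print(round(f1(95.21, 67.05), 2))  # 78.69 (GPT-4o-mini)
print(round(f1(95.60, 76.33), 2))  # 84.89 (DeepSeek-v4-Flash)
print(round(f1(96.35, 96.19), 2))  # 96.27 (Ettin-68M-Nemotron-PII)
```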
Key Insights from Evaluation Results
GPT-4o-mini achieved high precision (95.21%) but relatively low recall (67.05%).
DeepSeek-v4-Flash achieved a better recall over GPT-4o-mini (76.33% vs. 67.05%) while maintaining high precision, resulting in a higher F1-score (84.89%).
For both of these prompt-based general-purpose LLMs, (i) the high precision shows that most of the entities they flag as PII are indeed PII, while (ii) the low recall shows that they miss a substantial number of PII entities. Using these models for PII entity extraction is therefore not recommended: every missed entity is potential sensitive-data leakage.
All three fine-tuned Ettin encoder models (17M-68M parameters) outperformed the prompt-based general-purpose LLMs (GPT-4o-mini and DeepSeek-v4-Flash) by a wide margin. This demonstrates the effectiveness of task-specific encoder models for PII entity extraction.
Ettin-17M-Nemotron-PII, with both precision and recall above 93%, delivers balanced, production-ready performance, indicating consistent PII entity extraction even at a very small model size.
Increasing model size within the Ettin encoder family consistently improves performance: 17M → 32M → 68M shows steady gains in precision, recall, and F1-score.
Ettin-68M-Nemotron-PII achieved the best overall performance: Precision: 96.35%, Recall: 96.19% and F1-score: 96.27%. The near-equal precision and recall values indicate an excellent balance between minimizing false positives and false negatives.
To summarize, the evaluation results show that specialized lightweight encoder models can outperform larger general-purpose prompt-based models at PII entity extraction while potentially requiring lower inference cost and latency.
Top Performing PII Entity Types
Here are the top-performing PII entity types for the Ettin-68M-Nemotron-PII model:
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| biometric_identifier | 0.9963 | 0.9966 | 0.9964 |
| date_of_birth | 0.9952 | 0.9963 | 0.9957 |
| api_key | 0.9932 | 0.9978 | 0.9955 |
| mac_address | 0.9929 | 0.9965 | 0.9947 |
| email | 0.9942 | 0.9942 | 0.9942 |
| ipv4 | 0.9950 | 0.9933 | 0.9941 |
| medical_record_number | 0.9952 | 0.9904 | 0.9928 |
| health_plan_beneficiary_number | 0.9924 | 0.9925 | 0.9924 |
| vehicle_identifier | 0.9867 | 0.9977 | 0.9922 |
| bank_routing_number | 0.9967 | 0.9862 | 0.9914 |
Challenging PII Entity Types
Here are the most challenging PII entity types for the Ettin-68M-Nemotron-PII model:
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| occupation | 0.7200 | 0.5493 | 0.6232 |
| time | 0.8094 | 0.7781 | 0.7934 |
| age | 0.8333 | 0.9273 | 0.8778 |
| political_view | 0.8533 | 0.9247 | 0.8876 |
| state | 0.9077 | 0.8792 | 0.8932 |
| fax_number | 0.9047 | 0.9013 | 0.9030 |
| company_name | 0.9048 | 0.9072 | 0.9060 |
| national_id | 0.8995 | 0.9224 | 0.9108 |
| education_level | 0.9269 | 0.8973 | 0.9118 |
| race_ethnicity | 0.9027 | 0.9388 | 0.9204 |
Limitations
- Language: These models work well only on English-language text.
- Challenging PII entity types: Some entity types, such as occupation, still have low F1-scores.