Tiny Specialized Encoder Models Beat Popular LLMs at PII Entity Extraction

Community Article Published May 12, 2026

TL;DR

  • Achieves a strong F1-score of 96.27% with just 68M parameters.
  • Beats popular LLMs such as GPT-4o-mini (78.69%) and DeepSeek-v4-Flash (84.89%).
  • Detects 50+ PII entity types in both structured and unstructured text.
  • Handles text from a range of domains, including healthcare, finance, legal, and cybersecurity.
  • Runs on CPU with high throughput and low latency.
  • MIT licensed and ready for real-world use.

Tiny Specialized Encoder Models for PII Entity Extraction

| Model | # Parameters | F1-Score |
|---|---|---|
| Ettin-17M-Nemotron-PII | 17M | 94.21 |
| Ettin-32M-Nemotron-PII | 32M | 95.73 |
| Ettin-68M-Nemotron-PII | 68M | 96.27 |

Usage


First, install the Hugging Face Transformers library:

```shell
pip install transformers
```

Then initialize the PII detection pipeline, run it, and post-process its output:

```python
from transformers import pipeline

# Initialize the PII detection pipeline
ner = pipeline(
    "ner",
    model="kalyan-ks/ettin-68m-nemotron-pii",
    aggregation_strategy="simple",
)

input_text = "Kalyan KS is from India. His email id is kalyan.ks@yahoo.com"

# Run the PII detection to extract PII entities
pii_entities = ner(input_text)

# Merge adjacent same-label fragments into single entities
def format_pii_entities(entities, original_text):
    if not entities:
        return []

    merged_entities = []
    entities = sorted(entities, key=lambda x: x['start'])

    current_entity = {
        'start': entities[0]['start'],
        'end': entities[0]['end'],
        'label': entities[0]['entity_group'],
        'text': entities[0]['word']
    }

    for next_ent in entities[1:]:
        is_same_label = next_ent['entity_group'] == current_entity['label']
        is_adjacent = next_ent['start'] <= current_entity['end'] + 1

        if is_same_label and is_adjacent:
            # Extend the current entity and re-slice its text from the original string
            current_entity['end'] = max(current_entity['end'], next_ent['end'])
            current_entity['text'] = original_text[current_entity['start']:current_entity['end']]
        else:
            merged_entities.append(clean_entity(current_entity))
            current_entity = {
                'start': next_ent['start'],
                'end': next_ent['end'],
                'label': next_ent['entity_group'],
                'text': next_ent['word']
            }

    merged_entities.append(clean_entity(current_entity))
    return merged_entities

# Strip leading/trailing whitespace and adjust the character offsets accordingly
def clean_entity(ent):
    raw_text = ent['text']
    stripped_text = raw_text.strip()
    leading_spaces = len(raw_text) - len(raw_text.lstrip())

    return {
        'start': ent['start'] + leading_spaces,
        'end': ent['start'] + leading_spaces + len(stripped_text),
        'text': stripped_text,
        'label': ent['label']
    }

# Display the extracted PII entities
formatted_entities = format_pii_entities(pii_entities, input_text)
print(formatted_entities)
```

Output:

```
[{'start': 0, 'end': 9, 'text': 'Kalyan KS', 'label': 'first_name'}, {'start': 18, 'end': 23, 'text': 'India', 'label': 'country'}, {'start': 41, 'end': 60, 'text': 'kalyan.ks@yahoo.com', 'label': 'email'}]
```
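Because each extracted entity carries character offsets into the original text, redacting the detected PII is a simple slice-and-replace. A minimal sketch, using the example output above as hardcoded input (the `redact` helper is illustrative, not part of any library):

```python
# Redact detected PII by replacing each span with a [LABEL] placeholder.
# Spans are processed right-to-left so earlier offsets stay valid after each replacement.
def redact(text, entities):
    for ent in sorted(entities, key=lambda e: e['start'], reverse=True):
        text = text[:ent['start']] + f"[{ent['label'].upper()}]" + text[ent['end']:]
    return text

text = "Kalyan KS is from India. His email id is kalyan.ks@yahoo.com"
entities = [
    {'start': 0, 'end': 9, 'text': 'Kalyan KS', 'label': 'first_name'},
    {'start': 18, 'end': 23, 'text': 'India', 'label': 'country'},
    {'start': 41, 'end': 60, 'text': 'kalyan.ks@yahoo.com', 'label': 'email'},
]

print(redact(text, entities))
# [FIRST_NAME] is from [COUNTRY]. His email id is [EMAIL]
```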

Supported PII Entity Types

These tiny specialized encoder models can detect the following 55 PII entity types:

| Entity | Description |
|---|---|
| account_number | Account Number |
| age | Age |
| api_key | API Key |
| bank_routing_number | Bank Routing Number |
| biometric_identifier | Biometric Identifier |
| blood_type | Blood Type |
| certificate_license_number | Certificate or License Number |
| city | City |
| company_name | Company Name |
| coordinate | Geographic Coordinate |
| country | Country |
| county | County |
| credit_debit_card | Credit or Debit Card Number |
| customer_id | Customer ID |
| cvv | Card Verification Value (CVV) |
| date | Date |
| date_of_birth | Date of Birth |
| date_time | Date and Time |
| device_identifier | Device Identifier |
| education_level | Education Level |
| email | Email Address |
| employee_id | Employee ID |
| employment_status | Employment Status |
| fax_number | Fax Number |
| first_name | First Name |
| gender | Gender |
| health_plan_beneficiary_number | Health Plan Beneficiary Number |
| http_cookie | HTTP Cookie |
| ipv4 | IPv4 Address |
| ipv6 | IPv6 Address |
| language | Language |
| last_name | Last Name |
| license_plate | Vehicle License Plate |
| mac_address | MAC Address |
| medical_record_number | Medical Record Number |
| national_id | National Identification Number |
| occupation | Occupation |
| password | Password |
| phone_number | Phone Number |
| pin | Personal Identification Number (PIN) |
| political_view | Political View |
| postcode | Postcode / ZIP Code |
| race_ethnicity | Race or Ethnicity |
| religious_belief | Religious Belief |
| sexuality | Sexuality / Sexual Orientation |
| ssn | Social Security Number |
| state | State |
| street_address | Street Address |
| swift_bic | SWIFT / BIC Code |
| tax_id | Tax Identification Number |
| time | Time |
| unique_id | Unique Identifier |
| url | URL / Web Address |
| user_name | Username |
| vehicle_identifier | Vehicle Identification Number (VIN) |
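In practice you often care about only a subset of these types. Since the entity names above are plain strings, filtering the pipeline output down to, say, financial identifiers is a one-liner. A small sketch with a hardcoded sample (the `FINANCIAL_TYPES` grouping is an illustrative choice, not something the model defines):

```python
# A hand-picked subset of the supported entity types (illustrative grouping).
FINANCIAL_TYPES = {"account_number", "bank_routing_number", "credit_debit_card",
                   "cvv", "swift_bic", "tax_id", "ssn"}

def filter_entities(entities, allowed_types):
    # Keep only entities whose label is in the allowed set
    return [e for e in entities if e['label'] in allowed_types]

sample = [
    {'text': '4111 1111 1111 1111', 'label': 'credit_debit_card'},
    {'text': 'India', 'label': 'country'},
    {'text': '123-45-6789', 'label': 'ssn'},
]
print(filter_entities(sample, FINANCIAL_TYPES))
```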

Motivation

Issues with prompt-based LLMs for PII Entity Extraction

  • LLMs such as GPT-4o-mini and DeepSeek-v4-Flash are general-purpose models, not specifically trained for PII entity extraction, so their prompt-based performance is limited.
  • High API costs at large-scale usage.
  • High latency because of large model sizes.

Issues with fine-tuned open-source LLMs for PII Entity Extraction

  • High F1-scores, but training and serving require high-end GPUs.
  • High latency and serving costs.

PII entity extraction is fundamentally a token-level classification task. Existing research [2] shows that encoder models outperform decoder models at classification tasks. Based on this, we focused on training tiny specialized models for PII entity extraction on top of the Ettin encoder models.
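Token-level classification means every token receives a BIO-style label derived from character-span annotations. A simplified sketch of that alignment step, using whitespace tokenization for illustration (real training would use the model's subword tokenizer; `char_spans_to_bio` is a hypothetical helper, not a library function):

```python
import re

def char_spans_to_bio(text, spans):
    """Convert character-offset PII spans to per-token BIO labels.
    `spans` is a list of (start, end, label) tuples."""
    labels = []
    for m in re.finditer(r"\S+", text):  # whitespace tokenization, for illustration only
        tok_start = m.start()
        label = "O"
        for start, end, ent in spans:
            if start <= tok_start < end:
                # "B-" marks the first token of an entity, "I-" a continuation
                label = ("B-" if tok_start == start else "I-") + ent
                break
        labels.append((m.group(), label))
    return labels

text = "Kalyan KS is from India."
spans = [(0, 9, "first_name"), (18, 23, "country")]
print(char_spans_to_bio(text, spans))
```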

Why Ettin Encoder Models?

Ettin encoder models are SOTA transformer models with the following key features:

  • Long context window (8K tokens).
  • Well suited for classification tasks, including token classification.
  • Built on a modern transformer design (RoPE, GLU activations, and pre-norm layers).
  • SOTA encoder models that beat ModernBERT.

A brief summary of the tiny Ettin encoder models (17M-68M):

| Model | Parameters | Layers | Context | Key Aspect |
|---|---|---|---|---|
| Ettin-Encoder-17M | 17M | 7 | 8192 tokens | Lightweight inference |
| Ettin-Encoder-32M | 32M | 10 | 8192 tokens | Speed-performance balance |
| Ettin-Encoder-68M | 68M | 19 | 8192 tokens | Strong performance |

Evaluation Results

The evaluation is done on a 10k-sample test set drawn from the Nemotron PII test set. We compared our tiny specialized encoder models with two popular LLMs: GPT-4o-mini (proprietary) and DeepSeek-v4-Flash (open source).

Here are the evaluation results:

| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| GPT-4o-mini | 95.21 | 67.05 | 78.69 |
| DeepSeek-v4-Flash | 95.60 | 76.33 | 84.89 |
| Ettin-17M-Nemotron-PII | 94.48 | 93.93 | 94.21 |
| Ettin-32M-Nemotron-PII | 95.96 | 95.49 | 95.73 |
| Ettin-68M-Nemotron-PII | 96.35 | 96.19 | 96.27 |
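F1 is the harmonic mean of precision and recall, which is why GPT-4o-mini's high precision cannot compensate for its low recall. A quick sanity check of some table values (the last digit may not always reproduce exactly, since the tabulated precision/recall are themselves rounded):

```python
# F1-score = harmonic mean of precision and recall
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(95.21, 67.05), 2))  # 78.69  (GPT-4o-mini)
print(round(f1(95.60, 76.33), 2))  # 84.89  (DeepSeek-v4-Flash)
print(round(f1(96.35, 96.19), 2))  # 96.27  (Ettin-68M-Nemotron-PII)
```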

Key Insights from Evaluation Results

  1. GPT-4o-mini achieved high precision (95.21%) but relatively low recall (67.05%).

  2. DeepSeek-v4-Flash achieved a better recall over GPT-4o-mini (76.33% vs. 67.05%) while maintaining high precision, resulting in a higher F1-score (84.89%).

  3. For both of these prompt-based general-purpose LLMs, (i) the high precision shows that most of the entities they flag as PII really are PII, while (ii) the low recall shows that they miss a substantial number of PII entities. Using these models for PII entity extraction is therefore not recommended, because the missed entities translate directly into sensitive data leakage.

  4. All three fine-tuned Ettin-Encoder models (17M-68M parameters) outperformed the prompt-based general-purpose LLMs (GPT-4o-mini and DeepSeek-v4-Flash) by a wide margin. This demonstrates the effectiveness of task-specific encoder models for PII entity extraction.

  5. Ettin-17M-Nemotron-PII, with both precision and recall above 93%, delivers balanced, production-ready performance, indicating consistent PII entity extraction even at a very small model size.

  6. Increasing model size within the Ettin-Encoder family consistently improves performance: 17M → 32M → 68M shows steady gains in precision, recall, and F1-score.

  7. Ettin-68M-Nemotron-PII achieved the best overall performance: Precision: 96.35%, Recall: 96.19% and F1-score: 96.27%. The near-equal precision and recall values indicate an excellent balance between minimizing false positives and false negatives.

To summarize, the evaluation results show that specialized lightweight encoder models can outperform larger general-purpose prompt-based models at PII entity extraction while potentially requiring lower inference cost and latency.

Top Performing PII Entity Types

Here are the top-performing PII entity types for the Ettin-68M-Nemotron-PII model:

| Entity | Precision | Recall | F1 |
|---|---|---|---|
| biometric_identifier | 0.9963 | 0.9966 | 0.9964 |
| date_of_birth | 0.9952 | 0.9963 | 0.9957 |
| api_key | 0.9932 | 0.9978 | 0.9955 |
| mac_address | 0.9929 | 0.9965 | 0.9947 |
| email | 0.9942 | 0.9942 | 0.9942 |
| ipv4 | 0.9950 | 0.9933 | 0.9941 |
| medical_record_number | 0.9952 | 0.9904 | 0.9928 |
| health_plan_beneficiary_number | 0.9924 | 0.9925 | 0.9924 |
| vehicle_identifier | 0.9867 | 0.9977 | 0.9922 |
| bank_routing_number | 0.9967 | 0.9862 | 0.9914 |

Challenging PII Entity Types

Here are the most challenging PII entity types for the Ettin-68M-Nemotron-PII model:

| Entity | Precision | Recall | F1 |
|---|---|---|---|
| occupation | 0.7200 | 0.5493 | 0.6232 |
| time | 0.8094 | 0.7781 | 0.7934 |
| age | 0.8333 | 0.9273 | 0.8778 |
| political_view | 0.8533 | 0.9247 | 0.8876 |
| state | 0.9077 | 0.8792 | 0.8932 |
| fax_number | 0.9047 | 0.9013 | 0.9030 |
| company_name | 0.9048 | 0.9072 | 0.9060 |
| national_id | 0.8995 | 0.9224 | 0.9108 |
| education_level | 0.9269 | 0.8973 | 0.9118 |
| race_ethnicity | 0.9027 | 0.9388 | 0.9204 |
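For weaker types such as occupation, one pragmatic mitigation is to drop low-confidence predictions, since the Transformers NER pipeline returns a per-entity `score`. A sketch with hardcoded sample scores; the 0.8 threshold is an arbitrary assumption you would tune on your own data:

```python
# Drop predictions below a confidence threshold, trading some recall for precision.
def filter_by_score(entities, threshold=0.8):
    return [e for e in entities if e['score'] >= threshold]

sample = [
    {'text': 'engineer', 'label': 'occupation', 'score': 0.55},
    {'text': 'kalyan.ks@yahoo.com', 'label': 'email', 'score': 0.99},
]
print(filter_by_score(sample))
```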

Limitations

  1. Language: These models work well only on English-language text.
  2. Challenging PII entity types: Some entity types, such as occupation, have a low F1-score.

References

  1. Ettin Encoder Family Models
  2. Ettin Encoder Family Models Paper
  3. Nemotron PII Dataset
