Tiny Specialized Encoder Models Beat Popular LLMs at PII Entity Extraction
TLDR
- Achieves a strong F1-score of 96.27% with just 68M parameters.
- Beats popular LLMs like GPT-4o-mini (78.69%) and DeepSeek-v4-Flash (84.89%).
- Can detect 50+ PII entity types in both structured and unstructured texts
- Can handle text from various domains like Healthcare, Finance, Legal, Cybersecurity etc.
- Can run on CPU with high throughput and low latency.
- MIT licensed and ready for real-world use.
Tiny Specialized Encoder Models for PII Entity Extraction
| Model | # Parameters | F1-Score |
|---|---|---|
| Ettin-17M-Nemotron-PII | 17M | 94.21 |
| Ettin-32M-Nemotron-PII | 32M | 95.73 |
| Ettin-68M-Nemotron-PII | 68M | 96.27 |
Usage
# First install Hugging Face transformers library
!pip install transformers
# Initialize and run the PII detection pipeline to extract PII entities
from transformers import pipeline
## Initialize the PII detection pipeline
ner = pipeline("ner", model="kalyan-ks/ettin-68m-nemotron-pii", aggregation_strategy="simple")
input_text = "Kalyan KS is from India. His email id is kalyan.ks@yahoo.com"
## Run the PII detection to extract PII entities
pii_entities = ner(input_text)
## Process the extracted PII entities
def format_pii_entities(entities, original_text):
    # Return an empty list when the pipeline found no entities
    if not entities:
        return []

    merged_entities = []
    # Sort entities by their start offset in the original text
    entities = sorted(entities, key=lambda x: x['start'])
    current_entity = {
        'start': entities[0]['start'],
        'end': entities[0]['end'],
        'label': entities[0]['entity_group'],
        'text': entities[0]['word']
    }
    for next_ent in entities[1:]:
        is_same_label = next_ent['entity_group'] == current_entity['label']
        is_adjacent = next_ent['start'] <= current_entity['end'] + 1
        if is_same_label and is_adjacent:
            # Merge adjacent fragments of the same entity type into one span
            current_entity['end'] = max(current_entity['end'], next_ent['end'])
            current_entity['text'] = original_text[current_entity['start']:current_entity['end']]
        else:
            merged_entities.append(clean_entity(current_entity))
            current_entity = {
                'start': next_ent['start'],
                'end': next_ent['end'],
                'label': next_ent['entity_group'],
                'text': next_ent['word']
            }
    merged_entities.append(clean_entity(current_entity))
    return merged_entities

def clean_entity(ent):
    # Strip surrounding whitespace and adjust the character offsets accordingly
    raw_text = ent['text']
    stripped_text = raw_text.strip()
    leading_spaces = len(raw_text) - len(raw_text.lstrip())
    return {
        'start': ent['start'] + leading_spaces,
        'end': ent['start'] + leading_spaces + len(stripped_text),
        'text': stripped_text,
        'label': ent['label']
    }
# Display the extracted PII entities
formatted_entities = format_pii_entities(pii_entities, input_text)
print(formatted_entities)
# Output
[{'start': 0, 'end': 9, 'text': 'Kalyan KS', 'label': 'first_name'}, {'start': 18, 'end': 23, 'text': 'India', 'label': 'country'}, {'start': 41, 'end': 60, 'text': 'kalyan.ks@yahoo.com', 'label': 'email'}]
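A common downstream step is masking the detected spans. Here is a minimal sketch (the `redact` helper and the mask format are illustrative, not part of the transformers API) that replaces each entity using the character offsets returned above:

```python
# Illustrative post-processing: mask detected PII spans in the original text.
def redact(text, entities, mask="[{label}]"):
    # Replace spans right-to-left so earlier character offsets stay valid
    for ent in sorted(entities, key=lambda e: e['start'], reverse=True):
        placeholder = mask.format(label=ent['label'].upper())
        text = text[:ent['start']] + placeholder + text[ent['end']:]
    return text

# Entities in the format produced by format_pii_entities above
entities = [
    {'start': 0, 'end': 9, 'text': 'Kalyan KS', 'label': 'first_name'},
    {'start': 18, 'end': 23, 'text': 'India', 'label': 'country'},
    {'start': 41, 'end': 60, 'text': 'kalyan.ks@yahoo.com', 'label': 'email'},
]
text = "Kalyan KS is from India. His email id is kalyan.ks@yahoo.com"
print(redact(text, entities))
# [FIRST_NAME] is from [COUNTRY]. His email id is [EMAIL]
```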
Supported PII Entity Types
These tiny specialized encoder models can detect the following 55 PII entity types.
PII entity types with description
| Entity | Description |
|---|---|
| account_number | Account Number |
| age | Age |
| api_key | API Key |
| bank_routing_number | Bank Routing Number |
| biometric_identifier | Biometric Identifier |
| blood_type | Blood Type |
| certificate_license_number | Certificate or License Number |
| city | City |
| company_name | Company Name |
| coordinate | Geographic Coordinate |
| country | Country |
| county | County |
| credit_debit_card | Credit or Debit Card Number |
| customer_id | Customer ID |
| cvv | Card Verification Value (CVV) |
| date | Date |
| date_of_birth | Date of Birth |
| date_time | Date and Time |
| device_identifier | Device Identifier |
| education_level | Education Level |
| email | Email Address |
| employee_id | Employee ID |
| employment_status | Employment Status |
| fax_number | Fax Number |
| first_name | First Name |
| gender | Gender |
| health_plan_beneficiary_number | Health Plan Beneficiary Number |
| http_cookie | HTTP Cookie |
| ipv4 | IPv4 Address |
| ipv6 | IPv6 Address |
| language | Language |
| last_name | Last Name |
| license_plate | Vehicle License Plate |
| mac_address | MAC Address |
| medical_record_number | Medical Record Number |
| national_id | National Identification Number |
| occupation | Occupation |
| password | Password |
| phone_number | Phone Number |
| pin | Personal Identification Number (PIN) |
| political_view | Political View |
| postcode | Postcode / Zip Code |
| race_ethnicity | Race or Ethnicity |
| religious_belief | Religious Belief |
| sexuality | Sexuality / Sexual Orientation |
| ssn | Social Security Number |
| state | State |
| street_address | Street Address |
| swift_bic | SWIFT / BIC Code |
| tax_id | Tax Identification Number |
| time | Time |
| unique_id | Unique Identifier |
| url | URL / Web Address |
| user_name | Username |
| vehicle_identifier | Vehicle Identification Number (VIN) |
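In practice you often act on only a subset of these 55 labels. A minimal sketch of filtering pipeline output down to a high-sensitivity subset (the chosen label set here is an example policy, not something defined by the models):

```python
# Illustrative subset of the supported labels; adjust to your own policy.
HIGH_SENSITIVITY = {"ssn", "credit_debit_card", "cvv", "password", "api_key", "national_id"}

def filter_sensitive(entities, allowed=HIGH_SENSITIVITY):
    """Keep only entities whose label is in the allowed set."""
    return [e for e in entities if e["label"] in allowed]

entities = [
    {"text": "kalyan.ks@yahoo.com", "label": "email"},
    {"text": "123-45-6789", "label": "ssn"},
]
print(filter_sensitive(entities))
# [{'text': '123-45-6789', 'label': 'ssn'}]
```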
Motivation
Issues with prompt-based LLMs for PII Entity Extraction
- LLMs like GPT-4o-mini and DeepSeek-v4-Flash are general-purpose models that are not specifically trained for PII entity extraction, so their prompt-based performance on this task is limited.
- High API costs at large scale.
- High latency due to large model sizes.
Issues with fine-tuned open-source LLMs for PII Entity Extraction
- High F1-scores, but training and serving require high-end GPUs.
- High latency and serving costs.
PII entity extraction is fundamentally a token-level classification task. Existing research [2] shows that encoder models outperform decoder models on classification tasks. Based on this, we focused on training tiny specialized models for PII entity extraction on top of Ettin encoder models.
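To make the token-level framing concrete, here is a hand-made illustration of how per-token BIO tags are collapsed back into entity spans (the tokens and tags below are illustrative examples, not actual model output):

```python
# Token-level PII classification assigns a BIO tag to every token.
tokens = ["Kalyan", "KS", "is", "from", "India"]
tags = ["B-first_name", "I-first_name", "O", "O", "B-country"]

def bio_to_spans(tokens, tags):
    """Collapse per-token BIO tags into (label, text) entity spans."""
    spans = []
    current = None  # (label, list of tokens in the current span)
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(tok)
        else:  # an "O" tag (or inconsistent I- tag) ends the current span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

print(bio_to_spans(tokens, tags))
# [('first_name', 'Kalyan KS'), ('country', 'India')]
```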
Why Ettin Encoder Models?
Ettin encoder models are SOTA transformer models with the following key features:
- Long context window (8K tokens).
- Well suited for classification tasks including token classification.
- Built using modern transformer design (RoPE, GLU activations and prenorm layers).
- SOTA encoder models that beat ModernBERT.
A brief summary of the tiny Ettin encoder models (17M-68M):
| Model | Parameters | Layers | Context | Key Aspect |
|---|---|---|---|---|
| Ettin-Encoder-17M | 17M | 7 | 8192 tokens | Lightweight Inference |
| Ettin-Encoder-32M | 32M | 10 | 8192 tokens | Speed-performance balance |
| Ettin-Encoder-68M | 68M | 19 | 8192 tokens | Strong performance |
Evaluation Results
The evaluation is done using a 10k-sample test set drawn from the Nemotron PII test set. We compared our tiny specialized encoder models against two popular LLMs: GPT-4o-mini (a proprietary LLM) and DeepSeek-v4-Flash (an open-source LLM).
Here are the evaluation results:
| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| GPT-4o-mini | 95.21 | 67.05 | 78.69 |
| DeepSeek-v4-Flash | 95.60 | 76.33 | 84.89 |
| Ettin-17M-Nemotron-PII | 94.48 | 93.93 | 94.21 |
| Ettin-32M-Nemotron-PII | 95.96 | 95.49 | 95.73 |
| Ettin-68M-Nemotron-PII | 96.35 | 96.19 | 96.27 |
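The F1-scores in the table are the harmonic mean of precision and recall; a quick check reproduces the reported numbers from the other two columns:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the reported F1-scores from the precision/recall columns
print(round(f1(95.21, 67.05), 2))  # 78.69 (GPT-4o-mini)
print(round(f1(95.60, 76.33), 2))  # 84.89 (DeepSeek-v4-Flash)
print(round(f1(96.35, 96.19), 2))  # 96.27 (Ettin-68M-Nemotron-PII)
```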
Key Insights from Evaluation Results
GPT-4o-mini achieved high precision (95.21%) but relatively low recall (67.05%).
DeepSeek-v4-Flash achieved a better recall over GPT-4o-mini (76.33% vs. 67.05%) while maintaining high precision, resulting in a higher F1-score (84.89%).
For both of these prompt-based general-purpose LLMs, (i) the high precision shows that most of the entities they flag as PII are indeed PII, while (ii) the low recall shows that they miss a substantial number of PII entities. Using these models for PII entity extraction is therefore not recommended: every missed entity is potential sensitive-data leakage.
All three fine-tuned Ettin encoder models (17M-68M parameters) outperformed the prompt-based general-purpose LLMs (GPT-4o-mini and DeepSeek-v4-Flash) by a wide margin. This demonstrates the effectiveness of task-specific encoder models for PII entity extraction.
Ettin-17M-Nemotron-PII, with both precision and recall above 93%, delivers balanced, production-ready performance, indicating consistent PII entity extraction even at a very small model size.
Increasing model size within the Ettin encoder family consistently improves performance: 17M → 32M → 68M shows steady gains in precision, recall, and F1-score.
Ettin-68M-Nemotron-PII achieved the best overall performance: Precision: 96.35%, Recall: 96.19% and F1-score: 96.27%. The near-equal precision and recall values indicate an excellent balance between minimizing false positives and false negatives.
To summarize, the evaluation results show that specialized lightweight encoder models can outperform larger general-purpose prompt-based models at PII entity extraction while potentially requiring lower inference cost and latency.
Top Performing PII Entity Types
Here are the top-performing PII entity types for the Ettin-68M-Nemotron-PII model:
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| biometric_identifier | 0.9963 | 0.9966 | 0.9964 |
| date_of_birth | 0.9952 | 0.9963 | 0.9957 |
| api_key | 0.9932 | 0.9978 | 0.9955 |
| mac_address | 0.9929 | 0.9965 | 0.9947 |
| email | 0.9942 | 0.9942 | 0.9942 |
| ipv4 | 0.9950 | 0.9933 | 0.9941 |
| medical_record_number | 0.9952 | 0.9904 | 0.9928 |
| health_plan_beneficiary_number | 0.9924 | 0.9925 | 0.9924 |
| vehicle_identifier | 0.9867 | 0.9977 | 0.9922 |
| bank_routing_number | 0.9967 | 0.9862 | 0.9914 |
Challenging PII Entity Types
Here are the most challenging PII entity types for the Ettin-68M-Nemotron-PII model:
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| occupation | 0.7200 | 0.5493 | 0.6232 |
| time | 0.8094 | 0.7781 | 0.7934 |
| age | 0.8333 | 0.9273 | 0.8778 |
| political_view | 0.8533 | 0.9247 | 0.8876 |
| state | 0.9077 | 0.8792 | 0.8932 |
| fax_number | 0.9047 | 0.9013 | 0.9030 |
| company_name | 0.9048 | 0.9072 | 0.9060 |
| national_id | 0.8995 | 0.9224 | 0.9108 |
| education_level | 0.9269 | 0.8973 | 0.9118 |
| race_ethnicity | 0.9027 | 0.9388 | 0.9204 |
Limitations
- Language: These models work well only on English-language text.
- Challenging PII entity types: Some entity types, such as occupation, still have low F1-scores.