---
library_name: transformers
pipeline_tag: token-classification
tags:
- name-splitting
- ner
- modernbert
- names
language:
- es
- en
- pt
license: mit
model-index:
- name: tori2-bilineal
  results:
  - task:
      type: token-classification
      name: Name Splitting (bilineal)
    dataset:
      type: custom
      name: Bilineal eval split (MX/CO/ES/PE/CL)
    metrics:
    - type: f1
      value: 0.9948
      name: F1
    - type: precision
      value: 0.9948
      name: Precision
    - type: recall
      value: 0.9949
      name: Recall
- name: tori2-unilineal
  results:
  - task:
      type: token-classification
      name: Name Splitting (unilineal)
    dataset:
      type: custom
      name: Unilineal eval split (AR/US/BR/PT)
    metrics:
    - type: f1
      value: 0.9927
      name: F1
    - type: precision
      value: 0.9927
      name: Precision
    - type: recall
      value: 0.9927
      name: Recall
---
# Tori v2: Name Splitter

ModernBERT-base (149M params) fine-tuned for splitting full name strings into
**forenames** and **surnames** using BIO token classification.
## Evaluation Results

| Variant | F1 | Precision | Recall | Eval Dataset |
|---------|---:|----------:|-------:|--------------|
| **bilineal** (default) | 0.9948 | 0.9948 | 0.9949 | MX/CO/ES/PE/CL names |
| **unilineal** | 0.9927 | 0.9927 | 0.9927 | AR/US/BR/PT names |

Metrics are entity-level (seqeval): a predicted name span counts as correct only if every one of its tokens matches the reference.
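
To make the entity-level criterion concrete, here is a minimal pure-Python sketch of span-based scoring. It is a simplified stand-in for what seqeval computes, not its actual implementation, and the helper names `extract_spans` and `entity_f1` are hypothetical. The example tags are hand-written for illustration and are not model output.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, first token index, last token index)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence."""
    spans: Set[Span] = set()
    current_type, start = None, 0
    # A trailing "O" sentinel flushes a span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current_type
        if current_type is not None and not continues:
            spans.add((current_type, start, i - 1))
            current_type = None
        if prefix == "B" or (prefix == "I" and current_type is None):
            # Assumption: a stray "I-" with no open span starts a new span.
            current_type, start = etype, i
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exactly matching spans."""
    tp = n_pred = n_gold = 0
    for g_tags, p_tags in zip(gold, pred):
        g_spans, p_spans = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g_spans & p_spans)  # a span scores only on an exact match
        n_pred += len(p_spans)
        n_gold += len(g_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# "Juan Carlos García López": the prediction puts "Carlos" in the surname span.
gold = [["B-forenames", "I-forenames", "B-surnames", "I-surnames"]]
pred = [["B-forenames", "B-surnames", "I-surnames", "I-surnames"]]
print(entity_f1(gold, pred))  # (0.0, 0.0, 0.0)
print(entity_f1(gold, gold))  # (1.0, 1.0, 1.0)
```

One misplaced boundary makes both predicted spans wrong, so the example scores zero even though most tokens carry the right entity type. This is why entity-level scores are stricter than token-level accuracy.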
## Variants

| Variant | Countries | Surname Pattern | Subfolder |
|---------|-----------|-----------------|-----------|
| **bilineal** (default) | MX, CO, ES, PE, CL | Double surname (paternal + maternal) | `/` (root) |
| **unilineal** | AR, US, BR, PT | Single surname | `unilineal/` |
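
The table can be turned into a small routing helper. The sketch below is illustrative only: `variant_for_country`, `subfolder_for_variant` and `load_variant` are hypothetical names, the fallback to the default variant for unlisted countries is an assumed policy, and `load_variant` assumes each variant's tokenizer and weights can be loaded through the standard `subfolder` argument of `from_pretrained`.

```python
from typing import Optional

# Country groupings copied from the Variants table.
BILINEAL = {"MX", "CO", "ES", "PE", "CL"}
UNILINEAL = {"AR", "US", "BR", "PT"}


def variant_for_country(country_code: str) -> str:
    """Return the variant name for a two-letter country code."""
    code = country_code.strip().upper()
    if code in UNILINEAL:
        return "unilineal"
    # Assumed policy: unlisted countries fall back to the default variant.
    return "bilineal"


def subfolder_for_variant(variant: str) -> Optional[str]:
    """Map a variant name to its repository subfolder (None means repo root)."""
    return "unilineal" if variant == "unilineal" else None


def load_variant(variant: str):
    """Load tokenizer and model for a variant with plain Transformers."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    kwargs = {}
    subfolder = subfolder_for_variant(variant)
    if subfolder is not None:
        kwargs["subfolder"] = subfolder
    tokenizer = AutoTokenizer.from_pretrained("ittailup/tori2", **kwargs)
    model = AutoModelForTokenClassification.from_pretrained("ittailup/tori2", **kwargs)
    return tokenizer, model


print(variant_for_country("mx"))  # bilineal
print(variant_for_country("BR"))  # unilineal
```

Loading this way returns a raw tokenizer and model, so the token-level predictions still need custom subword aggregation before they can be read as names.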
## Usage

```python
from tori.inference import load_pipeline, split_name

# Default: bilineal model (double-surname countries)
pipe = load_pipeline("ittailup/tori2")
result = split_name(pipe, "Juan Carlos García López")
print(result.forenames)  # ['Juan', 'Carlos']
print(result.surnames)   # ['García', 'López']

# Unilineal model (single-surname countries)
pipe = load_pipeline("ittailup/tori2", variant="unilineal")
result = split_name(pipe, "John Michael Smith")
print(result.forenames)  # ['John', 'Michael']
print(result.surnames)   # ['Smith']
```
## Labels

- `O`: Outside any name entity
- `B-forenames`: Beginning of forename
- `I-forenames`: Inside forename (continuation)
- `B-surnames`: Beginning of surname
- `I-surnames`: Inside surname (continuation)
## Important: Custom Aggregation Required

This model uses ModernBERT's GPT-style BPE tokenizer, which encodes a leading space as a `Ġ` prefix and is
**not compatible** with Hugging Face's built-in `aggregation_strategy="simple"`.
Either use the `tori.inference` module, which handles subword aggregation correctly,
or pass `aggregation_strategy="none"` and merge the token-level predictions yourself using
character offsets.
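
For the do-it-yourself route, the following is one possible sketch of offset-based aggregation. It is not the `tori.inference` implementation. The function name `aggregate_by_offsets` is hypothetical, and two choices are assumptions: words are taken to be whitespace-delimited, and each word inherits the label of its first subword. The input is assumed to follow the shape a Transformers token-classification pipeline returns with `aggregation_strategy="none"`, namely dicts with `entity`, `start` and `end` keys.

```python
import re
from typing import Dict, List


def aggregate_by_offsets(text: str, token_preds: List[dict]) -> Dict[str, List[str]]:
    """Group token-level predictions into whole words using character offsets."""
    result: Dict[str, List[str]] = {"forenames": [], "surnames": []}
    # Treat each whitespace-delimited chunk of the input as one word.
    for match in re.finditer(r"\S+", text):
        w_start, w_end = match.span()
        # Tokens whose character span overlaps this word, in reading order.
        overlapping = sorted(
            (p for p in token_preds if p["start"] < w_end and p["end"] > w_start),
            key=lambda p: p["start"],
        )
        if not overlapping:
            continue
        # Assumption: the first subword's label decides the whole word.
        label = overlapping[0]["entity"]
        if label == "O":
            continue
        # Drop the BIO prefix ("B-" / "I-") to get the entity type.
        entity_type = label.split("-", 1)[1]
        if entity_type in result:
            result[entity_type].append(match.group())
    return result


# Hand-written predictions for illustration (not real model output);
# "García" is split into two subwords to exercise the merging.
text = "Juan Carlos García López"
preds = [
    {"entity": "B-forenames", "start": 0, "end": 4},
    {"entity": "I-forenames", "start": 5, "end": 11},
    {"entity": "B-surnames", "start": 12, "end": 15},
    {"entity": "I-surnames", "start": 15, "end": 18},
    {"entity": "I-surnames", "start": 19, "end": 24},
]
print(aggregate_by_offsets(text, preds))
# {'forenames': ['Juan', 'Carlos'], 'surnames': ['García', 'López']}

# With a real pipeline the call would look like this (downloads the model):
# from transformers import pipeline
# pipe = pipeline("token-classification", model="ittailup/tori2",
#                 aggregation_strategy="none")
# aggregate_by_offsets(text, pipe(text))
```

Because words are rebuilt by slicing the original string, the `Ġ` markers never reach the output, and the result is the same whether or not a token's offsets include its leading space.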
## Training

- **Base model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
- **Training data**: [philipperemy/name-dataset](https://github.com/philipperemy/name-dataset), Mexico SEP, RENAPER (AR), datos.gob.ar
- **Batch size**: 256 effective (128 per step × 2 gradient accumulation steps)
- **Learning rate**: 5e-5, cosine schedule with 10% warmup
- **Epochs**: 3
- **Precision**: bf16
- **Hardware**: NVIDIA A10G (AWS g5.xlarge)
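
As a rough illustration of how these hyperparameters fit together, the sketch below writes them as keyword arguments named after `transformers.TrainingArguments`. Whether the Hugging Face `Trainer` was actually used is an assumption, as are the helper `schedule_steps` and the 1,000,000-example dataset size, which is a made-up figure used only to show the arithmetic.

```python
import math

# Values copied from the Training list; the key names follow
# transformers.TrainingArguments, which is an assumed choice of trainer.
training_kwargs = {
    "per_device_train_batch_size": 128,
    "gradient_accumulation_steps": 2,
    "learning_rate": 5e-5,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "num_train_epochs": 3,
    "bf16": True,
}

# Effective batch size on a single GPU = per-device batch x accumulation steps.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 256


def schedule_steps(num_examples: int) -> tuple:
    """Approximate (total optimizer steps, warmup steps) for a dataset size."""
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    total = steps_per_epoch * training_kwargs["num_train_epochs"]
    warmup = math.ceil(total * training_kwargs["warmup_ratio"])
    return total, warmup


# Hypothetical dataset size, for illustration only.
print(schedule_steps(1_000_000))  # (11721, 1173)
```

The dictionary could be passed as `TrainingArguments(output_dir=..., **training_kwargs)` on hardware that supports bf16.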