	---
	library_name: transformers
	pipeline_tag: token-classification
	tags:
	- name-splitting
	- ner
	- modernbert
	- names
	language:
	- es
	- en
	- pt
	license: mit
	model-index:
	- name: tori2-bilineal
	  results:
	  - task:
	      type: token-classification
	      name: Name Splitting (bilineal)
	    dataset:
	      type: custom
	      name: Bilineal eval split (MX/CO/ES/PE/CL)
	    metrics:
	    - type: f1
	      value: 0.9948
	      name: F1
	    - type: precision
	      value: 0.9948
	      name: Precision
	    - type: recall
	      value: 0.9949
	      name: Recall
	- name: tori2-unilineal
	  results:
	  - task:
	      type: token-classification
	      name: Name Splitting (unilineal)
	    dataset:
	      type: custom
	      name: Unilineal eval split (AR/US/BR/PT)
	    metrics:
	    - type: f1
	      value: 0.9927
	      name: F1
	    - type: precision
	      value: 0.9927
	      name: Precision
	    - type: recall
	      value: 0.9927
	      name: Recall
	---

	# Tori v2 — Name Splitter

	ModernBERT-base (149M params) fine-tuned for splitting full name strings into
	forenames and surnames using BIO token classification.

	## Evaluation Results

	| Variant | F1 | Precision | Recall | Eval Dataset |
	|---------|---:|----------:|-------:|--------------|
	| bilineal (default) | 0.9948 | 0.9948 | 0.9949 | MX/CO/ES/PE/CL names |
	| unilineal | 0.9927 | 0.9927 | 0.9927 | AR/US/BR/PT names |

	Metrics are entity-level (seqeval): a name span counts as correct only if all of its tokens match.
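
To make the entity-level criterion concrete, here is a minimal pure-Python sketch of span-based scoring. It is a simplified illustration in the spirit of seqeval, not seqeval's own code, and the helper names `extract_spans` and `entity_scores` as well as the sample tag sequences are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, exclusive end index)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect (type, start, end) spans from one BIO-tagged sequence."""
    spans: Set[Span] = set()
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            spans.add((current, start, i))
            start, current = None, None
        # A B- tag always opens a span; a stray I- tag opens one too.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, label
    return spans


def entity_scores(gold: List[List[str]], pred: List[List[str]]):
    """Return (precision, recall, F1) over exactly matching spans."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = extract_spans(g), extract_spans(p)
        tp += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the last surname token is mislabeled as O.
gold = [["B-forenames", "I-forenames", "B-surnames", "I-surnames"]]
pred = [["B-forenames", "I-forenames", "B-surnames", "O"]]
print(entity_scores(gold, pred))  # (0.5, 0.5, 0.5)
```

Three of the four tokens are labeled correctly, yet the surname span is truncated, so only one of the two spans counts as a match.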

	## Variants

	| Variant | Countries | Surname Pattern | Subfolder |
	|---------|-----------|-----------------|-----------|
	| bilineal (default) | MX, CO, ES, PE, CL | Double surname (paternal + maternal) | `/` (root) |
	| unilineal | AR, US, BR, PT | Single surname | `unilineal/` |

	## Usage

	```python
	from tori.inference import load_pipeline, split_name

	# Default: bilineal model (double-surname countries)
	pipe = load_pipeline("ittailup/tori2")
	result = split_name(pipe, "Juan Carlos García López")
	print(result.forenames) # ['Juan', 'Carlos']
	print(result.surnames) # ['García', 'López']

	# Unilineal model (single-surname countries)
	pipe = load_pipeline("ittailup/tori2", variant="unilineal")
	result = split_name(pipe, "John Michael Smith")
	print(result.forenames) # ['John', 'Michael']
	print(result.surnames) # ['Smith']
	```

	## Labels

	- `O` — Outside any name entity
	- `B-forenames` — Beginning of forename
	- `I-forenames` — Inside forename (continuation)
	- `B-surnames` — Beginning of surname
	- `I-surnames` — Inside surname (continuation)
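
As a rough illustration of how these labels map to the final output, the sketch below buckets already-aggregated words by the entity type of their tag and drops `O`. The function name `group_by_label`, the input words, and the tag sequence are hypothetical; the exact tags the model emits for a given name may differ.

```python
from typing import Dict, List


def group_by_label(words: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Bucket words by the entity type of their BIO tag, ignoring O."""
    groups: Dict[str, List[str]] = {"forenames": [], "surnames": []}
    for word, tag in zip(words, tags):
        # "B-surnames" -> "surnames"; "O" -> "" (which matches no bucket)
        _, _, label = tag.partition("-")
        if label in groups:
            groups[label].append(word)
    return groups


# Hypothetical word-level tags; "Dr." is tagged O and therefore dropped.
words = ["Dr.", "Juan", "Carlos", "García", "López"]
tags = ["O", "B-forenames", "I-forenames", "B-surnames", "I-surnames"]
print(group_by_label(words, tags))
# {'forenames': ['Juan', 'Carlos'], 'surnames': ['García', 'López']}
```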

	## Important: Custom Aggregation Required

	This model uses ModernBERT's GPT-style byte-level BPE tokenizer, which encodes
	a leading space as a `Ġ` prefix. This is not compatible with Hugging Face's
	built-in `aggregation_strategy="simple"`. Either use the `tori.inference`
	module, which handles subword aggregation correctly, or set
	`aggregation_strategy="none"` and aggregate tokens yourself using character
	offsets.
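
For the do-it-yourself route, here is a minimal sketch of offset-based aggregation. It assumes token dicts shaped like the output of a `transformers` token-classification pipeline run with `aggregation_strategy="none"` (keys `entity`, `start`, `end`). The function name `aggregate_by_offsets`, the sample subword boundaries, and the rule that each word takes the label of its first subword are assumptions for illustration, not necessarily what `tori.inference` does.

```python
import re
from typing import Dict, List


def aggregate_by_offsets(text: str, tokens: List[dict]) -> Dict[str, List[str]]:
    """Merge subword predictions into whole words via character offsets.

    Each whitespace-delimited word takes the label of its first overlapping
    subword (an assumed rule); words labeled O are dropped.
    """
    result: Dict[str, List[str]] = {"forenames": [], "surnames": []}
    ordered = sorted(tokens, key=lambda t: t["start"])
    for match in re.finditer(r"\S+", text):
        # Half-open overlap test; it works whether or not a token's
        # offsets include the space that precedes the word.
        pieces = [t for t in ordered
                  if t["start"] < match.end() and t["end"] > match.start()]
        if not pieces:
            continue
        label = pieces[0]["entity"].partition("-")[2]  # strip B-/I-
        if label in result:
            result[label].append(match.group())
    return result


text = "Juan Carlos García López"
# Hypothetical subword predictions; "Carlos" is split into two pieces.
tokens = [
    {"entity": "B-forenames", "start": 0, "end": 4},
    {"entity": "I-forenames", "start": 5, "end": 8},
    {"entity": "I-forenames", "start": 8, "end": 11},
    {"entity": "B-surnames", "start": 12, "end": 18},
    {"entity": "I-surnames", "start": 19, "end": 24},
]
print(aggregate_by_offsets(text, tokens))
# {'forenames': ['Juan', 'Carlos'], 'surnames': ['García', 'López']}
```

Working from the original text and offsets avoids decoding `Ġ`-prefixed token strings by hand, since the words are sliced directly from the input.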

	## Training

	- Base model: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) (149M params)
	- Training data: [philipperemy/name-dataset](https://github.com/philipperemy/name-dataset), Mexico SEP, RENAPER (AR), datos.gob.ar
	- Batch size: 256 effective (128 x 2 gradient accumulation)
	- Learning rate: 5e-5, cosine schedule with 10% warmup
	- Epochs: 3
	- Precision: bf16
	- Hardware: NVIDIA A10G (AWS g5.xlarge)
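
The batch-size and schedule settings above can be worked through in a few lines. This is a sketch under stated assumptions: warmup is taken to be linear from zero and the cosine decay is taken to end at zero, as in common implementations, while the card itself only says "cosine schedule with 10% warmup". The function name `lr_at_step` and the total of 1000 optimizer steps are hypothetical.

```python
import math

# Effective batch size: per-device batch times gradient accumulation steps.
effective_batch = 128 * 2  # 256


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-5, warmup_frac: float = 0.10) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine from peak_lr (progress 0) down to 0 (progress 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical step count, for illustration only
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total))
```

With these assumptions the rate reaches its 5e-5 peak at step 100 (10% of 1000), falls to half the peak midway through the decay, and ends at zero.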