Vietnamese Address Normalizer

Normalizes Vietnamese address strings to canonical post-2025 administrative form. Uses a Transformer seq2seq model with province-constrained beam search over 187K clean canonical addresses.

Quick Start

git clone https://huggingface.co/<your-username>/vn-address-normalizer
cd vn-address-normalizer
pip install -r requirements.txt
python inference.py "p tan dinh q1 tphcm"

Expected output:

Input:       p tan dinh q1 tphcm
Canonical:   Phường Tân Định, Thành phố Hồ Chí Minh
Valid:       True
Province:    Thành phố Hồ Chí Minh
Ward hint:   tan dinh
Space:       170 candidates
Latency:     ~150 ms (warm), ~1.5 s (cold — model load + trie build)

Python API

from inference import normalize

# With or without Vietnamese diacritics
result = normalize("p tan dinh q1 tphcm")
result = normalize("Phường Tân Định, Quận 1, TP.HCM")
result = normalize("Xa Cu Chi TP HCM")

print(result["canonical"])  # "Phường Tân Định, Thành phố Hồ Chí Minh"
print(result["valid"])      # True
print(result["latency_ms"]) # ~150 ms warm

Return fields

Field	Type	Description
`canonical`	str	Normalized address; empty if not found
`valid`	bool	True if canonical exists in address database
`confidence`	float	Log-prob score (higher = more confident)
`province`	str	Resolved province name, or None
`ward_hint`	str	Detected ward slug, or None
`search_space`	int	Number of trie candidates searched
`latency_ms`	float	Wall-clock inference time in milliseconds

Input / Output Examples

Input	Output
`p tan dinh q1 tphcm`	`Phường Tân Định, Thành phố Hồ Chí Minh`
`Phuong Ba Dinh Ha Noi`	`Phường Ba Đình, Thành phố Hà Nội`
`Xa Cu Chi TP HCM`	`Xã Củ Chi, Thành phố Hồ Chí Minh`
`P. Bến Nghé Q.1 HCM`	`Phường Sài Gòn, Thành phố Hồ Chí Minh` (pre-2025 rename)
`phuong 14 quan 10 tphcm`	(empty — numbered wards removed post-2025)
`duong le loi phuong ben nghe q1 tphcm`	`Đường Lê Lợi, Phường Sài Gòn, Thành phố Hồ Chí Minh`

Architecture

Seq2Seq Transformer (26M parameters):

Encoder: 4-layer, 256-dim, 4-head, GELU, Pre-LN
Decoder: 3-layer, same config
Input: character-level tokenization (287-token vocab)
Output: character-level decoding (269-token vocab)

Inference pipeline:

Detect province from raw text (regex + alias table)
Detect ward hint (regex + ward slug index + legacy ward map for pre-2025 names)
Build constrained trie: ward candidates (~~10–500) → province candidates (~~3K–52K) → full (~187K)
Province-constrained beam search (beam=5, max 96 steps)
Result is guaranteed in the canonical address database (trie acceptance)

Training data:

187K clean canonical addresses (34 provinces, 3,321 wards, post-2025 boundaries)
Coverage: 79.3% of Vietnamese address variants

Files

File	Description
`inference.py`	Standalone inference — the only file you need to run
`model_v3_final/model.safetensors`	Model weights (26M params)
`model_v3_final/config.json`	Model hyperparameters
`model_v3_final/src_vocab.json`	Source character vocabulary
`model_v3_final/tgt_vocab.json`	Target character vocabulary
`model_v3_final/clean_canonicals.json`	187K pre-filtered canonical addresses
`model_v3_final/legacy_ward_idx.json`	Pre-2025 → 2025 ward name mapping (13K entries)

Limitations

ML-only mode. inference.py uses the neural model without the rule-based FST engine (which requires the vietnam_provinces package and a heavier index build). For production use with higher accuracy, integrate normalizer.py + fst.py from the full server-side stack.
Post-2025 boundaries only. Canonical addresses reflect the 2025 administrative reorganization. Pre-2025 ward names (e.g. "Bến Nghé" → "Sài Gòn") are handled via the legacy map, but coverage is not exhaustive.
Numbered wards rejected. Wards identified only by number (e.g. "Phường 14", "Quận 10") are rejected — these no longer exist post-2025.
2-variable input. Model handles up to 3 address components (street, ward, province). Highly complex or apartment-style addresses may not simplify cleanly.
Cold start latency. First call takes ~2–4 s (model load + trie construction). Subsequent calls: ~10–150 ms depending on search space.
CPU inference only. GPU not required or used.

Requirements

torch>=1.13.0
unidecode>=1.3.0
safetensors>=0.3.0

Python 3.10+ required (uses structural pattern matching in pipeline internals).

Downloads last month: 119

Safetensors

Model size

6.59M params

Tensor type

F32

Inference Providers NEW

Other

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support