Vietnamese Address Normalizer

Normalizes Vietnamese address strings to canonical post-2025 administrative form. Uses a Transformer seq2seq model with province-constrained beam search over 187K clean canonical addresses.

Quick Start

git clone https://huggingface.co/<your-username>/vn-address-normalizer
cd vn-address-normalizer
pip install -r requirements.txt
python inference.py "p tan dinh q1 tphcm"

Expected output:

Input:       p tan dinh q1 tphcm
Canonical:   Phường Tân Định, Thành phố Hồ Chí Minh
Valid:       True
Province:    Thành phố Hồ Chí Minh
Ward hint:   tan dinh
Space:       170 candidates
Latency:     ~150 ms (warm), ~1.5 s (cold — model load + trie build)

Python API

from inference import normalize

# With or without Vietnamese diacritics
result = normalize("p tan dinh q1 tphcm")
result = normalize("Phường Tân Định, Quận 1, TP.HCM")
result = normalize("Xa Cu Chi TP HCM")

print(result["canonical"])  # "Phường Tân Định, Thành phố Hồ Chí Minh"
print(result["valid"])      # True
print(result["latency_ms"]) # ~150 ms warm

Return fields

Field Type Description
canonical str Normalized address; empty if not found
valid bool True if canonical exists in address database
confidence float Log-prob score (higher = more confident)
province str Resolved province name, or None
ward_hint str Detected ward slug, or None
search_space int Number of trie candidates searched
latency_ms float Wall-clock inference time in milliseconds

Input / Output Examples

Input Output
p tan dinh q1 tphcm Phường Tân Định, Thành phố Hồ Chí Minh
Phuong Ba Dinh Ha Noi Phường Ba Đình, Thành phố Hà Nội
Xa Cu Chi TP HCM Xã Củ Chi, Thành phố Hồ Chí Minh
P. Bến Nghé Q.1 HCM Phường Sài Gòn, Thành phố Hồ Chí Minh (pre-2025 rename)
phuong 14 quan 10 tphcm (empty — numbered wards removed post-2025)
duong le loi phuong ben nghe q1 tphcm Đường Lê Lợi, Phường Sài Gòn, Thành phố Hồ Chí Minh

Architecture

Seq2Seq Transformer (26M parameters):

  • Encoder: 4-layer, 256-dim, 4-head, GELU, Pre-LN
  • Decoder: 3-layer, same config
  • Input: character-level tokenization (287-token vocab)
  • Output: character-level decoding (269-token vocab)

Inference pipeline:

  1. Detect province from raw text (regex + alias table)
  2. Detect ward hint (regex + ward slug index + legacy ward map for pre-2025 names)
  3. Build constrained trie: ward candidates (10–500) → province candidates (3K–52K) → full (~187K)
  4. Province-constrained beam search (beam=5, max 96 steps)
  5. Result is guaranteed in the canonical address database (trie acceptance)

Training data:

  • 187K clean canonical addresses (34 provinces, 3,321 wards, post-2025 boundaries)
  • Coverage: 79.3% of Vietnamese address variants

Files

File Description
inference.py Standalone inference — the only file you need to run
model_v3_final/model.safetensors Model weights (26M params)
model_v3_final/config.json Model hyperparameters
model_v3_final/src_vocab.json Source character vocabulary
model_v3_final/tgt_vocab.json Target character vocabulary
model_v3_final/clean_canonicals.json 187K pre-filtered canonical addresses
model_v3_final/legacy_ward_idx.json Pre-2025 → 2025 ward name mapping (13K entries)

Limitations

  • ML-only mode. inference.py uses the neural model without the rule-based FST engine (which requires the vietnam_provinces package and a heavier index build). For production use with higher accuracy, integrate normalizer.py + fst.py from the full server-side stack.

  • Post-2025 boundaries only. Canonical addresses reflect the 2025 administrative reorganization. Pre-2025 ward names (e.g. "Bến Nghé" → "Sài Gòn") are handled via the legacy map, but coverage is not exhaustive.

  • Numbered wards rejected. Wards identified only by number (e.g. "Phường 14", "Quận 10") are rejected — these no longer exist post-2025.

  • 2-variable input. Model handles up to 3 address components (street, ward, province). Highly complex or apartment-style addresses may not simplify cleanly.

  • Cold start latency. First call takes ~2–4 s (model load + trie construction). Subsequent calls: ~10–150 ms depending on search space.

  • CPU inference only. GPU not required or used.

Requirements

torch>=1.13.0
unidecode>=1.3.0
safetensors>=0.3.0

Python 3.10+ required (uses structural pattern matching in pipeline internals).

Downloads last month
119
Safetensors
Model size
6.59M params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support