	---
	language:
	- vi
	license: bsd-3-clause
	base_model: facebook/mbart-large-50
	tags:
	- generated_from_trainer
	metrics:
	- bleu
	model-index:
	- name: PhoTextNormalization
	  results:
	  - task:
	      name: Translation
	      type: translation
	    metrics:
	    - name: Bleu
	      type: bleu
	      value: 88.8267
	---

	# PhoTextNormalization: Text normalization model for Vietnamese

	PhoTextNormalization converts Vietnamese text from written to spoken form. For example, "Một tháng có 30 hoặc 31 ngày, riêng tháng 2 có 28 ngày." will be converted to "một tháng có ba mươi hoặc ba mươi mốt ngày, riêng tháng hai có hai tám ngày."

	Details of the training can be found in our ACL 2025 paper, ["Zero-Shot Text-to-Speech for Vietnamese"](https://arxiv.org/abs/2506.01322). If you use this model in your work, please cite the paper:

	```bibtex
	@inproceedings{vu2025zeroshottexttospeechvietnamese,
	  title={Zero-Shot Text-to-Speech for Vietnamese},
	  author={Thi Vu and Linh The Nguyen and Dat Quoc Nguyen},
	  year={2025},
	  booktitle={Proceedings of ACL},
	}
	```

	## Usage
	```python
	import torch
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	device = "cuda:0" if torch.cuda.is_available() else "cpu"

	model_name = "thivux/PhoTextNormalization"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

	text = 'Một tháng có 30 hoặc 31 ngày, riêng tháng 2 có 28 ngày.'
	inputs = tokenizer(text, return_tensors="pt", padding=True,
	                   truncation=True, max_length=1024).to(device)

	# Generate the spoken-form token sequence with beam search
	with torch.no_grad():
	    generated_tokens = model.generate(
	        **inputs, max_length=1024, num_beams=5)

	# Decode token IDs back to text
	decoded_outputs = [tokenizer.decode(output, skip_special_tokens=True)
	                   for output in generated_tokens]

	# decoded_outputs: ['một tháng có ba mươi hoặc ba mươi mốt ngày, riêng tháng hai có hai tám ngày.']
	print(f'decoded_outputs: {decoded_outputs}')
	```
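
To normalize several sentences, the single-input snippet above can be wrapped in a small batching helper. The following is a minimal sketch, not part of the model's own API: `chunked`, `normalize_texts`, and `load_normalizer` are hypothetical names introduced here, the default batch size of 8 is an arbitrary assumption, and the sketch assumes the model accepts padded batches like other Hugging Face seq2seq models.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def normalize_texts(
    texts: Sequence[str],
    normalize_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumption: tune to the available memory
) -> List[str]:
    """Apply `normalize_batch` chunk by chunk, preserving input order."""
    outputs: List[str] = []
    for batch in chunked(texts, batch_size):
        outputs.extend(normalize_batch(batch))
    return outputs


def load_normalizer(model_name: str = "thivux/PhoTextNormalization"):
    """Return a batch function backed by the model (hypothetical helper).

    The heavy imports are deferred so the helpers above can be used
    and tested without torch or transformers installed.
    """
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

    def normalize_batch(batch: List[str]) -> List[str]:
        # Same settings as the single-sentence example, applied to a list
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=1024).to(device)
        with torch.no_grad():
            generated = model.generate(**inputs, max_length=1024, num_beams=5)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return normalize_batch
```

With the model downloaded, `normalize_texts(sentences, load_normalizer())` would return one spoken-form string per input sentence, in the same order. Passing the batch function as an argument keeps the chunking logic independent of the model, so it can be checked with a stub function.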