Polyglot Tagger: Multi-label Language Identification
For background, refer to polyglot-tagger/language-identification. This model is trained on the same dataset, but as a text classifier rather than as a token classifier.
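A minimal inference sketch follows. It assumes the checkpoint loads with `AutoModelForSequenceClassification` and exposes language names through `config.id2label`. The 0.5 threshold, the helper names `select_languages` and `identify`, and the per-label sigmoid decoding are illustrative assumptions, not settings documented in this card.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))


def select_languages(logits, labels, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score clears the threshold.

    Multi-label decoding scores every label independently, so a text may
    receive zero, one, or several languages. The 0.5 default is an assumption.
    """
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    return [(label, prob) for label, prob in scored if prob >= threshold]


def identify(text,
             model_id="polyglot-tagger/multilabel-language-identification",
             threshold=0.5):
    """Hypothetical helper: run the checkpoint on one string (downloads the model)."""
    # Imported lazily so select_languages works without transformers installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    labels = [model.config.id2label[i] for i in range(len(logits))]
    return select_languages(logits, labels, threshold)
```

Calling `identify("Bonjour, how are you?")` would return every label whose probability is at least the threshold. Tuning the threshold on held-out data trades precision against recall.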
This model is a fine-tuned version of xlm-roberta-base. It achieves the following results on the evaluation set:
- Loss: 0.0123
- Precision: 0.9859
- Recall: 0.9831
- F1: 0.9845
- Accuracy: 0.9412
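The reported F1 is consistent with the harmonic mean of the reported precision and recall, which the sketch below checks. It also illustrates one possible reason accuracy sits below F1: if accuracy is computed as exact-match (subset) accuracy over the full label set, a single missed language makes the whole example count as wrong. That interpretation is an assumption, since this card does not state how accuracy is computed, and the toy labels are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over per-example label sets."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label set matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Check the card's numbers: P = 0.9859, R = 0.9831.
print(round(f1_score(0.9859, 0.9831), 4))  # 0.9845

# Hypothetical toy data: the second example misses one of its two languages.
y_true = [{"en"}, {"en", "fr"}, {"de"}]
y_pred = [{"en"}, {"en"}, {"de"}]
precision, recall, f1 = micro_prf(y_true, y_pred)
print(precision, recall, round(f1, 4), round(subset_accuracy(y_true, y_pred), 4))
```

In the toy data, one partially correct prediction lowers recall only slightly but costs a full example under exact-match scoring, so F1 stays above accuracy.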
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- gradient_accumulation_steps: 18
- total_train_batch_size: 576
- optimizer: OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 2
- mixed_precision_training: Native AMP
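The hyperparameters above fit together arithmetically, as the sketch below shows. It assumes a single training device (the card does not list a device count) and no warmup (none is listed). The step total of 17094 is taken from the last row of the training results table. The function `linear_lr` is a hypothetical helper that mirrors the usual linear-decay rule, not code from the training run.

```python
per_device_batch = 32
gradient_accumulation_steps = 18
num_devices = 1  # assumption: device count is not stated in the card

# Examples consumed per optimizer update.
effective_batch = per_device_batch * gradient_accumulation_steps * num_devices
print(effective_batch)  # 576, matching total_train_batch_size

total_steps = 17094  # final optimizer step in the training results table
num_epochs = 2
steps_per_epoch = total_steps // num_epochs
print(steps_per_epoch)  # 8547

# Upper bound on training examples per epoch; the last batch may be partial.
max_examples_per_epoch = steps_per_epoch * effective_batch
print(max_examples_per_epoch)  # 4923072


def linear_lr(step, base_lr=5e-05, total=total_steps, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total - step)
    return base_lr * remaining / max(1, total - warmup_steps)


# Learning rate at the start, at the end of epoch 1, and at the final step.
for step in (0, steps_per_epoch, total_steps):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate falls from 5e-05 to half that value by the end of the first epoch and reaches zero at the final step.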
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.2186 | 0.2925 | 2500 | 0.0395 | 0.9778 | 0.9528 | 0.9651 | 0.8560 |
| 0.1331 | 0.5851 | 5000 | 0.0232 | 0.9803 | 0.9717 | 0.9760 | 0.9070 |
| 0.1044 | 0.8776 | 7500 | 0.0172 | 0.9828 | 0.9774 | 0.9801 | 0.9218 |
| 0.0851 | 1.1700 | 10000 | 0.0150 | 0.9844 | 0.9801 | 0.9822 | 0.9311 |
| 0.0783 | 1.4626 | 12500 | 0.0136 | 0.9859 | 0.9809 | 0.9834 | 0.9354 |
| 0.0705 | 1.7551 | 15000 | 0.0126 | 0.9861 | 0.9826 | 0.9843 | 0.9399 |
| 0.0692 | 2.0 | 17094 | 0.0123 | 0.9859 | 0.9831 | 0.9845 | 0.9412 |
Framework versions
- Transformers 5.5.4
- Pytorch 2.11.0+cu128
- Datasets 4.8.4
- Tokenizers 0.22.2
Model tree for polyglot-tagger/multilabel-language-identification
- Base model: FacebookAI/xlm-roberta-base