|
|
---
license: apache-2.0
library_name: scikit-learn
tags:
- scikit-learn
- sklearn
- text-classification
- vietnamese
- nlp
- sonar
- tf-idf
- logistic-regression
- svc
- support-vector-classification
datasets:
- vntc
- undertheseanlp/UTS2017_Bank
metrics:
- accuracy
- precision
- recall
- f1-score
model-index:
- name: sonar-core-1
  results:
  - task:
      type: text-classification
      name: Vietnamese News Classification
    dataset:
      name: VNTC
      type: vntc
    metrics:
    - type: accuracy
      value: 0.9280
      name: Test Accuracy (SVC)
    - type: precision
      value: 0.92
      name: Weighted Precision
    - type: recall
      value: 0.92
      name: Weighted Recall
    - type: f1-score
      value: 0.92
      name: Weighted F1-Score
  - task:
      type: text-classification
      name: Vietnamese Banking Text Classification
    dataset:
      name: UTS2017_Bank
      type: undertheseanlp/UTS2017_Bank
    metrics:
    - type: accuracy
      value: 0.7247
      name: Test Accuracy (SVC)
    - type: precision
      value: 0.65
      name: Weighted Precision (SVC)
    - type: recall
      value: 0.72
      name: Weighted Recall (SVC)
    - type: f1-score
      value: 0.66
      name: Weighted F1-Score (SVC)
language:
- vi
pipeline_tag: text-classification
---
|
|
|
|
|
# Sonar Core 1 - Vietnamese Text Classification Model |
|
|
|
|
|
A machine learning text classification model for Vietnamese. It combines a TF-IDF feature-extraction pipeline with Support Vector Classification (SVC) and Logistic Regression classifiers, achieving **92.80% accuracy** on the VNTC (news) dataset and **72.47% accuracy** on the UTS2017_Bank (banking) dataset with SVC.
|
|
|
|
|
📋 **[View Detailed System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md)** for comprehensive model documentation, performance analysis, and limitations. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**Sonar Core 1** is a Vietnamese text classification model that covers multiple domains, including news categorization and banking text classification. It is intended for Vietnamese news article classification, banking text categorization, general content categorization, and document organization and tagging.
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
- **Algorithm**: TF-IDF + SVC/Logistic Regression Pipeline |
|
|
- **Feature Extraction**: CountVectorizer with 20,000 max features |
|
|
- **N-gram Support**: Unigram and bigram (1-2) |
|
|
- **TF-IDF**: IDF-weighted transformation applied to the count features
|
|
- **Classifier**: Support Vector Classification (SVC) / Logistic Regression with optimized parameters |
|
|
- **Framework**: scikit-learn ≥1.6 |
|
|
- **Caching System**: Hash-based caching for efficient processing |
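
The pipeline described above maps onto standard scikit-learn components. Below is a minimal sketch; the component choices (`CountVectorizer` + `TfidfTransformer` + `LogisticRegression`, with a linear SVM for the SVC variant) and hyperparameters are assumptions based on this card, not the exact configuration of the released models.

```python
# Minimal sketch of the TF-IDF + linear classifier pipeline (assumed components).
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # Bag-of-words counts over unigrams and bigrams, capped at 20,000 features
    ("counts", CountVectorizer(max_features=20000, ngram_range=(1, 2))),
    # Apply IDF weighting on top of the raw counts
    ("tfidf", TfidfTransformer(use_idf=True)),
    # Linear classifier; a LinearSVC could be substituted for the SVC variant
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy usage with placeholder labels from the VNTC category set
texts = ["Đội tuyển bóng đá Việt Nam giành chiến thắng", "Chính phủ thông qua luật mới"]
labels = ["the_thao", "chinh_tri_xa_hoi"]
pipeline.fit(texts, labels)
print(pipeline.predict(["Trận đấu bóng đá tối nay"]))
```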
|
|
|
|
|
## Supported Datasets & Categories |
|
|
|
|
|
### VNTC Dataset - News Categories (10 classes) |
|
|
1. **chinh_tri_xa_hoi** - Politics and Society |
|
|
2. **doi_song** - Lifestyle |
|
|
3. **khoa_hoc** - Science |
|
|
4. **kinh_doanh** - Business |
|
|
5. **phap_luat** - Law |
|
|
6. **suc_khoe** - Health |
|
|
7. **the_gioi** - World News |
|
|
8. **the_thao** - Sports |
|
|
9. **van_hoa** - Culture |
|
|
10. **vi_tinh** - Information Technology |
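
The codes above are the class labels the VNTC model predicts. For display purposes they can be mapped to English names; the helper below is purely illustrative and not part of the model.

```python
# English display names for the VNTC label codes (illustrative helper).
VNTC_LABELS = {
    "chinh_tri_xa_hoi": "Politics and Society",
    "doi_song": "Lifestyle",
    "khoa_hoc": "Science",
    "kinh_doanh": "Business",
    "phap_luat": "Law",
    "suc_khoe": "Health",
    "the_gioi": "World News",
    "the_thao": "Sports",
    "van_hoa": "Culture",
    "vi_tinh": "Information Technology",
}

def display_label(code: str) -> str:
    # Fall back to the raw code for any label missing from the mapping
    return VNTC_LABELS.get(code, code)

print(display_label("the_thao"))  # Sports
```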
|
|
|
|
|
### UTS2017_Bank Dataset - Banking Categories (14 classes) |
|
|
1. **ACCOUNT** - Account services |
|
|
2. **CARD** - Card services |
|
|
3. **CUSTOMER_SUPPORT** - Customer support |
|
|
4. **DISCOUNT** - Discount offers |
|
|
5. **INTEREST_RATE** - Interest rate information |
|
|
6. **INTERNET_BANKING** - Internet banking services |
|
|
7. **LOAN** - Loan services |
|
|
8. **MONEY_TRANSFER** - Money transfer services |
|
|
9. **OTHER** - Other services |
|
|
10. **PAYMENT** - Payment services |
|
|
11. **PROMOTION** - Promotional offers |
|
|
12. **SAVING** - Savings accounts |
|
|
13. **SECURITY** - Security features |
|
|
14. **TRADEMARK** - Trademark/branding |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash
pip install "scikit-learn>=1.6" joblib huggingface_hub
```
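
The `huggingface_hub` package is included so the pre-trained models in the usage sections below can be downloaded. A quick, purely illustrative check that the environment is ready:

```python
# Illustrative environment check: confirm the required libraries import and
# that the installed scikit-learn version meets the >=1.6 requirement.
import sklearn
import joblib
import huggingface_hub

print("scikit-learn:", sklearn.__version__)
print("joblib:", joblib.__version__)
print("huggingface_hub:", huggingface_hub.__version__)
```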
|
|
|
|
|
## Usage |
|
|
|
|
|
### Training the Model |
|
|
|
|
|
#### VNTC Dataset (News Classification) |
|
|
```bash
# Default training with VNTC dataset
python train.py --dataset vntc --model logistic

# With specific parameters
python train.py --dataset vntc --model logistic --max-features 20000 --ngram-min 1 --ngram-max 2
```
|
|
|
|
|
#### UTS2017_Bank Dataset (Banking Text Classification) |
|
|
```bash
# Train with UTS2017_Bank dataset (SVC recommended)
python train.py --dataset uts2017 --model svc_linear

# Train with Logistic Regression
python train.py --dataset uts2017 --model logistic

# With specific parameters (SVC)
python train.py --dataset uts2017 --model svc_linear --max-features 20000 --ngram-min 1 --ngram-max 2

# Compare multiple configurations
python train.py --dataset uts2017 --compare
```
|
|
|
|
|
### Training from Scratch |
|
|
|
|
|
```python
from train import train_notebook

# Train VNTC model
vntc_results = train_notebook(
    dataset="vntc",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2
)

# Train UTS2017_Bank model
bank_results = train_notebook(
    dataset="uts2017",
    model_name="logistic",
    max_features=20000,
    ngram_min=1,
    ngram_max=2
)
```
|
|
|
|
|
## Performance Metrics |
|
|
|
|
|
### VNTC Dataset Performance |
|
|
- **Training Accuracy**: 95.39% |
|
|
- **Test Accuracy (SVC)**: 92.80% |
|
|
- **Test Accuracy (Logistic Regression)**: 92.33% |
|
|
- **Training Samples**: 33,759 |
|
|
- **Test Samples**: 50,373 |
|
|
- **Training Time (SVC)**: ~54.6 minutes |
|
|
- **Training Time (Logistic Regression)**: ~31.40 seconds |
|
|
- **Best Performing**: Sports (98% F1-score) |
|
|
- **Challenging Category**: Lifestyle (76% F1-score) |
|
|
|
|
|
### UTS2017_Bank Dataset Performance |
|
|
- **Training Accuracy (SVC)**: 95.07% |
|
|
- **Test Accuracy (SVC)**: 72.47% |
|
|
- **Test Accuracy (Logistic Regression)**: 70.96% |
|
|
- **Training Samples**: 1,581 |
|
|
- **Test Samples**: 396 |
|
|
- **Training Time (SVC)**: ~5.3 seconds |
|
|
- **Training Time (Logistic Regression)**: ~0.78 seconds |
|
|
- **Best Performing**: TRADEMARK (89% F1-score with SVC), CUSTOMER_SUPPORT (77% F1-score with SVC) |
|
|
- **SVC Improvements**: LOAN (+0.50 F1), DISCOUNT (+0.22 F1), INTEREST_RATE (+0.18 F1) |
|
|
- **Challenges**: Many minority classes with insufficient training data |
|
|
|
|
|
## Using the Pre-trained Models |
|
|
|
|
|
### VNTC Model (Vietnamese News Classification) |
|
|
|
|
|
```python
from huggingface_hub import hf_hub_download
import joblib

# Download and load VNTC model
vntc_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
)

# Enhanced prediction function
def predict_text(model, text):
    probabilities = model.predict_proba([text])[0]

    # Get top 3 predictions sorted by probability
    top_indices = probabilities.argsort()[-3:][::-1]
    top_predictions = []
    for idx in top_indices:
        category = model.classes_[idx]
        prob = probabilities[idx]
        top_predictions.append((category, prob))

    # The prediction should be the top category
    prediction = top_predictions[0][0]
    confidence = top_predictions[0][1]

    return prediction, confidence, top_predictions

# Make prediction on news text
news_text = "Đội tuyển bóng đá Việt Nam giành chiến thắng"
prediction, confidence, top_predictions = predict_text(vntc_model, news_text)

print(f"News category: {prediction}")
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")
```
|
|
|
|
|
### UTS2017_Bank Model (Vietnamese Banking Text Classification) |
|
|
|
|
|
```python
from huggingface_hub import hf_hub_download
import joblib

# Download and load UTS2017_Bank model (latest SVC model)
bank_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
)

# Enhanced prediction function (same as above)
def predict_text(model, text):
    probabilities = model.predict_proba([text])[0]

    # Get top 3 predictions sorted by probability
    top_indices = probabilities.argsort()[-3:][::-1]
    top_predictions = []
    for idx in top_indices:
        category = model.classes_[idx]
        prob = probabilities[idx]
        top_predictions.append((category, prob))

    # The prediction should be the top category
    prediction = top_predictions[0][0]
    confidence = top_predictions[0][1]

    return prediction, confidence, top_predictions

# Make prediction on banking text
bank_text = "Tôi muốn mở tài khoản tiết kiệm"
prediction, confidence, top_predictions = predict_text(bank_model, bank_text)

print(f"Banking category: {prediction}")
print(f"Confidence: {confidence:.3f}")
print("Top 3 predictions:")
for i, (category, prob) in enumerate(top_predictions, 1):
    print(f"  {i}. {category}: {prob:.3f}")
```
|
|
|
|
|
### Using Both Models |
|
|
|
|
|
```python
from huggingface_hub import hf_hub_download
import joblib

# Load both models
vntc_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "vntc_classifier_20250927_161550.joblib")
)
bank_model = joblib.load(
    hf_hub_download("undertheseanlp/sonar_core_1", "uts2017_bank_classifier_20250928_060819.joblib")
)

# Enhanced prediction function for both models
def predict_text(model, text):
    probabilities = model.predict_proba([text])[0]

    # Get top 3 predictions sorted by probability
    top_indices = probabilities.argsort()[-3:][::-1]
    top_predictions = []
    for idx in top_indices:
        category = model.classes_[idx]
        prob = probabilities[idx]
        top_predictions.append((category, prob))

    # The prediction should be the top category
    prediction = top_predictions[0][0]
    confidence = top_predictions[0][1]

    return prediction, confidence, top_predictions

# Function to classify any Vietnamese text
def classify_vietnamese_text(text, domain="auto"):
    """
    Classify Vietnamese text using the appropriate model with detailed predictions

    Args:
        text: Vietnamese text to classify
        domain: "news", "banking", or "auto" to detect domain

    Returns:
        tuple: (prediction, confidence, top_predictions, domain_used)
    """
    if domain == "news":
        prediction, confidence, top_predictions = predict_text(vntc_model, text)
        return prediction, confidence, top_predictions, "news"
    elif domain == "banking":
        prediction, confidence, top_predictions = predict_text(bank_model, text)
        return prediction, confidence, top_predictions, "banking"
    else:
        # Try both models and return higher confidence
        news_pred, news_conf, news_top = predict_text(vntc_model, text)
        bank_pred, bank_conf, bank_top = predict_text(bank_model, text)

        if news_conf > bank_conf:
            return f"NEWS: {news_pred}", news_conf, news_top, "news"
        else:
            return f"BANKING: {bank_pred}", bank_conf, bank_top, "banking"

# Examples
examples = [
    "Đội tuyển bóng đá Việt Nam thắng 2-0",
    "Tôi muốn vay tiền mua nhà",
    "Chính phủ thông qua luật mới"
]

for text in examples:
    category, confidence, top_predictions, domain = classify_vietnamese_text(text)
    print(f"Text: {text}")
    print(f"Category: {category}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Domain: {domain}")
    print("Top 3 predictions:")
    for i, (cat, prob) in enumerate(top_predictions, 1):
        print(f"  {i}. {cat}: {prob:.3f}")
    print()
```
|
|
|
|
|
## Model Parameters |
|
|
|
|
|
- `dataset`: Dataset to use ("vntc" or "uts2017") |
|
|
- `model`: Model type ("logistic" or "svc_linear", as in the training commands above; SVC recommended for best accuracy)
|
|
- `max_features`: Maximum number of TF-IDF features (default: 20000) |
|
|
- `ngram_min/max`: N-gram range (default: 1-2) |
|
|
- `split_ratio`: Train/test split ratio for UTS2017 (default: 0.2) |
|
|
- `n_samples`: Optional sample limit for quick testing |
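
As a rough sketch, these parameters can be combined for a quick, reduced-size test run. The `split_ratio` and `n_samples` keyword names below are assumptions taken from this list; check `train.py` for the exact `train_notebook` signature before relying on them.

```python
# Illustrative quick test run; split_ratio and n_samples are assumed keyword
# names based on the parameter list above, not a confirmed API.
from train import train_notebook

results = train_notebook(
    dataset="uts2017",
    model_name="svc_linear",
    max_features=20000,
    ngram_min=1,
    ngram_max=2,
    split_ratio=0.2,  # train/test split ratio for UTS2017
    n_samples=500,    # limit samples for a fast smoke test
)
```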
|
|
|
|
|
## Limitations |
|
|
|
|
|
1. **Language Specificity**: Only works with Vietnamese text |
|
|
2. **Domain Specificity**: Optimized for specific domains (news and banking) |
|
|
3. **Feature Limitations**: Limited to 20,000 most frequent features |
|
|
4. **Class Imbalance Sensitivity**: Performance degrades with imbalanced datasets |
|
|
5. **Specific Weaknesses**: |
|
|
- VNTC: Lower performance on lifestyle category (71% recall) |
|
|
- UTS2017_Bank: Poor performance on minority classes despite SVC improvements |
|
|
- SVC requires longer training time compared to Logistic Regression |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- Model reflects biases present in training datasets |
|
|
- Performance varies significantly across categories |
|
|
- Should be validated on target domain before deployment |
|
|
- Consider class imbalance when interpreting results |
|
|
|
|
|
## Additional Information |
|
|
|
|
|
- **Repository**: https://huggingface.co/undertheseanlp/sonar_core_1 |
|
|
- **Framework Version**: scikit-learn ≥1.6 |
|
|
- **Python Version**: 3.10+ |
|
|
- **System Card**: See [Sonar Core 1 - System Card](https://huggingface.co/undertheseanlp/sonar_core_1/blob/main/Sonar%20Core%201%20-%20System%20Card.md) for detailed documentation |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{undertheseanlp_2025,
  author    = {undertheseanlp},
  title     = {Sonar Core 1 - Vietnamese Text Classification Model},
  year      = 2025,
  url       = {https://huggingface.co/undertheseanlp/sonar_core_1},
  doi       = {10.57967/hf/6599},
  publisher = {Hugging Face}
}
```