CausaSent PhoBERT — Vietnamese Joint ATE + ABSA Tagger
Two-head fine-tune of PhoBERT-large for Vietnamese e-commerce reviews:
- Head A — Aspect Term Extraction with closed 7-class category via 15-way BIO tagging.
- Head B — Binary sentiment (positive / negative) at every B-token position.
Trained with an auxiliary supervised contrastive loss on B-token encoder representations grouped by aspect category (λ = 0.1). Backbone is vinai/phobert-large (Vietnamese RoBERTa, 370M params). Trained on CausaSent ATE v2 — 7,066 reviews / 10,307 annotations.
📦 GitHub: https://github.com/tamir39/CausaSent · 📊 Dataset: Tamir39/causasent-ate-v2
Closed taxonomy (7 aspect categories)
delivery · packaging · product_quality · price · customer_service · usability · appearance
Sentiment is binary: positive / negative. Neutral was dropped from upstream gold annotations (under 3% of records, insufficient for stable 3-class training).
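As a sketch, the 15-way tag set used by Head A can be reconstructed from this taxonomy: one `B-` and one `I-` tag per category, plus a single `O` tag. The label strings below are illustrative; check the checkpoint's `config` for the exact naming.

```python
# Illustrative reconstruction of the 15-way BIO label space:
# 7 aspect categories × {B, I} + one O tag = 15 labels.
ASPECTS = [
    "delivery", "packaging", "product_quality", "price",
    "customer_service", "usability", "appearance",
]

LABELS = ["O"] + [f"{prefix}-{a}" for a in ASPECTS for prefix in ("B", "I")]

print(len(LABELS))  # 15
```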
Architecture
```
Vietnamese review (raw)
        │
        ▼  VnCoreNLP word segmenter ─► PhoBERT BPE tokenizer
        │
        ▼
PhoBERT-large encoder (24 layers, hidden=1024, BPE vocab 64k)
        │
        ├──► Linear(1024 → 15) ─► Head A — BIO over 7 aspect categories
        │
        └──► Linear(1024 → 2)  ─► Head B — sentiment at B-token positions
        │                          (other positions masked with IGNORE_INDEX)
        ▼
Supervised contrastive auxiliary loss on B-token representations,
grouped by aspect_category (λ = 0.1)
```
Total loss: L = L_ATE + λ_s · L_sent + λ_c · L_con with λ_s = 1.0, λ_c = 0.1.
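A minimal NumPy sketch of the auxiliary supervised contrastive term (a Khosla-style SupCon over B-token representations grouped by aspect category) and the weighted total above. This is an assumption-laden illustration; the repository's actual implementation may differ in masking, batching, and temperature.

```python
import numpy as np

def supcon_loss(reps, labels, temperature=0.07):
    """Supervised contrastive loss over B-token representations.
    reps: (n, d) array; labels: aspect-category id per representation."""
    z = reps / np.linalg.norm(reps, axis=1, keepdims=True)  # L2-normalize
    sim = np.exp(z @ z.T / temperature)                     # exp(similarity / T)
    n = len(labels)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        denom = sum(sim[i, j] for j in range(n) if j != i)
        total += -np.mean([np.log(sim[i, j] / denom) for j in positives])
        anchors += 1
    return total / max(anchors, 1)

# Weighted total as stated in the card: L = L_ATE + λ_s·L_sent + λ_c·L_con
def total_loss(l_ate, l_sent, l_con, lam_s=1.0, lam_c=0.1):
    return l_ate + lam_s * l_sent + lam_c * l_con
```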
Training setup
| Hyperparameter | Value |
|---|---|
| Backbone | vinai/phobert-large (370M params) |
| Word segmenter | VnCoreNLP-1.1.1 (annotators=wseg) |
| Tokenizer | PhoBERT BPE (vocab 64k) |
| Optimizer | AdamW, lr 2e-5, weight_decay 0.01 |
| Batch size | 16 |
| Max sequence length | 128 tokens |
| Epochs | 8 (best at epoch 6) |
| LR schedule | 10% linear warmup → linear decay |
| Loss | Class-weighted CE (both heads) + SupCon (λ=0.1) |
| Precision | fp16 |
| Hardware | Tesla P100 16GB (Kaggle Notebook) |
| Wall-clock | ≈ 50 minutes |
| Seed | 42 |
| Model selection | argmax_epoch mean(F1_ATE_entity, F1_sent_macro) |
The best checkpoint reached a validation mean of 0.6029 at epoch 6 (F1_ATE = 0.328, F1_sent = 0.874). Epochs 7–8 scored lower due to mild cross-entropy overfitting on minority classes.
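The selection rule amounts to an argmax over epochs of the two validation F1s' mean. In the sketch below, only the epoch-6 pair comes from this card; the other epochs' numbers are invented placeholders for illustration.

```python
# Pick the epoch maximizing mean(entity-level ATE F1, macro sentiment F1).
val_history = {
    5: (0.310, 0.860),  # placeholder
    6: (0.328, 0.874),  # reported best checkpoint
    7: (0.305, 0.866),  # placeholder
    8: (0.298, 0.858),  # placeholder
}

best_epoch = max(val_history, key=lambda e: sum(val_history[e]) / 2)
print(best_epoch)  # 6
```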
Test-set results (719 reviews / 1,023 entities)
Overall
| Metric | Value |
|---|---|
| ATE entity-level precision | 0.2064 |
| ATE entity-level recall | 0.7979 |
| ATE entity-level F1 | 0.3280 |
| ATE token-level F1 (micro) | 0.3207 |
| ATE token-level F1 (macro) | 0.3128 |
| Sentiment accuracy | 0.8809 |
| Sentiment macro F1 | 0.8760 |
| Sentiment F1 (negative) | 0.9007 |
| Sentiment recall (negative) | 0.9280 |
Sentiment per class (at B-token positions)
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| positive | 0.8901 | 0.8157 | 0.8513 | 407 |
| negative | 0.8752 | 0.9277 | 0.9007 | 567 |
| macro | 0.8826 | 0.8717 | 0.8760 | 974 |
| weighted | 0.8814 | 0.8809 | 0.8804 | 974 |
Entity-level F1 per aspect
| Aspect | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| delivery | 0.4291 | 0.281 | 0.906 | 127 |
| packaging | 0.3698 | 0.235 | 0.873 | 118 |
| price | 0.3203 | 0.204 | 0.752 | 109 |
| appearance | 0.3074 | 0.192 | 0.779 | 312 |
| product_quality | 0.2974 | 0.184 | 0.777 | 229 |
| customer_service | 0.2363 | 0.139 | 0.782 | 55 |
| usability | 0.2295 | 0.152 | 0.467 | 15 |
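The per-aspect F1 column is consistent with the harmonic mean of the (rounded) precision and recall columns, which can be checked directly; e.g. the `delivery` row:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.281, 0.906), 3))  # delivery row → 0.429
```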
Operating point — deliberate recall-first
This model is tuned for high recall (R ≈ 0.80) at the expense of span-boundary precision (P ≈ 0.21). This is intentional, not a tuning failure. The model lives inside a pipeline whose downstream aggregator ranks aspect_term surface forms by frequency within each (aspect_category, sentiment) cell:
- Spurious false-positive spans tend to appear once and get filtered out of the top-K ranking.
- Genuine complaints repeat across reviews and dominate the ranking.
- The two fields that must be precise, `aspect_category` and `sentiment`, are exactly the two the model is strongest on (F1_sent = 0.876, sentiment accuracy 0.881).
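The frequency-ranking aggregation described above can be sketched as follows. Function and field names here are illustrative, not the repository's actual API.

```python
from collections import Counter, defaultdict

def rank_terms(tuples, top_k=5):
    """Rank aspect_term surface forms by frequency within each
    (aspect_category, sentiment) cell. One-off false-positive spans
    naturally fall out of the top-k; repeated complaints dominate."""
    cells = defaultdict(Counter)
    for category, sentiment, term in tuples:
        cells[(category, sentiment)][term.lower()] += 1
    return {cell: counts.most_common(top_k) for cell, counts in cells.items()}

preds = [
    ("delivery", "negative", "giao chậm"),
    ("delivery", "negative", "giao chậm"),
    ("delivery", "negative", "giao chậm trễ hẹn"),  # spurious one-off span
]
print(rank_terms(preds, top_k=1))
```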
If you need higher span precision (e.g. for building another dataset), set min_confidence ≥ 0.7 in the inference pipeline:
| min_conf | Tuples kept | Precision | Recall | F1 |
|---|---|---|---|---|
| 0.0 | 100% | 0.206 | 0.798 | 0.328 |
| 0.5 | 94% | 0.218 | 0.751 | 0.338 |
| 0.7 | 78% | 0.262 | 0.659 | 0.375 |
| 0.8 | 62% | 0.305 | 0.561 | 0.395 |
| 0.9 | 41% | 0.371 | 0.412 | 0.391 |
F1 peaks around min_conf = 0.8 (0.395). For batch dashboards we keep the default (0.0) so no genuine complaint is missed.
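Assuming each predicted tuple carries a `confidence` score, as in the quick-start output, the threshold is a simple filter; the dict keys below are illustrative, not the pipeline's actual schema.

```python
def apply_threshold(tuples, min_confidence=0.0):
    # Drop predicted tuples below the confidence threshold.
    # Trades recall for span precision; F1 peaked near 0.8 in the table above.
    return [t for t in tuples if t["confidence"] >= min_confidence]

preds = [{"term": "Hàng", "confidence": 0.953},
         {"term": "hộp", "confidence": 0.62}]
print(len(apply_threshold(preds, min_confidence=0.7)))  # 1
```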
Quick start
Option 1 — Via the CausaSent inference pipeline (recommended)
```python
from src.inference.pipeline import CausaSentPipeline

pipe = CausaSentPipeline(phobert_ckpt="best.pt")  # or HF cache path
tuples = pipe("Hàng giao chậm 5 ngày, hộp móp méo, nhân viên trả lời cộc lốc.")
for t in tuples:
    print(f"{t.aspect_category:18s} {t.sentiment:8s} {t.aspect_term!r:20s} conf={t.confidence:.3f}")
```
Expected output:
```
delivery           negative 'Hàng'               conf=0.953
packaging          negative 'hộp'                conf=0.942
customer_service   negative 'nhân viên'          conf=0.918
```
Option 2 — Loading state dict directly
```python
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="Tamir39/causasent-phobert-ate", filename="best.pt")
state = torch.load(ckpt_path, map_location="cpu")
# state contains: {"model": OrderedDict[...], "epoch": 6, "config": {...}, ...}
# Load into the two-head model defined in src/training/model.py.
```
⚠ The checkpoint is a full `torch.save` dump (not `safetensors`). Load with `strict=False`, because training also persists class-weight tensors that eval-time models do not reconstruct.
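A hypothetical sketch of the two-head wrapper such a state dict would load into. The real class lives in `src/training/model.py`; the names, shapes, and forward signature here are assumptions based on the architecture diagram.

```python
import torch
import torch.nn as nn

class TwoHeadTagger(nn.Module):
    """Shared encoder feeding a 15-way BIO head and a binary sentiment head."""
    def __init__(self, encoder, hidden=1024, n_bio=15, n_sent=2):
        super().__init__()
        self.encoder = encoder            # e.g. AutoModel.from_pretrained("vinai/phobert-large")
        self.ate_head = nn.Linear(hidden, n_bio)    # Head A: BIO tags
        self.sent_head = nn.Linear(hidden, n_sent)  # Head B: sentiment

    def forward(self, input_ids, attention_mask=None):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.ate_head(h), self.sent_head(h)

# model = TwoHeadTagger(AutoModel.from_pretrained("vinai/phobert-large"))
# model.load_state_dict(state["model"], strict=False)  # tolerate class-weight extras
```

With `strict=False`, `load_state_dict` returns the lists of missing and unexpected keys; it is worth inspecting them to confirm only the class-weight tensors were skipped.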
Option 3 — End-to-end aggregator + LLM action recommendation
```bash
# Aggregate a batch of reviews → priority-ranked Vietnamese imperative actions
python -m src.inference.batch \
    --input reviews.csv \
    --review-col review \
    --ckpt best.pt \
    --output analysis.json \
    --top-k 5
```
The repository ships a FastAPI service (REST + SSE streaming) and a Next.js 14 PWA dashboard for interactive use.
Files
| File | Description |
|---|---|
| `best.pt` | Selected checkpoint — epoch 6, val mean = 0.6029. Use this. |
| `last.pt` | Final-epoch state (epoch 8). Provided for completeness. |
Both are full torch.save({"model": state_dict, ...}) dumps.
Reproducibility
- Training script: `src/training/train.py` in the GitHub repo.
- Dataset preparation: `scripts/build_ate_dataset.py` → outputs `data/processed/ate/{train,val,test}.json`.
- Evaluation: `src/training/eval.py` (uses `seqeval` for entity-level metrics, `sklearn.metrics` for sentiment).
- All seeds fixed to 42. Reported metrics are deterministic; re-running on the same hardware reproduces every digit.
Intended use & limitations
Intended: Vietnamese e-commerce review analytics, ABSA research baselines, building seller-facing dashboards that turn unstructured complaint text into prioritized actions.
Limitations:
- Taxonomy is closed and fixed to 7 categories — extending requires schema change + re-train.
- Trained mostly on Tiki product reviews (84% of training data) + hotel/restaurant gold (16%). Out-of-domain reviews (food delivery, travel, mobile apps) may underperform.
- Sentiment is binary; ambivalent reviews ("không tệ", "tạm được") map to one of two labels with no calibration uncertainty.
- Silver-pool labels come from a single LLM (Claude 3.5 Sonnet) — systematic errors of that labeler can propagate.
- Validated for Vietnamese e-commerce text only. Do not use for medical, legal, or safety-critical decisions.
Ethical considerations
- The Tiki silver pool was collected from publicly displayed product reviews; only text and derived annotations are redistributed.
- The model can be used to filter / down-rank seller listings based on aggregated negative signal — please use such applications with transparency to both sellers and shoppers.
- No PII removal pipeline was applied beyond what the upstream datasets did — reviews are short and product-focused, so PII risk is low, but consumers of this model should audit their own data.
Citation
@misc{causasent-phobert-2026,
title = {CausaSent: Joint Aspect Term Extraction and Sentiment Classification
for Vietnamese E-commerce Reviews with Aggregated Action Recommendations},
author = {Phí Vương Tường Tâm},
year = {2026},
howpublished = {\url{https://huggingface.co/Tamir39/causasent-phobert-ate}}
}
License
CC-BY-SA-4.0. Built on:
- `vinai/phobert-large` — MIT (Nguyen & Nguyen 2020).
- VnCoreNLP — GPL-3.0 (Vu et al. 2018); used only at preprocessing time, not redistributed with this model.
- CausaSent ATE v2 dataset — CC-BY-SA-4.0.
Derivative works must share-alike (CC-BY-SA-4.0). Respect upstream licenses when redistributing.
Acknowledgements
- VinAI Research for `vinai/phobert-large` and VnCoreNLP.
- The CausaSent ATE v2 dataset contributors, including the midterm-project team who provided the Tiki review labels that bootstrapped the silver pool.
- Anthropic Claude 3.5 Sonnet — used for silver-pool aspect-term span extraction and inter-annotator validation.
Model tree for Tamir39/causasent-phobert-ate
- Base model: `vinai/phobert-large`
- Dataset used to train: `Tamir39/causasent-ate-v2`