CausaSent PhoBERT — Vietnamese Joint ATE + ABSA Tagger

Two-head fine-tune of PhoBERT-large for Vietnamese e-commerce reviews:

  1. Head A — Aspect Term Extraction over the closed set of 7 aspect categories, via 15-way BIO tagging (a B and an I tag per category, plus O).
  2. Head B — binary sentiment (positive / negative) predicted at every B-token position.

Trained with an auxiliary supervised contrastive loss on B-token encoder representations grouped by aspect category (λ = 0.1). Backbone is vinai/phobert-large (Vietnamese RoBERTa, 370M params). Trained on CausaSent ATE v2 — 7,066 reviews / 10,307 annotations.

📦 GitHub: https://github.com/tamir39/CausaSent · 📊 Dataset: Tamir39/causasent-ate-v2


Closed taxonomy (7 aspect categories)

delivery · packaging · product_quality · price · customer_service · usability · appearance

Sentiment is binary: positive / negative. Neutral was dropped from upstream gold annotations (under 3% of records, insufficient for stable 3-class training).
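The 15-way tag set of Head A follows directly from this taxonomy: one B- and one I- tag per category, plus O. A minimal sketch (label ordering is illustrative; the repo may enumerate them differently):

```python
# Build the 15-way BIO tag set from the 7 closed aspect categories.
CATEGORIES = [
    "delivery", "packaging", "product_quality", "price",
    "customer_service", "usability", "appearance",
]

# One B- and one I- tag per category, plus the O (outside) tag -> 15 labels.
LABELS = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in ("B", "I")]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

print(len(LABELS))   # 15
print(LABELS[:3])    # ['O', 'B-delivery', 'I-delivery']
```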

Architecture

Vietnamese review (raw)
        │
        ▼  VnCoreNLP word-segmenter  ─►  PhoBERT BPE tokenizer
        │
        ▼
PhoBERT-large encoder  (24 layers, hidden=1024, BPE 64k)
        │
        ├──►  Linear(1024 → 15)   ─►  Head A — BIO over 7 aspect categories
        │
        └──►  Linear(1024 → 2)    ─►  Head B — sentiment at B-token positions
                                       (other positions masked with IGNORE_INDEX)
                                       │
        ▼                              ▼
  Supervised contrastive aux. loss on B-token reps, grouped by aspect_category (λ=0.1)

Total loss: L = L_ATE + λ_s · L_sent + λ_c · L_con with λ_s = 1.0, λ_c = 0.1.
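The two heads and the masked loss combination can be sketched in PyTorch as follows. This is illustrative only: the encoder is stubbed with a random tensor, class weights are omitted, and the SupCon term λ_c·L_con is elided; the real model lives in src/training/model.py.

```python
import torch
import torch.nn as nn

IGNORE_INDEX = -100  # non-B positions in the sentiment labels carry this value

class TwoHeadTagger(nn.Module):
    """Sketch of the two heads on top of a shared encoder output."""
    def __init__(self, hidden=1024, n_bio=15, n_sent=2):
        super().__init__()
        self.head_ate = nn.Linear(hidden, n_bio)    # Head A: 15-way BIO
        self.head_sent = nn.Linear(hidden, n_sent)  # Head B: binary sentiment

    def forward(self, enc):                         # enc: (batch, seq, hidden)
        return self.head_ate(enc), self.head_sent(enc)

def total_loss(ate_logits, sent_logits, ate_labels, sent_labels, lambda_s=1.0):
    # CE ignores masked positions, so Head B only trains on B-tokens.
    ce = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)
    l_ate = ce(ate_logits.flatten(0, 1), ate_labels.flatten())
    l_sent = ce(sent_logits.flatten(0, 1), sent_labels.flatten())
    return l_ate + lambda_s * l_sent   # + lambda_c * l_supcon during training

# Toy forward pass on a stand-in for the PhoBERT encoder output.
enc = torch.randn(2, 8, 1024)
model = TwoHeadTagger()
ate_logits, sent_logits = model(enc)
ate_labels = torch.randint(0, 15, (2, 8))
sent_labels = torch.full((2, 8), IGNORE_INDEX)
sent_labels[:, 0] = 1                  # pretend position 0 is a B-token
loss = total_loss(ate_logits, sent_logits, ate_labels, sent_labels)
```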

Training setup

Hyperparameter Value
Backbone vinai/phobert-large (370M params)
Word segmenter VnCoreNLP-1.1.1 (annotators=wseg)
Tokenizer PhoBERT BPE (vocab 64k)
Optimizer AdamW, lr 2e-5, weight_decay 0.01
Batch size 16
Max sequence length 128 tokens
Epochs 8 (best at epoch 6)
LR schedule 10% linear warmup → linear decay
Loss Class-weighted CE (both heads) + SupCon (λ=0.1)
Precision fp16
Hardware Tesla P100 16GB (Kaggle Notebook)
Wall-clock ≈ 50 minutes
Seed 42
Model selection argmax_epoch mean(F1_ATE_entity, F1_sent_macro)

Best checkpoint reached a validation mean of 0.6029 at epoch 6 (F1_ATE = 0.328, F1_sent = 0.874). Epochs 7–8 scored lower due to mild cross-entropy overfitting on the minority classes.

Test-set results (719 reviews / 1,023 entities)

Overall

Metric Value
ATE entity-level precision 0.2064
ATE entity-level recall 0.7979
ATE entity-level F1 0.3280
ATE token-level F1 (micro) 0.3207
ATE token-level F1 (macro) 0.3128
Sentiment accuracy 0.8809
Sentiment macro F1 0.8760
Sentiment F1 (negative) 0.9007
Sentiment recall (negative) 0.9277

Sentiment per class (at B-token positions)

Class Precision Recall F1 Support
positive 0.8901 0.8157 0.8513 407
negative 0.8752 0.9277 0.9007 567
macro 0.8826 0.8717 0.8760 974
weighted 0.8814 0.8809 0.8804 974

Entity-level F1 per aspect

Aspect F1 Precision Recall Support
delivery 0.4291 0.281 0.906 127
packaging 0.3698 0.235 0.873 118
price 0.3203 0.204 0.752 109
appearance 0.3074 0.192 0.779 312
product_quality 0.2974 0.184 0.777 229
customer_service 0.2363 0.139 0.782 55
usability 0.2295 0.152 0.467 15

Operating point — deliberate recall-first

This model is tuned for high recall (R ≈ 0.80) at the expense of span-boundary precision (P ≈ 0.21). This is intentional, not a tuning failure. The model lives inside a pipeline whose downstream aggregator ranks aspect_term surface forms by frequency within each (aspect_category, sentiment) cell:

  • Spurious false-positive spans tend to appear once and get filtered out of the top-K ranking.
  • Genuine complaints repeat across reviews and dominate the ranking.
  • The two pieces that must be precise — aspect_category and sentiment — are exactly the two the model is strongest on (F1_sent = 0.876, sentiment accuracy 0.881).
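The ranking step behind this argument can be sketched as a frequency count per (aspect_category, sentiment) cell. A simplified stand-in for the real aggregator; the dict fields mirror the pipeline's tuple output:

```python
from collections import Counter, defaultdict

def rank_terms(tuples, top_k=5):
    """Rank aspect_term surface forms by frequency within each
    (aspect_category, sentiment) cell; singleton false positives
    naturally fall out once top_k trims the tail."""
    cells = defaultdict(Counter)
    for t in tuples:
        cells[(t["aspect_category"], t["sentiment"])][t["aspect_term"].lower()] += 1
    return {cell: counter.most_common(top_k) for cell, counter in cells.items()}

preds = [
    {"aspect_category": "delivery", "sentiment": "negative", "aspect_term": "giao hàng"},
    {"aspect_category": "delivery", "sentiment": "negative", "aspect_term": "giao hàng"},
    {"aspect_category": "delivery", "sentiment": "negative", "aspect_term": "shipper"},
    # a spurious one-off span, dropped once the top-k cutoff bites
    {"aspect_category": "delivery", "sentiment": "negative", "aspect_term": "hộp"},
]
ranking = rank_terms(preds, top_k=2)
print(ranking[("delivery", "negative")])  # [('giao hàng', 2), ...]
```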

If you need higher span precision (e.g. for building another dataset), set min_confidence ≥ 0.7 in the inference pipeline:

min_conf Tuples kept Precision Recall F1
0.0 100% 0.206 0.798 0.328
0.5 94% 0.218 0.751 0.338
0.7 78% 0.262 0.659 0.375
0.8 62% 0.305 0.561 0.395
0.9 41% 0.371 0.412 0.391

F1 peaks around min_conf = 0.8 (0.395). For batch dashboards we keep the default (0.0) so no genuine complaint is missed.
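Thresholding itself is just a filter over the pipeline's extracted tuples. An illustrative sketch; the confidence field matches the quick-start example:

```python
def filter_by_confidence(tuples, min_confidence=0.0):
    """Keep only extracted tuples whose span confidence clears the threshold."""
    return [t for t in tuples if t["confidence"] >= min_confidence]

preds = [
    {"aspect_term": "Hàng", "confidence": 0.953},
    {"aspect_term": "hộp", "confidence": 0.942},
    {"aspect_term": "màu", "confidence": 0.41},   # low-confidence span
]
kept = filter_by_confidence(preds, min_confidence=0.7)
print([t["aspect_term"] for t in kept])  # ['Hàng', 'hộp']
```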

Quick start

Option 1 — Via the CausaSent inference pipeline (recommended)

from src.inference.pipeline import CausaSentPipeline

pipe = CausaSentPipeline(phobert_ckpt="best.pt")  # or HF cache path
tuples = pipe("Hàng giao chậm 5 ngày, hộp móp méo, nhân viên trả lời cộc lốc.")  # "Delivered 5 days late, box dented, staff replied curtly."
for t in tuples:
    print(f"{t.aspect_category:18s} {t.sentiment:8s} {t.aspect_term!r:20s} conf={t.confidence:.3f}")

Expected output:

delivery           negative 'Hàng'               conf=0.953
packaging          negative 'hộp'                conf=0.942
customer_service   negative 'nhân viên'          conf=0.918

Option 2 — Loading state dict directly

import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="Tamir39/causasent-phobert-ate", filename="best.pt")
state = torch.load(ckpt_path, map_location="cpu")
# state contains: {"model": OrderedDict[...], "epoch": 6, "config": {...}, ...}
# Load into the two-head model defined in src/training/model.py.

⚠ The checkpoint is a full torch.save dump (not safetensors). Load with strict=False because training also persists class-weight tensors that eval-time models do not reconstruct.
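Why strict=False matters, in miniature. This is a self-contained toy, not the real checkpoint: the extra class_weights entry plays the role of the persisted class-weight tensors.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
state = dict(model.state_dict())
state["class_weights"] = torch.ones(2)   # extra tensor the eval model lacks

# strict=False reports the mismatch instead of raising a RuntimeError.
missing, unexpected = model.load_state_dict(state, strict=False)
print(unexpected)   # ['class_weights']
```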

Option 3 — End-to-end aggregator + LLM action recommendation

# Aggregate batch of reviews → priority-ranked Vietnamese imperative actions
python -m src.inference.batch \
  --input reviews.csv \
  --review-col review \
  --ckpt best.pt \
  --output analysis.json \
  --top-k 5

The repository ships a FastAPI service (REST + SSE streaming) and a Next.js 14 PWA dashboard for interactive use.

Files

File Description
best.pt Selected checkpoint — epoch 6, val mean = 0.6029. Use this.
last.pt Final-epoch state (epoch 8). Provided for completeness.

Both are full torch.save({"model": state_dict, ...}) dumps.

Reproducibility

  • Training script: src/training/train.py in the GitHub repo.
  • Dataset preparation: scripts/build_ate_dataset.py → outputs data/processed/ate/{train,val,test}.json.
  • Evaluation: src/training/eval.py (uses seqeval for entity-level, sklearn.metrics for sentiment).
  • All seeds fixed to 42. Reported metrics are deterministic — re-running on the same hardware reproduces every digit.
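Entity-level scoring counts a prediction as correct only when category and exact span boundaries both match. A hand-rolled sketch of the metric that seqeval computes, under that same convention:

```python
def bio_entities(tags):
    """Extract (category, start, end) entities from a BIO tag sequence."""
    entities, start, cat = set(), None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel flushes the last span
        if start is not None and tag != f"I-{cat}":
            entities.add((cat, start, i))
            start, cat = None, None
        if tag.startswith("B-"):
            start, cat = i, tag[2:]
    return entities

def entity_f1(gold_tags, pred_tags):
    gold, pred = bio_entities(gold_tags), bio_entities(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ["B-delivery", "I-delivery", "O", "B-price"]
pred = ["B-delivery", "I-delivery", "O", "O"]
print(entity_f1(gold, pred))   # one of the two gold entities matched exactly
```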

Intended use & limitations

Intended: Vietnamese e-commerce review analytics, ABSA research baselines, building seller-facing dashboards that turn unstructured complaint text into prioritized actions.

Limitations:

  • Taxonomy is closed and fixed to 7 categories — extending requires schema change + re-train.
  • Trained mostly on Tiki product reviews (84% of training data) + hotel/restaurant gold (16%). Out-of-domain reviews (food delivery, travel, mobile apps) may underperform.
  • Sentiment is binary; ambivalent reviews ("không tệ" – "not bad", "tạm được" – "passable") map to one of the two labels with no calibrated uncertainty estimate.
  • Silver-pool labels come from a single LLM (Claude 3.5 Sonnet) — systematic errors of that labeler can propagate.
  • Validated for Vietnamese e-commerce text only. Do not use for medical, legal, or safety-critical decisions.

Ethical considerations

  • The Tiki silver pool was collected from publicly displayed product reviews; only text and derived annotations are redistributed.
  • The model can be used to filter / down-rank seller listings based on aggregated negative signal — please use such applications with transparency to both sellers and shoppers.
  • No PII removal pipeline was applied beyond what the upstream datasets did — reviews are short and product-focused, so PII risk is low, but consumers of this model should audit their own data.

Citation

@misc{causasent-phobert-2026,
  title  = {CausaSent: Joint Aspect Term Extraction and Sentiment Classification
            for Vietnamese E-commerce Reviews with Aggregated Action Recommendations},
  author = {Phí Vương Tường Tâm},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Tamir39/causasent-phobert-ate}}
}

License

CC-BY-SA-4.0. Built on:

  • vinai/phobert-large — MIT (Nguyen & Nguyen 2020).
  • VnCoreNLP — GPL-3.0 (Vu et al. 2018) — used only at preprocessing time, not redistributed with this model.
  • CausaSent ATE v2 dataset — CC-BY-SA-4.0.

Derivative works must share-alike (CC-BY-SA-4.0). Respect upstream licenses when redistributing.

Acknowledgements

  • VinAI Research for vinai/phobert-large and VnCoreNLP.
  • The CausaSent ATE v2 dataset contributors — including the midterm-project team who provided the Tiki review labels that bootstrapped the silver pool.
  • Anthropic Claude 3.5 Sonnet — used for silver-pool aspect-term span extraction and inter-annotator validation.