CausaSent PhoBERT — Vietnamese Joint ATE + ABSA Tagger

Two-head fine-tune of PhoBERT-large for Vietnamese e-commerce reviews:

  1. Head A — Aspect Term Extraction over the closed set of 7 aspect categories, via 15-way BIO tagging (a B and an I tag per category, plus O).
  2. Head B — binary sentiment (positive / negative) predicted at every B-token position.

Trained with an auxiliary supervised contrastive loss on B-token encoder representations grouped by aspect category (λ = 0.1). Backbone is vinai/phobert-large (Vietnamese RoBERTa, 370M params). Trained on CausaSent ATE v2 — 7,066 reviews / 10,307 annotations.

📦 GitHub: https://github.com/tamir39/CausaSent · 📊 Dataset: Tamir39/causasent-ate-v2


Closed taxonomy (7 aspect categories)

delivery · packaging · product_quality · price · customer_service · usability · appearance

Sentiment is binary: positive / negative. Neutral was dropped from upstream gold annotations (under 3% of records, insufficient for stable 3-class training).
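The 15-way tag set of Head A follows directly from this taxonomy: one B- and one I- tag per category, plus O. A minimal sketch (label ordering is illustrative; the repo may enumerate them differently):

```python
# Build the 15-way BIO tag set from the 7 closed aspect categories.
CATEGORIES = [
    "delivery", "packaging", "product_quality", "price",
    "customer_service", "usability", "appearance",
]

# One B- and one I- tag per category, plus the O (outside) tag -> 15 labels.
LABELS = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in ("B", "I")]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

print(len(LABELS))   # 15
print(LABELS[:3])    # ['O', 'B-delivery', 'I-delivery']
```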

Architecture

Vietnamese review (raw)
        │
        ▼  VnCoreNLP word-segmenter  ─►  PhoBERT BPE tokenizer
        │
        ▼
PhoBERT-large encoder  (24 layers, hidden=1024, BPE 64k)
        │
        ├──►  Linear(1024 → 15)   ─►  Head A — BIO over 7 aspect categories
        │
        └──►  Linear(1024 → 2)    ─►  Head B — sentiment at B-token positions
                                       (other positions masked with IGNORE_INDEX)
                                       │
        ▼                              ▼
  Supervised contrastive aux. loss on B-token reps, grouped by aspect_category (λ=0.1)

Total loss: L = L_ATE + λ_s · L_sent + λ_c · L_con with λ_s = 1.0, λ_c = 0.1.
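The two heads and the masked loss combination can be sketched in PyTorch as follows. This is illustrative only: the encoder is stubbed with a random tensor, class weights are omitted, and the SupCon term λ_c·L_con is elided; the real model lives in src/training/model.py.

```python
import torch
import torch.nn as nn

IGNORE_INDEX = -100  # non-B positions in the sentiment labels carry this value

class TwoHeadTagger(nn.Module):
    """Sketch of the two heads on top of a shared encoder output."""
    def __init__(self, hidden=1024, n_bio=15, n_sent=2):
        super().__init__()
        self.head_ate = nn.Linear(hidden, n_bio)    # Head A: 15-way BIO
        self.head_sent = nn.Linear(hidden, n_sent)  # Head B: binary sentiment

    def forward(self, enc):                         # enc: (batch, seq, hidden)
        return self.head_ate(enc), self.head_sent(enc)

def total_loss(ate_logits, sent_logits, ate_labels, sent_labels, lambda_s=1.0):
    # CE ignores masked positions, so Head B only trains on B-tokens.
    ce = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)
    l_ate = ce(ate_logits.flatten(0, 1), ate_labels.flatten())
    l_sent = ce(sent_logits.flatten(0, 1), sent_labels.flatten())
    return l_ate + lambda_s * l_sent   # + lambda_c * l_supcon during training

# Toy forward pass on a stand-in for the PhoBERT encoder output.
enc = torch.randn(2, 8, 1024)
model = TwoHeadTagger()
ate_logits, sent_logits = model(enc)
ate_labels = torch.randint(0, 15, (2, 8))
sent_labels = torch.full((2, 8), IGNORE_INDEX)
sent_labels[:, 0] = 1                  # pretend position 0 is a B-token
loss = total_loss(ate_logits, sent_logits, ate_labels, sent_labels)
```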

Training setup

Hyperparameter Value
Backbone vinai/phobert-large (370M params)
Word segmenter VnCoreNLP-1.1.1 (annotators=wseg)
Tokenizer PhoBERT BPE (vocab 64k)
Optimizer AdamW, lr 2e-5, weight_decay 0.01
Batch size 16
Max sequence length 128 tokens
Epochs 8 (best at epoch 6)
LR schedule 10% linear warmup → linear decay
Loss Class-weighted CE (both heads) + SupCon (λ=0.1)
Precision fp16
Hardware Tesla P100 16GB (Kaggle Notebook)
Wall-clock ≈ 50 minutes
Seed 42
Model selection argmax_epoch mean(F1_ATE_entity, F1_sent_macro)

Best checkpoint reached a validation mean of 0.6029 at epoch 6 (F1_ATE = 0.328, F1_sent = 0.874). Epochs 7–8 scored lower due to mild cross-entropy overfitting on the minority classes.

Test-set results (719 reviews / 1,023 entities)

Overall

Metric Value
ATE entity-level precision 0.2064
ATE entity-level recall 0.7979
ATE entity-level F1 0.3280
ATE token-level F1 (micro) 0.3207
ATE token-level F1 (macro) 0.3128
Sentiment accuracy 0.8809
Sentiment macro F1 0.8760
Sentiment F1 (negative) 0.9007
Sentiment recall (negative) 0.9277

Sentiment per class (at B-token positions)

Class Precision Recall F1 Support
positive 0.8901 0.8157 0.8513 407
negative 0.8752 0.9277 0.9007 567
macro 0.8826 0.8717 0.8760 974
weighted 0.8814 0.8809 0.8804 974

Entity-level F1 per aspect

Aspect F1 Precision Recall Support
delivery 0.4291 0.281 0.906 127
packaging 0.3698 0.235 0.873 118
price 0.3203 0.204 0.752 109
appearance 0.3074 0.192 0.779 312
product_quality 0.2974 0.184 0.777 229
customer_service 0.2363 0.139 0.782 55
usability 0.2295 0.152 0.467 15

Operating point — deliberate recall-first

This model is tuned for high recall (R ≈ 0.80) at the expense of span-boundary precision (P ≈ 0.21). This is intentional, not a tuning failure. The model lives inside a pipeline whose downstream aggregator ranks aspect_term surface forms by frequency within each (aspect_category, sentiment) cell:

  • Spurious false-positive spans tend to appear once and get filtered out of the top-K ranking.
  • Genuine complaints repeat across reviews and dominate the ranking.
  • The two pieces that must be precise — aspect_category and sentiment — are exactly the two the model is strongest on (F1_sent = 0.876, sentiment accuracy 0.881).
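The ranking step behind this argument can be sketched as a frequency count per (aspect_category, sentiment) cell. A simplified stand-in for the real aggregator; the dict fields mirror the pipeline's tuple output:

```python
from collections import Counter, defaultdict

def rank_terms(tuples, top_k=5):
    """Rank aspect_term surface forms by frequency within each
    (aspect_category, sentiment) cell; singleton false positives
    naturally fall out once top_k trims the tail."""
    cells = defaultdict(Counter)
    for t in tuples:
        cells[(t["aspect_category"], t["sentiment"])][t["aspect_term"].lower()] += 1
    return {cell: counter.most_common(top_k) for cell, counter in cells.items()}

preds = [
    {"aspect_category": "delivery", "sentiment": "negative", "aspect_term": "giao hàng"},
    {"aspect_category": "delivery", "sentiment": "negative", "aspect_term": "giao hàng"},
    {"aspect_category": "delivery", "sentiment": "negative", "aspect_term": "shipper"},
    # a spurious one-off span, dropped once the top-k cutoff bites
    {"aspect_category": "delivery", "sentiment": "negative", "aspect_term": "hộp"},
]
ranking = rank_terms(preds, top_k=2)
print(ranking[("delivery", "negative")])  # [('giao hàng', 2), ...]
```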

If you need higher span precision (e.g. for building another dataset), set min_confidence ≥ 0.7 in the inference pipeline:

min_conf Tuples kept Precision Recall F1
0.0 100% 0.206 0.798 0.328
0.5 94% 0.218 0.751 0.338
0.7 78% 0.262 0.659 0.375
0.8 62% 0.305 0.561 0.395
0.9 41% 0.371 0.412 0.391

F1 peaks around min_conf = 0.8 (0.395). For batch dashboards we keep the default (0.0) so no genuine complaint is missed.
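Thresholding itself is just a filter over the pipeline's extracted tuples. An illustrative sketch; the confidence field matches the quick-start example:

```python
def filter_by_confidence(tuples, min_confidence=0.0):
    """Keep only extracted tuples whose span confidence clears the threshold."""
    return [t for t in tuples if t["confidence"] >= min_confidence]

preds = [
    {"aspect_term": "Hàng", "confidence": 0.953},
    {"aspect_term": "hộp", "confidence": 0.942},
    {"aspect_term": "màu", "confidence": 0.41},   # low-confidence span
]
kept = filter_by_confidence(preds, min_confidence=0.7)
print([t["aspect_term"] for t in kept])  # ['Hàng', 'hộp']
```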

Quick start

Option 1 — Via the CausaSent inference pipeline (recommended)

from src.inference.pipeline import CausaSentPipeline

pipe = CausaSentPipeline(phobert_ckpt="best.pt")  # or HF cache path
tuples = pipe("Hàng giao chậm 5 ngày, hộp móp méo, nhân viên trả lời cộc lốc.")  # "Delivered 5 days late, box dented, staff replied curtly."
for t in tuples:
    print(f"{t.aspect_category:18s} {t.sentiment:8s} {t.aspect_term!r:20s} conf={t.confidence:.3f}")

Expected output:

delivery           negative 'Hàng'               conf=0.953
packaging          negative 'hộp'                conf=0.942
customer_service   negative 'nhân viên'          conf=0.918

Option 2 — Loading state dict directly

import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="Tamir39/causasent-phobert-ate", filename="best.pt")
state = torch.load(ckpt_path, map_location="cpu")
# state contains: {"model": OrderedDict[...], "epoch": 6, "config": {...}, ...}
# Load into the two-head model defined in src/training/model.py.

⚠ The checkpoint is a full torch.save dump (not safetensors). Load with strict=False because training also persists class-weight tensors that eval-time models do not reconstruct.
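Why strict=False matters, in miniature. This is a self-contained toy, not the real checkpoint: the extra class_weights entry plays the role of the persisted class-weight tensors.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
state = dict(model.state_dict())
state["class_weights"] = torch.ones(2)   # extra tensor the eval model lacks

# strict=False reports the mismatch instead of raising a RuntimeError.
missing, unexpected = model.load_state_dict(state, strict=False)
print(unexpected)   # ['class_weights']
```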

Option 3 — End-to-end aggregator + LLM action recommendation

# Aggregate batch of reviews → priority-ranked Vietnamese imperative actions
python -m src.inference.batch \
  --input reviews.csv \
  --review-col review \
  --ckpt best.pt \
  --output analysis.json \
  --top-k 5

The repository ships a FastAPI service (REST + SSE streaming) and a Next.js 14 PWA dashboard for interactive use.

Files

File Description
best.pt Selected checkpoint — epoch 6, val mean = 0.6029. Use this.
last.pt Final-epoch state (epoch 8). Provided for completeness.

Both are full torch.save({"model": state_dict, ...}) dumps.

Reproducibility

  • Training script: src/training/train.py in the GitHub repo.
  • Dataset preparation: scripts/build_ate_dataset.py → outputs data/processed/ate/{train,val,test}.json.
  • Evaluation: src/training/eval.py (uses seqeval for entity-level, sklearn.metrics for sentiment).
  • All seeds fixed to 42. Reported metrics are deterministic — re-running on the same hardware reproduces every digit.
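Entity-level scoring counts a prediction as correct only when category and exact span boundaries both match. A hand-rolled sketch of the metric that seqeval computes, under that same convention:

```python
def bio_entities(tags):
    """Extract (category, start, end) entities from a BIO tag sequence."""
    entities, start, cat = set(), None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel flushes the last span
        if start is not None and tag != f"I-{cat}":
            entities.add((cat, start, i))
            start, cat = None, None
        if tag.startswith("B-"):
            start, cat = i, tag[2:]
    return entities

def entity_f1(gold_tags, pred_tags):
    gold, pred = bio_entities(gold_tags), bio_entities(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ["B-delivery", "I-delivery", "O", "B-price"]
pred = ["B-delivery", "I-delivery", "O", "O"]
print(entity_f1(gold, pred))   # one of the two gold entities matched exactly
```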

Intended use & limitations

Intended: Vietnamese e-commerce review analytics, ABSA research baselines, building seller-facing dashboards that turn unstructured complaint text into prioritized actions.

Limitations:

  • Taxonomy is closed and fixed to 7 categories — extending requires schema change + re-train.
  • Trained mostly on Tiki product reviews (84% of training data) + hotel/restaurant gold (16%). Out-of-domain reviews (food delivery, travel, mobile apps) may underperform.
  • Sentiment is binary; ambivalent reviews ("không tệ" – "not bad", "tạm được" – "passable") map to one of the two labels with no calibrated uncertainty estimate.
  • Silver-pool labels come from a single LLM (Claude 3.5 Sonnet) — systematic errors of that labeler can propagate.
  • Validated for Vietnamese e-commerce text only. Do not use for medical, legal, or safety-critical decisions.

Ethical considerations

  • The Tiki silver pool was collected from publicly displayed product reviews; only text and derived annotations are redistributed.
  • The model can be used to filter / down-rank seller listings based on aggregated negative signal — please use such applications with transparency to both sellers and shoppers.
  • No PII removal pipeline was applied beyond what the upstream datasets did — reviews are short and product-focused, so PII risk is low, but consumers of this model should audit their own data.

Citation

@misc{causasent-phobert-2026,
  title  = {CausaSent: Joint Aspect Term Extraction and Sentiment Classification
            for Vietnamese E-commerce Reviews with Aggregated Action Recommendations},
  author = {Phí Vương Tường Tâm},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Tamir39/causasent-phobert-ate}}
}

License

CC-BY-SA-4.0. Built on:

  • vinai/phobert-large — MIT (Nguyen & Nguyen 2020).
  • VnCoreNLP — GPL-3.0 (Vu et al. 2018) — used only at preprocessing time, not redistributed with this model.
  • CausaSent ATE v2 dataset — CC-BY-SA-4.0.

Derivative works must share-alike (CC-BY-SA-4.0). Respect upstream licenses when redistributing.

Acknowledgements

  • VinAI Research for vinai/phobert-large and VnCoreNLP.
  • The CausaSent ATE v2 dataset contributors — including the midterm-project team who provided the Tiki review labels that bootstrapped the silver pool.
  • Anthropic Claude 3.5 Sonnet — used for silver-pool aspect-term span extraction and inter-annotator validation.