bert-cyberbullying-bahasa-classifier

A fine-tuned BERT multilingual classifier for detecting cyberbullying in Bahasa Indonesia. This model performs binary classification:

0 → non-bullying
1 → bullying

✅ Model Details

Property	Value
Model Type	BERT (base multilingual)
Task	Cyberbullying Detection (Text Classification)
Language	Bahasa Indonesia
Labels	`0` — non-bullying, `1` — bullying
Framework	Hugging Face Transformers
Files	`model.safetensors`, `config.json`, tokenizer files

📚 Dataset

The model was trained on a combined dataset consisting of:

Indonesian cyberbullying dataset
Additional toxic / abusive comment datasets
Social media–style and chat–style text

Preprocessing steps:

text normalization
emoji removal
punctuation cleanup
lowercasing
label encoding (0 / 1)

The dataset was balanced to reduce bias.
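The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used for training: the card does not specify the normalization rules, so the NFKC normalization, the emoji ranges, and the names `preprocess` and `LABEL2ID` are all assumptions.

```python
import re
import string
import unicodedata

# Assumption: label names follow the card's 0 / 1 mapping.
LABEL2ID = {"non-bullying": 0, "bullying": 1}

# Assumption: these ranges cover common emoji only, not every symbol.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def preprocess(text: str) -> str:
    """Normalize, strip emoji and punctuation, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # text normalization (assumed form)
    text = EMOJI_PATTERN.sub(" ", text)         # emoji removal
    text = text.translate(PUNCT_TABLE)          # punctuation cleanup
    text = text.lower()                         # lowercasing
    return re.sub(r"\s+", " ", text).strip()
```

If a similar cleanup was applied at training time, applying the same function to inputs at inference time should keep the two distributions aligned.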

🧠 Training Information

Base model: bert-base-multilingual-cased
Epochs: 3–5
Batch size: 16
Optimizer: AdamW
Learning rate: 2e-5
Loss: Cross Entropy
Train/Validation split: 80 / 20

Training ran on a 6 GB GPU, with the setup optimized for low VRAM.
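For reference, the hyperparameters above can be collected in a small configuration object, together with an 80 / 20 split helper. This is only a sketch: `TrainConfig`, `train_val_split`, the fixed seed, and the choice of 3 epochs from the stated 3–5 range are illustrative assumptions, and the card does not say whether the original split was shuffled or stratified.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the Training Information list above.
    base_model: str = "bert-base-multilingual-cased"
    epochs: int = 3              # assumption: lower end of the stated 3 to 5 range
    batch_size: int = 16
    learning_rate: float = 2e-5
    optimizer: str = "AdamW"
    loss: str = "cross_entropy"
    val_fraction: float = 0.2    # 80 / 20 train/validation split


def train_val_split(examples, val_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and return (train, validation) lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # assumption: random, non-stratified split
    n_val = int(round(len(items) * val_fraction))
    return items[n_val:], items[:n_val]
```

With 100 examples, `train_val_split` returns 80 training items and 20 validation items.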

✅ How to Use

Python Example

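from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "zeltera/bert-cyberbullying-bahasa-classifier"

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

text = "anjing lu jelek banget"
# Truncate long inputs to the model's maximum sequence length
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# No gradients are needed for inference
with torch.no_grad():
    logits = model(**inputs).logits
    label = torch.argmax(logits, dim=1).item()

print("Prediction:", label)  # 0 = non-bullying, 1 = bullying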
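The example above returns only the predicted class index. To report a confidence alongside the label, the two logits can be passed through a softmax. The sketch below does this in plain Python; the helper names `softmax`, `describe_prediction`, and `ID2LABEL` are illustrative and not part of the published model.

```python
import math

# Label mapping as stated in this card.
ID2LABEL = {0: "non-bullying", 1: "bullying"}


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    label_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[label_id], probs[label_id]
```

With the `logits` tensor from the example above, this could be called as `describe_prediction(logits[0].tolist())`.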

Example Predictions

Text	Output
"mampus lu biarin aja"	1 (bullying)
"kamu lagi dimana?"	0 (non-bullying)
"bodoh banget sih"	1 (bullying)
"nice job bro"	0 (non-bullying)

📈 Evaluation

Metric	Score
Accuracy	~0.90
F1 (macro)	~0.88
Precision	~0.89
Recall	~0.87
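For readers who want to reproduce these metrics on their own labelled data, a minimal pure-Python sketch follows. The card states that F1 is macro-averaged but does not say how precision and recall were averaged, so macro averaging for those two is an assumption here, as is the function name `binary_report`.

```python
def binary_report(y_true, y_pred, labels=(0, 1)):
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append((precision, recall, f1))

    n = len(labels)
    return {
        "accuracy": accuracy,
        # Assumption: macro averaging, i.e. an unweighted mean over classes.
        "precision_macro": sum(p for p, _, _ in per_class) / n,
        "recall_macro": sum(r for _, r, _ in per_class) / n,
        "f1_macro": sum(f for _, _, f in per_class) / n,
    }
```

Here `y_true` and `y_pred` are equal-length lists of 0 / 1 labels, for example gold annotations and the model's argmax predictions on a held-out set.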

🗂️ Repository Contents

config.json
model.safetensors
tokenizer.json
tokenizer_config.json
special_tokens_map.json
vocab.txt
README.md

🔧 Intended Use

AI chatbots (moderation / filtering)
Social media comment analysis
Cyberbullying detection systems
Student safety applications
Research on toxicity detection

⚠️ Limitations

Limited sarcasm detection
May misclassify unseen slang
Works best on Indonesian text
Not suitable for legal or high-risk decisions

📜 License

MIT License

👤 Author

Model trained and published by @zeltera. Built using Hugging Face Transformers + PyTorch. Contact: Instagram @gnwnadiwjy
