This is the BanglaSTEM translation model, presented in the paper BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation.

It is a T5-based model fine-tuned on the BanglaSTEM dataset, which consists of 5,000 carefully selected Bangla-English sentence pairs from STEM fields. It aims to improve translation accuracy for technical content, enabling Bangla speakers to effectively use English-focused language models for technical problem-solving.

BanglaSTEM-T5: Technical Domain Translation Model


🎯 Overview

BanglaSTEM-T5 is a specialized translation model designed to accurately translate technical content between Bangla and English. Unlike general-purpose translation systems that struggle with technical terminology, this model preserves the precise meaning of STEM concepts, making it ideal for:

  • Programming & Software Development - Translate code-related questions and documentation
  • Mathematics - Handle mathematical concepts and problem statements
  • Science - Accurately translate physics, chemistry, and biology content
  • AI & Machine Learning - Work with technical AI/ML terminology

📊 Performance Benchmarks

Our model significantly outperforms existing translation systems on technical content:

Code Generation Task (400 Programming Problems)

| Translation Method | Accuracy |
|---|---|
| Direct Bangla (no translation) | 35.3% |
| BanglaT5-Base | 59.8% |
| Google Translate | 76.5% |
| BanglaSTEM-T5 (Ours) | 82.5% |

Mathematical Problem Solving (100 Olympiad Problems)

| Translation Method | Success Rate |
|---|---|
| Direct Bangla (no translation) | 31.0% |
| BanglaT5-Base | 59.0% |
| Google Translate | 72.0% |
| BanglaSTEM-T5 (Ours) | 79.0% |

Key Improvement: Our model scores 22.7 percentage points higher than BanglaT5-Base on code generation and 20 points higher on math problem solving.
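
These results come from a translate-then-solve workflow: the Bangla problem is first translated into English, then handed to an English-focused model. A minimal sketch of that pipeline is below; the english_solver callable is a hypothetical placeholder for whatever downstream model you use, not part of this repository:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("reyazul/BanglaSTEM-T5")
model = AutoModelForSeq2SeqLM.from_pretrained("reyazul/BanglaSTEM-T5")

def translate(bangla_text):
    """Translate a Bangla problem statement into English."""
    inputs = tokenizer(bangla_text, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_length=256, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def solve(bangla_problem, english_solver):
    """english_solver: any callable mapping an English prompt to an answer,
    e.g. a call into your preferred English-focused code or math LLM."""
    return english_solver(translate(bangla_problem))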

🚀 Quick Start

Installation

pip install transformers torch
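
Note: depending on which tokenizer files the repository ships, loading a T5-family tokenizer may additionally require the SentencePiece library. If AutoTokenizer raises an import error, install it as well:

pip install sentencepiece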

Basic Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("reyazul/BanglaSTEM-T5")
model = AutoModelForSeq2SeqLM.from_pretrained("reyazul/BanglaSTEM-T5")

# Translate Bangla to English
bangla_text = "একটি পাইথন ফাংশন লিখুন যা একটি তালিকার সর্বোচ্চ মান খুঁজে বের করে।"
inputs = tokenizer(bangla_text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
english_translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(english_translation)
# Output: "Write a Python function that finds the maximum value in a list."
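
To translate several sentences at once, batching them through the tokenizer is usually faster than looping. A minimal sketch that reuses the tokenizer and model loaded above:

# Translate a list of Bangla sentences in a single batch
def translate_batch(sentences, max_length=128):
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, max_length=max_length, num_beams=4)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(translate_batch([bangla_text]))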

Custom Generation Parameters

# For longer inputs or more careful translations, widen the beam search
outputs = model.generate(
    **inputs,
    max_length=256,
    num_beams=5,
    early_stopping=True,
    do_sample=False  # deterministic beam search; temperature only applies when do_sample=True
)
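
Generation runs on CPU by default. If a GPU is available, moving the model and the tokenized inputs onto it speeds up inference considerably (a sketch, assuming a CUDA device):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = tokenizer(bangla_text, return_tensors="pt", padding=True).to(device)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))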

📚 Model Details

  • Base Model: csebuetnlp/banglat5_nmt_en_bn
  • Parameters: 247M
  • Training Data: 5,000 high-quality technical sentence pairs
  • Domains Covered:
    • Programming (52%)
    • Mathematics (25.5%)
    • Information Technology (23.7%)
    • Physics (9.8%)
    • Chemistry (7.3%)
    • Biology & Bioinformatics (5.6%)
  • Quality Score: Mean translation accuracy of 4.41/5.0
  • Training Details (a fine-tuning sketch follows this list):
    • Learning rate: 5e-4
    • Batch size: 64 (effective)
    • Epochs: 8
    • Precision: BF16 mixed precision
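
A minimal sketch of how these hyperparameters might map onto Hugging Face's Seq2SeqTrainer. The dataset id, column names, and batch-size split below are assumptions for illustration, not the exact training code; check the BanglaSTEM dataset card for the real schema.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglat5_nmt_en_bn")
model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/banglat5_nmt_en_bn")

# Hypothetical dataset id and column names -- adjust to the published schema
raw = load_dataset("reyazul/BanglaSTEM")

def preprocess(batch):
    model_inputs = tokenizer(batch["bangla"], truncation=True, max_length=256)
    labels = tokenizer(text_target=batch["english"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_train = raw["train"].map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="banglastem-t5",
    learning_rate=5e-4,
    per_device_train_batch_size=16,   # 16 x 4 accumulation steps = 64 effective
    gradient_accumulation_steps=4,    # assumed split of the effective batch
    num_train_epochs=8,
    bf16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()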

🎓 Citation

If you use BanglaSTEM-T5 in your research or applications, please cite our paper:

@article{hasan2025banglastem,
  title={BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation},
  author={Hasan, Kazi Reyazul and Musarrat, Mubasshira and Islam, ABM and Adnan, Muhammad Abdullah},
  journal={arXiv preprint arXiv:2511.03498},
  year={2025}
}


⚠️ Limitations

  • The dataset used for fine-tuning is currently not large-scale (we plan to expand it soon!)
  • The model works best with technical content in STEM domains
  • Performance on non-technical, general conversation may be similar to base models
  • Programming domain is most heavily represented in training data
  • For optimal results, input text should be grammatically correct

📜 License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

🙏 Acknowledgments

This work was supported by the Department of Computer Science and Engineering at Bangladesh University of Engineering and Technology (BUET). We thank all annotators who contributed to the human curation process.


Made with ❤️ for the Bangla NLP community
