---
base_model: ModernBERT-base
language:
- tr
- en
license: apache-2.0
pipeline_tag: feature-extraction
library_name: transformers
tags:
- fill-mask
- turkish
- legal
- turkish-legal
- mecellem
- modernbert
- TRUBA
- MN5
---
# Mursit-Base

[GitHub](https://github.com/newmindai/mecellem-models) · [Mizan Space](https://huggingface.co/spaces/newmindai/Mizan) · [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)

Mursit-Base is a Turkish masked language model (MLM) pre-trained entirely from scratch on Turkish-dominant corpora, as introduced in the paper [Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain](https://huggingface.co/papers/2601.16018).

## Model Description

The model uses the ModernBERT-base architecture (155M parameters) and serves as a foundation model for downstream tasks including text classification, named entity recognition, and feature extraction. Unlike domain-adaptive approaches that continue training from existing checkpoints, it is randomly initialized and trained on a curated dataset that combines Turkish legal text with general web data.

**Key Features:**

- Pre-trained from scratch on approximately 112.7 billion tokens of Turkish-dominant corpus
- Achieves 64.05% average MLM accuracy on Turkish datasets (80-10-10 masking strategy, evaluated at 15% masking rate)
- Serves as the foundation for downstream embedding models (Mursit-Base-TR-Retrieval)
- Custom tokenizer optimized for Turkish morphological structure
- Pre-trained with a 30% masking rate (ModernBERT/RoBERTa approach) but evaluated at a 15% masking rate for fair comparison

- **Model Type:** Masked Language Model (MLM)
- **Parameters:** 155M
- **Base Architecture:** ModernBERT-base
- **Hidden Size:** 768
- **Max Sequence Length:** 1,024 tokens

### Architecture Details

- **Layers:** 22 transformer layers
- **Hidden Size:** 768
- **FFN Size:** 1,152
- **Attention Heads:** 12 heads with 64 dimensions each
- **Activation:** GeGLU (Gated Linear Units with GELU)
- **Normalization:** RMSNorm
- **Position Embeddings:** Rotary positional embeddings (RoPE) with θ=10,000
- **Window Size:** 128 (for sliding window attention in local layers)
- **Vocabulary Size:** 59,008 tokens
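
As a sanity check, the listed dimensions can be combined into a rough parameter estimate. The sketch below is back-of-the-envelope arithmetic and does not read the actual checkpoint. It assumes bias-free linear layers, a fused QKV projection, a GeGLU input projection of width 2 × FFN size, and output embeddings tied to the input embeddings; none of these are stated in this card. Normalization weights and the MLM head are left out.

```python
# Rough parameter estimate from the architecture values listed above.
# Assumptions (not confirmed by the model card): no bias terms, fused QKV,
# GeGLU input projection of width 2 * FFN, tied input/output embeddings.

VOCAB_SIZE = 59_008
HIDDEN = 768
FFN = 1_152
LAYERS = 22
HEADS = 12

head_dim = HIDDEN // HEADS                      # per-head dimension

embedding = VOCAB_SIZE * HIDDEN                 # token embedding matrix
attn = HIDDEN * 3 * HIDDEN + HIDDEN * HIDDEN    # QKV projection + output projection
mlp = HIDDEN * 2 * FFN + FFN * HIDDEN           # GeGLU input (gate + value) + output
per_layer = attn + mlp

total = embedding + LAYERS * per_layer          # norms and MLM head omitted

print(f"head_dim  = {head_dim}")
print(f"per layer = {per_layer / 1e6:.2f}M")
print(f"total     = {total / 1e6:.1f}M")
```

Under these assumptions the estimate lands close to the reported 155M parameters, with the embedding matrix accounting for a bit under a third of the total.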

### Training Details

**Pre-training:**

- **Dataset:** Turkish-dominant corpus totaling approximately 112.7 billion tokens
- **Legal Sources:**
  - Court of Cassation (Yargıtay): 10.3M sequences, ~3.43B tokens
  - Council of State (Danıştay): 151K sequences, ~0.11B tokens
  - Academic theses (YÖKTEZ): 21.1M sequences, ~9.61B tokens (after DocsOCR processing)
- **General Turkish Sources:**
  - FineWeb2: General Turkish web data
  - CulturaX: Multilingual corpus (Turkish subset)
  - Total general Turkish: 212M sequences, ~96.17B tokens
- **Data Processing:** SemHash-based semantic deduplication, FineWeb quality filtering, URL-based filtering, page-packing for YÖKTEZ documents
- **Training Method:** Masked Language Modeling (MLM) with a 30% masking rate during pre-training (the 15% rate is used for evaluation only)
- **Masking Strategy:** 80% [MASK], 10% random token, 10% unchanged (80-10-10 strategy)
- **Framework:** MosaicML Composer with Decoupled StableAdamW optimizer
- **Learning Rate:** 5×10⁻⁴ with warmup_stable_decay schedule
- **Precision:** BF16 mixed precision
- **Hardware Infrastructure:**
  - **System:** MareNostrum 5 ACC partition at Barcelona Supercomputing Center (BSC)
  - **Compute Nodes:** 16 nodes
  - **GPUs:** 64× NVIDIA Hopper H100 64GB GPUs (4 GPUs per node)
  - **Node Configuration:** Each node equipped with 4× H100 GPUs, 80 CPU cores, 512GB DDR5 memory
  - **Interconnect:** 800 Gb/s InfiniBand for distributed training
  - **GPU Interconnect:** NVLink for intra-node GPU communication (4 GPUs per node connected via NVLink)
  - **Distributed Training:** Multi-node distributed training across 16 nodes with InfiniBand interconnect
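
The warmup_stable_decay schedule named above has three phases: ramp up to the peak learning rate, hold it, then decay. The sketch below only illustrates that shape. The peak of 5×10⁻⁴ comes from this card, but the phase fractions (`warmup_frac`, `decay_frac`) and the linear ramps are hypothetical choices, not the training configuration or the Composer implementation.

```python
def wsd_lr(step, total_steps, peak_lr=5e-4, warmup_frac=0.05, decay_frac=0.10):
    """Learning rate at `step` under a warmup-stable-decay schedule.

    warmup_frac and decay_frac are illustrative values; both ramps are
    assumed to be linear.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Phase 1: linear warmup from 0 to the peak
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Phase 2: hold the peak learning rate
        return peak_lr
    # Phase 3: linear decay from the peak down to 0
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


for s in (0, 25, 50, 500, 950, 1000):
    print(s, wsd_lr(s, total_steps=1000))
```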

**MLM Accuracy:** 64.05% (evaluated on Turkish datasets: blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal)
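
The 80-10-10 masking strategy described above can be sketched in plain Python. This is a minimal illustration of the corruption rule, not the training pipeline: `mask_tokens` and its parameters are hypothetical names, and `-100` is used as the ignore label because that is the usual convention for positions excluded from the loss.

```python
import random

IGNORE_INDEX = -100  # label for positions that do not contribute to the loss


def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, special_ids=(), rng=None):
    """Apply 80-10-10 MLM corruption and return (inputs, labels)."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)

    for i, tok in enumerate(token_ids):
        # Skip special tokens; select the rest with probability mlm_prob
        if tok in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = tok                      # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id              # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged

    return inputs, labels


ids = list(range(100, 120))
corrupted, targets = mask_tokens(ids, mask_id=4, vocab_size=59_008, mlm_prob=0.30,
                                 rng=random.Random(0))
print(corrupted)
print(targets)
```

The same function covers both masking rates mentioned in this card by changing `mlm_prob`. The mask token id `4` in the usage lines is a placeholder; in practice it would come from `tokenizer.mask_token_id`.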

### MLM Accuracy Scores (80-10-10 Strategy) on Turkish Datasets

The following table presents MLM accuracy under the 80-10-10 masking strategy, averaged over the evaluation datasets, for our pre-trained models and baseline MLM models. *This model's row is shown in italics.*

| Model | MLM Avg (%) |
|-------|-------------|
| boun-tabilab/TabiBERT | **69.57** |
| newmindai/Mursit-Large | 67.25 |
| ytu-ce-cosmos/turkish-large-bert-cased | 65.03 |
| dbmdz/bert-base-turkish-cased | 64.98 |
| *newmindai/Mursit-Base* | *64.05* |
| KocLab-Bilkent/BERTurk-Legal | 54.10 |
| ytu-ce-cosmos/turkish-base-bert-uncased | 52.69 |

*turkish-base-bert-uncased was evaluated only on uncased datasets. Evaluation datasets: blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal. All experiments are reproducible (see Section A.2 in the paper).*
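
For readers unfamiliar with the metric, the sketch below shows one common way to compute it: top-1 accuracy over masked positions only, followed by an unweighted mean across datasets. Both function names are hypothetical, the per-dataset numbers are toy values, and the exact averaging used in the benchmark is an assumption here; the linked benchmark code is the authoritative reference.

```python
def masked_accuracy(predicted_ids, label_ids, ignore_index=-100):
    """Top-1 accuracy over positions whose label is not ignore_index."""
    scored = [(p, y) for p, y in zip(predicted_ids, label_ids) if y != ignore_index]
    if not scored:
        return 0.0
    return sum(p == y for p, y in scored) / len(scored)


def macro_average(per_dataset_scores):
    """Unweighted mean of per-dataset scores."""
    return sum(per_dataset_scores.values()) / len(per_dataset_scores)


# Toy example: only positions 1 and 3 were masked; one of the two is correct
acc = masked_accuracy([7, 42, 9, 13], [-100, 42, -100, 99])
print(acc)

# Toy per-dataset accuracies (illustrative numbers only)
print(macro_average({"dataset_a": 0.60, "dataset_b": 0.70}))
```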

## Performance on MTEB-Turkish Benchmark

The following visualization shows the model's performance compared to other Turkish language models:

*Model Performance Comparison: Legal Score vs. MTEB Score. MLM models (blue circles) form a distinct cluster. Mursit-Base achieves competitive performance among Turkish MLM models.*

This model was evaluated on the comprehensive MTEB-Turkish benchmark for embedding tasks using mean pooling over token representations followed by L2 normalization.

### Comprehensive Benchmark Results

The following table presents evaluation results for all models evaluated on the MTEB-Turkish benchmark. *This model's row is marked in italics.*

| Model | MTEB | Legal | Cls. | Clus. | Pair | Ret. | STS | Cont. | Reg. | Case | Params | Type |
|-------|------|-------|------|-------|------|------|-----|-------|------|------|--------|------|
| embeddinggemma-300m | **65.42** | 50.63 | **77.74** | **45.05** | **80.02** | **55.06** | 69.22 | 83.97 | **39.56** | 28.38 | 307M | Emb. |
| bge-m3 | 62.87 | **51.16** | 75.35 | 35.86 | 78.88 | 54.42 | **69.83** | **86.08** | 38.09 | **29.3** | 567M | Emb. |
| Mursit-Embed-Qwen3-1.7B-TR | 56.84 | 34.76 | 68.46 | 42.22 | 59.67 | 50.1 | 63.77 | 70.22 | 17.94 | 16.11 | 1.7B | CLM-E. |
| Mursit-Large-TR-Retrieval | 56.87 | 46.56 | 67.72 | 41.15 | 59.78 | 51.69 | 64.01 | 81.78 | 32.67 | 25.24 | 403M | Emb. |
| Mursit-Base-TR-Retrieval | 55.86 | 47.52 | 66.25 | 39.75 | 61.31 | 50.07 | 61.9 | 80.4 | 34.1 | 28.07 | 155M | Emb. |
| Mursit-Embed-Qwen3-4B-TR | 53.65 | 37.0 | 67.29 | 36.68 | 58.36 | 51.12 | 54.77 | 69.25 | 24.21 | 17.56 | 4B | CLM-E. |
| bert-base-turkish-uncased | 46.23 | 24.94 | 68.05 | 33.81 | 60.44 | 32.01 | 36.85 | 52.47 | 12.05 | 10.29 | 110M | MLM |
| turkish-large-bert-cased | 45.3 | 19.12 | 67.43 | 34.24 | 60.11 | 28.68 | 36.04 | 47.57 | 5.93 | 3.85 | 337M | MLM |
| bert-base-turkish-cased | 45.17 | 24.41 | 66.39 | 35.28 | 60.05 | 30.52 | 33.62 | 54.03 | 10.13 | 9.07 | 110M | MLM |
| BERTurk-Legal | 42.02 | 32.63 | 60.61 | 26.24 | 59.51 | 25.8 | 37.94 | 61.4 | 15.51 | 20.99 | 184M | MLM |
| Mursit-Large | 41.75 | 23.71 | 62.95 | 25.34 | 58.04 | 27.4 | 35.01 | 42.74 | 11.29 | 17.1 | 403M | MLM |
| turkish-base-bert-uncased | 44.68 | 27.58 | 66.22 | 30.23 | 58.84 | 31.4 | 36.74 | 56.6 | 13.39 | 12.74 | 110M | MLM |
| *Mursit-Base* | *40.23* | *17.93* | *59.78* | *25.48* | *58.65* | *20.82* | *36.45* | *36.0* | *7.4* | *10.4* | *155M* | *MLM* |
| mmBERT-base | 39.65 | 12.15 | 61.84 | 26.77 | 59.25 | 15.83 | 34.56 | 34.45 | 1.33 | 0.68 | 306M | MLM |
| TabiBERT | 37.77 | 11.5 | 59.63 | 25.75 | 58.19 | 14.96 | 30.32 | 32.02 | 1.86 | 0.63 | 148M | MLM |
| ModernBERT-base | 23.8 | 2.99 | 39.06 | 2.01 | 53.95 | 2.1 | 21.91 | 7.92 | 0.62 | 0.43 | 149M | MLM |
| ModernBERT-large | 23.74 | 2.44 | 39.44 | 3.9 | 53.73 | 1.8 | 19.85 | 6.12 | 0.62 | 0.59 | 394M | MLM |

**Column abbreviations:** MTEB = mean performance across task types; Legal = weighted average of Cont., Reg., and Case; Cls. = accuracy on Turkish classification tasks; Clus. = V-measure on clustering tasks; Pair = average precision on pair classification tasks such as NLI; Ret. = nDCG@10 on information retrieval tasks; STS = Spearman correlation on semantic textual similarity tasks; Cont. = nDCG@10 on legal contract retrieval; Reg. = nDCG@10 on regulatory text retrieval; Case = nDCG@10 on case law retrieval; Params = number of model parameters; Type = model type (Emb. = embedding, CLM-E. = CLM-based embedding, MLM = masked language model). **Bold values** indicate the highest score in each column.
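
The two aggregate columns can be cross-checked against the per-task columns. The sketch below recomputes them for the Mursit-Base row using plain unweighted means. That equal weighting is an assumption inferred from the numbers, since the card describes Legal as a weighted average without giving the weights. For this row both recomputed values agree with the table to within rounding.

```python
from statistics import mean

# Mursit-Base row from the table above
mursit_base = {
    "Cls.": 59.78, "Clus.": 25.48, "Pair": 58.65, "Ret.": 20.82, "STS": 36.45,
    "Cont.": 36.0, "Reg.": 7.4, "Case": 10.4,
}

TASK_COLUMNS = ("Cls.", "Clus.", "Pair", "Ret.", "STS")   # assumed inputs to MTEB
LEGAL_COLUMNS = ("Cont.", "Reg.", "Case")                 # inputs to Legal

# Unweighted means (an assumption; the source does not state the weights)
mteb = mean(mursit_base[c] for c in TASK_COLUMNS)
legal = mean(mursit_base[c] for c in LEGAL_COLUMNS)

print(f"MTEB  recomputed: {mteb:.3f}  (table: 40.23)")
print(f"Legal recomputed: {legal:.3f}  (table: 17.93)")
```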

**Key Findings:**

- The model improves substantially over the ModernBERT baselines (which are monolingual English models), validating the effectiveness of Turkish-specific pre-training
- Pre-training alone, without embedding-specific fine-tuning, yields limited utility for retrieval tasks
- Language-specific pre-training is critical, as monolingual English models show limited cross-lingual transfer to Turkish
- Improvements in MLM accuracy do not always translate directly into better downstream task performance

### MLM vs Downstream Performance Analysis

The following visualization shows the relationship between MLM validation loss and downstream retrieval performance:

*Relationship between MLM validation loss and downstream retrieval performance across ModernBERT-base versions v1-v6. This analysis shows how changes in MLM validation loss relate to downstream task performance.*

**Note:** This model is primarily designed for Masked Language Modeling tasks. Embedding performance is provided for reference using standard mean pooling. For optimal retrieval performance, consider using the post-trained retrieval variants (Mursit-Base-TR-Retrieval or Mursit-Large-TR-Retrieval).

## Reproducibility

To reproduce the MLM benchmark results for this model, please refer to:

- **MLM Benchmark Results:** [github.com/newmindai/mecellem-models/benchmark/mlm](https://github.com/newmindai/mecellem-models/tree/main/benchmark/mlm) - Contains code and evaluation configurations for reproducing MLM accuracy scores on Turkish datasets using the 80-10-10 masking strategy.

## Usage

### Installation

```bash
pip install transformers torch
```

### Masked Language Modeling

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("newmindai/Mursit-Base")
model = AutoModelForMaskedLM.from_pretrained("newmindai/Mursit-Base")

# Example text; use the tokenizer's own mask token rather than a hard-coded string
text = f"Türkiye Cumhuriyeti'nin başkenti {tokenizer.mask_token}'dir."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    outputs = model(**inputs)

# Locate the mask position(s) and convert their logits to probabilities
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predictions = torch.nn.functional.softmax(outputs.logits[0, mask_token_index], dim=-1)

# Print the top predictions for the first mask position
top_k = 5
top_indices = torch.topk(predictions[0], top_k).indices
for idx in top_indices:
    token = tokenizer.decode([idx.item()])
    score = predictions[0][idx].item()
    print(f"{token}: {score:.4f}")
```

### Feature Extraction

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("newmindai/Mursit-Base")
model = AutoModel.from_pretrained("newmindai/Mursit-Base")

text = "Türk hukuk sistemi medeni hukuk geleneğine dayanır"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over real tokens only, so padding does not skew batched inputs
mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# L2 normalization, as in the benchmark setup described above
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # (batch_size, 768)
```

# ONNX Model Inference - Masked Language Modeling (MLM)

This section shows how to use the ONNX model from Hugging Face for masked language modeling tasks.

## Exporting Model to ONNX

To export the model to ONNX format for MLM, use the `optimum-cli` command (provided by the `optimum` package):

```bash
optimum-cli export onnx \
  -m newmindai/Mursit-Base \
  --task fill-mask \
  onnx/MursitBase
```

This will create the `model.onnx` file in the specified output directory.

## Installation

```bash
pip install onnxruntime-gpu transformers huggingface_hub numpy
```

## Usage

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

repo_id = "newmindai/Mursit-Base"

# Download the ONNX graph and load the matching tokenizer
onnx_path = hf_hub_download(repo_id, "model.onnx")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Prefer CUDA and fall back to CPU when no GPU is available
sess = ort.InferenceSession(
    onnx_path,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

text = f"Bu bir {tokenizer.mask_token} cümledir."
inputs = tokenizer(text, return_tensors="np")

# Feed only the tensors that the exported graph declares as inputs
feed = {node.name: inputs[node.name] for node in sess.get_inputs()}
logits = sess.run(None, feed)[0]

# Logits at the first mask position
mask_pos = np.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0][0]
mask_logits = logits[0, mask_pos]

# Indices of the top-k logits, highest first
top_k = 5
top_k_ids = np.argsort(mask_logits)[-top_k:][::-1]
predictions = tokenizer.convert_ids_to_tokens(top_k_ids)

print("MASK predictions:")
for p in predictions:
    print(p)
```
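
The ONNX example above prints only the predicted tokens. If probabilities are also wanted, the raw logits at the mask position can be passed through a softmax first. The helper below is a small standard-library sketch; `top_k_probs` is a hypothetical name and not part of any library used here.

```python
import math


def top_k_probs(logits, k=5):
    """Return [(index, probability), ...] for the k largest logits."""
    values = [float(x) for x in logits]
    shift = max(values)                          # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    norm = sum(exps)
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [(i, exps[i] / norm) for i in ranked]


# Toy logits: index 1 has the largest value, so it ranks first
for index, prob in top_k_probs([1.0, 3.0, 2.0], k=2):
    print(index, round(prob, 4))
```

With the variables from the ONNX example, each index returned by `top_k_probs(mask_logits)` can be mapped back to a token via `tokenizer.convert_ids_to_tokens(index)`.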

## Features

- **Automatic GPU/CPU selection**: Uses CUDA if available, otherwise falls back to CPU
- **Hugging Face integration**: Downloads model files directly from Hugging Face Hub
- **Masked token prediction**: Predicts the most likely tokens for masked positions
- **Top-K predictions**: Returns the top K most probable token predictions

## Use Cases

- Turkish language understanding tasks
- Text classification
- Named entity recognition
- Question answering
- Text generation (with fine-tuning)
- Feature extraction for downstream tasks
- Domain adaptation for Turkish legal texts

## Acknowledgments

This work was supported by the EuroHPC Joint Undertaking through project etur46 with access to the MareNostrum 5 supercomputer, hosted by Barcelona Supercomputing Center (BSC), Spain. MareNostrum 5 is owned by EuroHPC JU and operated by BSC. We are grateful to the BSC support team for their assistance with job scheduling, environment configuration, and technical guidance throughout the project.

The numerical calculations reported in this work were fully/partially performed at TÜBİTAK ULAKBİM, High Performance and Grid Computing Center (TRUBA resources). The authors gratefully acknowledge the know-how provided by the MINERVA Support for expert guidance and collaboration opportunities in HPC-AI integration.

## References

If you use this model, please cite our paper:

```bibtex
@article{mecellem2026,
  title={Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain},
  author={Uğur, Özgür and Göksu, Mahmut and Çimen, Mahmut and Yılmaz, Musa and Şavirdi, Esra and Demir, Alp Talha and Güllüce, Rumeysa and Çetin, İclal and Sağbaş, Ömer Can},
  journal={arXiv preprint arXiv:2601.16018},
  year={2026},
  month={January},
  url={https://arxiv.org/abs/2601.16018},
  doi={10.48550/arXiv.2601.16018},
  eprint={2601.16018},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
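
### Base Model References

```bibtex
@misc{modernbert2025,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Warner, Benjamin and Chaffin, Antoine and Clavié, Benjamin and Weller, Orion and Hallström, Oskar and Taghadouini, Said and Gallagher, Alexis and Biswas, Raja and Ladhak, Faisal and Aarsen, Tom and Cooper, Nathan and Adams, Griffin and Howard, Jeremy and Poli, Iacopo},
  year={2024},
  eprint={2412.13663},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.13663}
}
```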