Update README.md
README.md
CHANGED
@@ -2,119 +2,102 @@
license: apache-2.0
language:
- en
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: fill-mask
library_name: transformers
tags:
- cybersecurity
- ciscoAITeam
- CTI
---

[SecureBERT 2.0](https://arxiv.org/pdf/2510.00240) is a **domain-specific transformer model** built on top of ModernBERT, optimized for cybersecurity tasks. It produces contextualized embeddings for technical text and code, enabling applications such as masked language modeling, semantic search, named entity recognition, vulnerability detection, and code analysis.

Cybersecurity data is highly technical, heterogeneous, and rapidly evolving. SecureBERT 2.0 leverages domain-specific pretraining to capture complex, jargon-heavy, and context-dependent language. It integrates natural language text from threat reports, blogs, technical documentation, and source code, providing superior representations for tasks requiring deep understanding of both language and code.

---

## Model Details

- **Language:** English
- **License:** Apache 2.0

### ModernBERT Architecture

ModernBERT introduces extended attention and hierarchical encoding to handle long documents, structured text, and source code efficiently. It supports hybrid tokenization for both natural language and code, enabling multi-modal reasoning and long-range dependency modeling, both of which are critical for cybersecurity tasks.
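
Because of this long-context design, lengthy documents can be encoded in a single forward pass. The sketch below is illustrative only: it assumes the model inherits ModernBERT's 8,192-token context window, reads a hypothetical local file, and mean-pools token embeddings into one document vector.

```python
# Illustrative long-document embedding sketch (assumes an 8,192-token window
# inherited from ModernBERT; the file path and pooling choice are examples).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "CiscoAITeam/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

long_report = open("incident_report.txt").read()  # hypothetical input file

inputs = tokenizer(long_report, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# Mean-pool token embeddings into a single document vector.
mask = inputs["attention_mask"].unsqueeze(-1)
doc_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(doc_embedding.shape)  # torch.Size([1, hidden_size])
```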

### Pretraining Dataset

SecureBERT 2.0 was pretrained on a large and diverse corpus of approximately **13.6B text tokens** and **53.3M code tokens**, over **13× larger** than the original SecureBERT's training corpus. The dataset includes:

| Dataset Category | Description |
|-----------------|-------------|
| Seed corpus | High-quality curated security articles, reports, and technical blogs |
| Large-scale web text | Open web content filtered for cybersecurity relevance |
| Reasoning-focused data | Security-oriented QA and reasoning datasets |
| Instruction-tuning data | Procedural and instructional texts for cybersecurity workflows |
| Code vulnerability corpus | Annotated open-source code focused on vulnerabilities |
| Cybersecurity dialogue data | Security conversations, Q&A, and analyst workflows |
| Original baseline dataset | Data from the first SecureBERT for continuity |

#### Dataset Statistics

| Dataset Category | Code Tokens | Text Tokens |
|-----------------|------------|------------|
| Seed corpus | 9,406,451 | 256,859,788 |
| Large-scale web text | 268,993 | 12,231,942,693 |
| Reasoning-focused data | -- | 3,229,293 |
| Instruction-tuning data | 61,590 | 2,336,218 |
| Code vulnerability corpus | 2,146,875 | -- |
| Cybersecurity dialogue data | 41,503,749 | 56,871,556 |
| Original baseline dataset | -- | 1,072,798,637 |
| **Total** | 53,387,658 | 13,623,037,185 |

---

## Training Procedure

- **Masked Language Modeling (MLM):** Random tokens in text and code are masked for prediction. Code-specific tokens (identifiers, operators) are masked to improve program understanding.
- **Microannealing Curriculum:** Gradually introduces diverse datasets, balancing high-quality and challenging data for optimal learning.
- **Optimization:** AdamW optimizer, learning rate of 5e-5, weight decay 0.01, MLM probability 0.10, 20 epochs, per-GPU batch size of 16 across 8 GPUs (a minimal configuration sketch follows this list).
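
The following is a minimal sketch of how such an MLM run could be wired up with the Hugging Face `Trainer`, using the hyperparameters listed above. The corpus loading and the microannealing schedule are not shown, and the `train_dataset` variable is a placeholder.

```python
# Minimal MLM pretraining sketch mirroring the hyperparameters stated above
# (AdamW, lr 5e-5, weight decay 0.01, MLM probability 0.10, 20 epochs,
# batch size 16 per device). Dataset loading and the curriculum are placeholders.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

train_dataset = ...  # placeholder: a tokenized corpus of security text and code

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.10
)

args = TrainingArguments(
    output_dir="securebert2-mlm",
    learning_rate=5e-5,              # Trainer's default optimizer is AdamW
    weight_decay=0.01,
    num_train_epochs=20,
    per_device_train_batch_size=16,  # the card reports 8 GPUs in parallel
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
).train()
```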

| 4 | 80.01% | 71.56% | 53.36% |
| 5 | 82.72% | 74.12% | 55.41% |
| 10 | 88.80% | 81.64% | 60.03% |

Load and use:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "CiscoAITeam/SecureBERT2.0-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Example masked sentence
text = "The malware exploits a vulnerability in the [MASK] system."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get the prediction at the [MASK] position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_index].argmax(-1)
predicted_word = tokenizer.decode(predicted_token_id)
print(predicted_word)
```

SecureBERT 2.0 outperforms both the original SecureBERT and generic ModernBERT, particularly in code understanding and on domain-specific terminology.

license: apache-2.0
language:
- en
tags:
- fill-mask
- transformers
- safetensors
- modernbert
- cybersecurity
- ciscoAITeam
- code
- CTI
datasets:
- custom
library_name: transformers
pipeline_tag: fill-mask
model-index:
- name: SecureBERT2.0-base
  results: []
---

# Model Card for CiscoAITeam/SecureBERT2.0-base

SecureBERT 2.0 Base is a domain-specific transformer model optimized for cybersecurity tasks. It extends the ModernBERT architecture with cybersecurity-focused pretraining to produce contextualized embeddings for both technical text and code. SecureBERT 2.0 supports tasks like masked language modeling, semantic search, named entity recognition, vulnerability detection, and code analysis.

---

## Model Details

### Model Description

SecureBERT 2.0 Base is designed for **deep contextual understanding of cybersecurity language and code**. It leverages domain-specific pretraining on a large, heterogeneous corpus covering threat reports, blogs, documentation, and codebases, making it effective for reasoning across natural language and programming syntax.

- **Developed by:** Cisco AI Team
- **Model type:** Transformer (ModernBERT architecture)
- **Language:** English
- **License:** Apache 2.0
- **Finetuned from model:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)

### Model Sources

- **Repository:** [https://huggingface.co/CiscoAITeam/SecureBERT2.0-base](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base)
- **Paper:** [arXiv:2510.00240](https://arxiv.org/abs/2510.00240)

---

## Uses

### Direct Use

- Masked language modeling for cybersecurity text and code
- Embedding generation for semantic search and retrieval (a short embedding sketch follows this list)
- Code and text feature extraction for downstream classification or clustering
- Named entity recognition (NER) on security-related entities
- Vulnerability detection in source code
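
As an illustration of the embedding use case, the sketch below ranks a few example snippets against a query by cosine similarity of mean-pooled embeddings. The snippets and the pooling strategy are illustrative choices, not part of the model.

```python
# Illustrative semantic-search sketch: mean-pooled embeddings + cosine similarity.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "CiscoAITeam/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(texts):
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query = embed(["privilege escalation via kernel exploit"])
corpus = embed([
    "CVE-2021-4034: local privilege escalation in polkit's pkexec",
    "Phishing campaign targeting financial institutions",
    "SQL injection in a legacy web application",
])

scores = F.cosine_similarity(query, corpus)  # higher = more relevant
print(scores)
```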

### Downstream Use

Fine-tuning for (a minimal fine-tuning sketch follows this list):

- Threat intelligence extraction
- Security question answering
- Incident analysis and summarization
- Automated code review and vulnerability prediction
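
As a starting point for such fine-tuning, the sketch below treats vulnerability prediction as binary sequence classification. The dataset, label scheme, and hyperparameters are placeholders rather than a recommended recipe.

```python
# Illustrative fine-tuning sketch: vulnerability prediction as binary
# sequence classification. Datasets and hyperparameters are placeholders.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "CiscoAITeam/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # e.g. 0 = benign, 1 = vulnerable
)

# Placeholders: tokenized datasets of labeled code snippets with a "labels" column.
train_dataset, eval_dataset = ..., ...

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="securebert2-vuln", num_train_epochs=3),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```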

### Out-of-Scope Use

- Non-English or non-technical text
- General-purpose conversational AI
- Decision-making in real-time security systems without human oversight

---

## Bias, Risks, and Limitations

The model reflects biases in the cybersecurity sources it was trained on, which may include:

- Overrepresentation of certain threat actors, technologies, or organizations
- Inconsistent code or documentation quality
- Limited exposure to non-public or proprietary data formats

### Recommendations

Users should evaluate outputs in their specific context and avoid automated high-stakes decisions without expert validation.

---

## How to Get Started with the Model

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "CiscoAITeam/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Predict the token at the [MASK] position
text = "The malware exploits a vulnerability in the [MASK] system."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_index].argmax(-1)
predicted_word = tokenizer.decode(predicted_token_id)
print(predicted_word)
```
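
Alternatively, the `fill-mask` pipeline returns the top-k candidate tokens for the masked position directly:

```python
# Optional: use the fill-mask pipeline to get ranked candidate tokens.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="CiscoAITeam/SecureBERT2.0-base")
for candidate in fill_mask("The malware exploits a vulnerability in the [MASK] system.", top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```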