tags:
- ciscoAITeam
- Cyber
- CTI
---
# SecureBERT 2.0 Base Model

SecureBERT 2.0 is a **domain-specific transformer model** built on top of ModernBERT, optimized for cybersecurity tasks. It produces contextualized embeddings for technical text and code, enabling applications such as masked language modeling, semantic search, named entity recognition, vulnerability detection, and code analysis.

Cybersecurity data is highly technical, heterogeneous, and rapidly evolving. SecureBERT 2.0 leverages domain-specific pretraining to capture complex, jargon-heavy, and context-dependent language. It integrates natural language text from threat reports, blogs, technical documentation, and source code, providing superior representations for tasks requiring deep understanding of both language and code.
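
As a quick illustration of the embedding use case mentioned above, the minimal sketch below mean-pools the final hidden states into sentence vectors that can be compared for semantic search. The pooling scheme and example sentences are illustrative assumptions, not a recipe prescribed by the model authors.

```python
# Minimal sketch: build sentence embeddings by mean-pooling the last hidden state.
# Mean pooling is a common convention here, not something specified in this model card.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "CiscoAITeam/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

sentences = [
    "The ransomware encrypts files and demands payment.",
    "Attackers exfiltrated credentials through a phishing campaign.",
]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)

# Average token embeddings, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two example sentences.
print(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0).item())
```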
---
## Model Details

- **Architecture:** ModernBERT
- **Base Model:** answerdotai/ModernBERT-base
- **Pipeline Task:** fill-mask
- **Max Sequence Length:** 1024 tokens (see the truncation sketch below)
- **Language:** English
- **License:** Apache-2.0
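
Because the maximum sequence length is 1024 tokens, longer documents should be truncated or chunked at tokenization time. A minimal sketch, where the report text is a hypothetical placeholder:

```python
# Sketch: cap inputs at the model's 1024-token limit; longer documents are truncated here
# (chunking with overlap is another option). The report text is a hypothetical placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CiscoAITeam/SecureBERT2.0-base")

long_report = "The threat actor moved laterally across the network. " * 500
inputs = tokenizer(long_report, truncation=True, max_length=1024, return_tensors="pt")
print(inputs["input_ids"].shape)  # at most (1, 1024)
```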

---

## Pretraining

### ModernBERT Architecture

ModernBERT introduces extended attention and hierarchical encoding to handle long documents, structured text, and source code efficiently. It supports hybrid tokenization for both natural language and code, enabling multi-modal reasoning and long-range dependency modeling—critical for cybersecurity tasks.

### Pretraining Dataset

SecureBERT 2.0 was pretrained on a large and diverse corpus of approximately **13.6B text tokens** and **53.3M code tokens**, over **13× larger** than the corpus used for the original SecureBERT. The dataset includes:

| Dataset Category | Description |
|------------------|-------------|
| Seed corpus | High-quality curated security articles, reports, and technical blogs |
| Large-scale web text | Open web content filtered for cybersecurity relevance |
| Reasoning-focused data | Security-oriented QA and reasoning datasets |
| Instruction-tuning data | Procedural and instructional texts for cybersecurity workflows |
| Code vulnerability corpus | Annotated open-source code focused on vulnerabilities |
| Cybersecurity dialogue data | Security conversations, Q&A, and analyst workflows |
| Original baseline dataset | Data from the first SecureBERT for continuity |

#### Dataset Statistics

| Dataset Category | Code Tokens | Text Tokens |
|------------------|-------------|-------------|
| Seed corpus | 9,406,451 | 256,859,788 |
| Large-scale web text | 268,993 | 12,231,942,693 |
| Reasoning-focused data | -- | 3,229,293 |
| Instruction-tuning data | 61,590 | 2,336,218 |
| Code vulnerability corpus | 2,146,875 | -- |
| Cybersecurity dialogue data | 41,503,749 | 56,871,556 |
| Original baseline dataset | -- | 1,072,798,637 |
| **Total** | 53,387,658 | 13,623,037,185 |

---

### Pretraining Objectives and Strategies

- **Masked Language Modeling (MLM):** Random tokens in text and code are masked for prediction. Code-specific tokens (identifiers, operators) are masked to improve program understanding.
- **Microannealing Curriculum:** Gradually introduces diverse datasets, balancing high-quality and challenging data for optimal learning.
- **Optimization:** AdamW optimizer, learning rate of 5e-5, weight decay of 0.01, MLM probability of 0.10, 20 epochs, and a per-GPU batch size of 16 across 8 GPUs (see the configuration sketch below).
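
For readers who want to see how these hyperparameters map onto a standard training loop, here is a minimal continued-pretraining sketch using the Hugging Face `Trainer`. The base checkpoint, toy corpus, and output directory are illustrative assumptions; this is not the actual SecureBERT 2.0 training pipeline.

```python
# Illustrative MLM continued-pretraining setup using the hyperparameters reported above.
# The toy corpus and output_dir are placeholders, not the real SecureBERT 2.0 data.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Placeholder corpus; in practice this would be the tokenized cybersecurity dataset.
texts = [
    "The attacker used a phishing email to deliver the malicious payload.",
    "CVE identifiers track publicly disclosed software vulnerabilities.",
]
train_dataset = [tokenizer(t, truncation=True, max_length=1024) for t in texts]

# Mask 10% of tokens, matching the reported MLM probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.10)

args = TrainingArguments(
    output_dir="securebert2-mlm",       # hypothetical output path
    learning_rate=5e-5,                 # reported learning rate
    weight_decay=0.01,                  # reported weight decay
    num_train_epochs=20,                # reported number of epochs
    per_device_train_batch_size=16,     # reported per-GPU batch size (8 GPUs in the original run)
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset, data_collator=collator)
trainer.train()
```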
---

## Performance Evaluation

SecureBERT 2.0 was evaluated on masked language modeling tasks across objects (nouns), actions (verbs), and code tokens:

| Top-n | Objects (Nouns) | Actions (Verbs) | Code Tokens |
|-------|-----------------|-----------------|-------------|
| 1 | 56.20% | 45.02% | 39.27% |
| 2 | 69.73% | 60.00% | 46.90% |
| 3 | 75.85% | 66.68% | 50.87% |
| 4 | 80.01% | 71.56% | 53.36% |
| 5 | 82.72% | 74.12% | 55.41% |
| 10 | 88.80% | 81.64% | 60.03% |
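
As a hedged illustration of how such top-n accuracy can be computed for a single masked token with this model (the sentence, target word, and n below are made-up examples, not the evaluation data behind the table above):

```python
# Sketch: check whether the ground-truth word appears in the model's top-n predictions
# for one masked position. Sentence, target, and n are illustrative placeholders.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "CiscoAITeam/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = f"The attacker escalated privileges on the compromised {tokenizer.mask_token}."
target = " server"
n = 5

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Take the n highest-scoring vocabulary ids at the masked position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_n_ids = logits[0, mask_index].topk(n, dim=-1).indices[0].tolist()

# First sub-token id of the target word.
target_id = tokenizer(target, add_special_tokens=False)["input_ids"][0]
print("hit" if target_id in top_n_ids else "miss")
```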
The figure below presents a comparative study of SecureBERT 2.0, the original SecureBERT, and ModernBERT on the masked language modeling (MLM) task. SecureBERT 2.0 outperforms both baselines, particularly on code understanding and domain-specific terminology.

![Performance Comparison](comparision_plot.png)

## Usage

```bash
pip install transformers torch
```

Load and use the model:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "CiscoAITeam/SecureBERT2.0-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Example masked sentence
text = f"The malware exploits a vulnerability in the {tokenizer.mask_token} system."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Decode the top prediction at the masked position (not over the whole sequence)
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_index].argmax(-1)
predicted_word = tokenizer.decode(predicted_token_id)
print(predicted_word)
```
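
Since the declared pipeline task is `fill-mask`, the same prediction can also be obtained through the higher-level `pipeline` helper; the example sentence below is illustrative.

```python
# Convenience alternative to the manual workflow above, using the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="CiscoAITeam/SecureBERT2.0-base")

# The mask placeholder must match the tokenizer's mask token (assumed to be [MASK] here).
for prediction in fill_mask("The attacker gained access through a phishing [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```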