cisco-ehsan committed b9f07b8 (verified) · 1 Parent(s): a3d5fcb

Update README.md

Files changed (1): README.md (+106 -1)
tags:
- ciscoAITeam
- Cyber
- CTI
---

# SecureBERT 2.0 Base Model

SecureBERT 2.0 is a **domain-specific transformer model** built on ModernBERT and optimized for cybersecurity tasks. It produces contextualized embeddings for technical text and code, enabling applications such as masked language modeling, semantic search, named entity recognition, vulnerability detection, and code analysis.

Cybersecurity data is highly technical, heterogeneous, and rapidly evolving. SecureBERT 2.0 relies on domain-specific pretraining to capture this complex, jargon-heavy, and context-dependent language. Its corpus combines natural-language text from threat reports, blogs, and technical documentation with source code, providing stronger representations for tasks that require a deep understanding of both language and code.

---

## Model Details

- **Architecture:** ModernBERT
- **Base Model:** answerdotai/ModernBERT-base
- **Pipeline Task:** fill-mask (a minimal example follows this list)
- **Max Sequence Length:** 1024 tokens
- **Language:** English
- **License:** Apache-2.0

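As a quick way to exercise the fill-mask pipeline task listed above, the Transformers `pipeline` API can be pointed at the repository directly. This is a minimal sketch rather than an official recipe; it reuses the model id from the Usage section and assumes inputs stay within the 1024-token limit.

```python
from transformers import pipeline

# Fill-mask pipeline over the SecureBERT 2.0 base checkpoint.
fill_mask = pipeline("fill-mask", model="CiscoAITeam/SecureBERT2.0-base")

# Each prediction carries the candidate token and its score.
for prediction in fill_mask("The attacker gained initial access through a phishing [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```
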
---

## Pretraining

### ModernBERT Architecture
ModernBERT introduces extended attention and hierarchical encoding to handle long documents, structured text, and source code efficiently. It supports hybrid tokenization for both natural language and code, enabling multi-modal reasoning and long-range dependency modeling, capabilities that are critical for cybersecurity tasks.

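As a rough illustration of the hybrid text-and-code handling described above, the same tokenizer can be applied to both a prose sentence and a code fragment. This is only a sketch, reusing the model id from the Usage section:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CiscoAITeam/SecureBERT2.0-base")

# The same vocabulary is used for natural-language threat descriptions and for source code.
prose = "The ransomware encrypts user files and demands payment in cryptocurrency."
code = 'strcpy(buffer, user_input);  /* potential buffer overflow */'

print(tokenizer.tokenize(prose))
print(tokenizer.tokenize(code))
```
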
### Pretraining Dataset
SecureBERT 2.0 was pretrained on a large and diverse corpus of approximately **13.6B text tokens** and **53.3M code tokens**, over **13× larger** than the original SecureBERT corpus. The dataset includes:

| Dataset Category | Description |
|-----------------|-------------|
| Seed corpus | High-quality curated security articles, reports, and technical blogs |
| Large-scale web text | Open web content filtered for cybersecurity relevance |
| Reasoning-focused data | Security-oriented QA and reasoning datasets |
| Instruction-tuning data | Procedural and instructional texts for cybersecurity workflows |
| Code vulnerability corpus | Annotated open-source code focused on vulnerabilities |
| Cybersecurity dialogue data | Security conversations, Q&A, and analyst workflows |
| Original baseline dataset | Data from the first SecureBERT for continuity |

#### Dataset Statistics

| Dataset Category | Code Tokens | Text Tokens |
|-----------------|------------|------------|
| Seed corpus | 9,406,451 | 256,859,788 |
| Large-scale web text | 268,993 | 12,231,942,693 |
| Reasoning-focused data | -- | 3,229,293 |
| Instruction-tuning data | 61,590 | 2,336,218 |
| Code vulnerability corpus | 2,146,875 | -- |
| Cybersecurity dialogue data | 41,503,749 | 56,871,556 |
| Original baseline dataset | -- | 1,072,798,637 |
| **Total** | 53,387,658 | 13,623,037,185 |

---

### Pretraining Objectives and Strategies
- **Masked Language Modeling (MLM):** Random tokens in text and code are masked for prediction. Code-specific tokens (identifiers, operators) are also masked to improve program understanding.
- **Microannealing Curriculum:** Gradually introduces diverse datasets, balancing high-quality and challenging data for optimal learning.
- **Optimization:** AdamW optimizer, learning rate of 5e-5, weight decay 0.01, MLM probability 0.10, 20 epochs, per-GPU batch size of 16 across 8 GPUs (a minimal sketch of this setup follows below).

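The snippet below is a minimal sketch of what this masking and optimization setup could look like using Hugging Face `DataCollatorForLanguageModeling` and PyTorch `AdamW`. It mirrors the hyperparameters listed above (MLM probability 0.10, learning rate 5e-5, weight decay 0.01) but is an illustration with toy inputs, not the authors' training code.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

# Start from the base checkpoint named in Model Details.
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Mask 10% of tokens at random, matching the MLM probability reported above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.10)

# AdamW with the reported learning rate and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

# One illustrative step on a tiny batch of hypothetical security text and code.
texts = [
    "The exploit targets a buffer overflow in the SMB service to gain remote code execution.",
    "def sanitize(path):\n    return path.replace('../', '')",
]
encodings = tokenizer(texts, truncation=True, max_length=1024)
batch = collator([{"input_ids": ids} for ids in encodings["input_ids"]])

loss = model(input_ids=batch["input_ids"],
             attention_mask=batch["attention_mask"],
             labels=batch["labels"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```
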
---

## Performance Evaluation

SecureBERT 2.0 was evaluated on masked language modeling tasks across objects (nouns), actions (verbs), and code tokens:

| Top-n | Objects (Nouns) | Actions (Verbs) | Code Tokens |
|-------|-----------------|-----------------|-------------|
| 1 | 56.20% | 45.02% | 39.27% |
| 2 | 69.73% | 60.00% | 46.90% |
| 3 | 75.85% | 66.68% | 50.87% |
| 4 | 80.01% | 71.56% | 53.36% |
| 5 | 82.72% | 74.12% | 55.41% |
| 10 | 88.80% | 81.64% | 60.03% |

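Here, "Top-n" is read as standard top-n accuracy: a prediction counts as a hit if the ground-truth token appears among the model's n highest-scoring candidates for the masked position. The snippet below is a rough sketch of how such a hit can be checked for a single example (assuming the target word maps to a single token); it is illustrative and not the authors' evaluation harness.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "CiscoAITeam/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def top_n_hit(masked_text: str, target_word: str, n: int = 5) -> bool:
    """True if target_word is among the model's top-n predictions for [MASK]."""
    inputs = tokenizer(masked_text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Position of the [MASK] token in the input sequence.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    top_ids = logits[0, mask_pos].topk(n, dim=-1).indices[0]
    candidates = [tokenizer.decode(token_id).strip() for token_id in top_ids]
    return target_word in candidates

print(top_n_hit("The attacker used a phishing [MASK] to steal credentials.", "email", n=5))
```
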
The figure below presents a comparative study of SecureBERT 2.0, SecureBERT, and ModernBERT on the masked language modeling (MLM) task. SecureBERT 2.0 outperforms both the original SecureBERT and the generic ModernBERT, particularly on code understanding and domain-specific terminology.

![Comparison of SecureBERT 2.0, SecureBERT, and ModernBERT on MLM](https://cdn-uploads.huggingface.co/production/uploads/661030b81d2d202e24567c37/o0hxtirn-LV_omHsnBXhd.png)

## Usage
```bash
pip install transformers
```

Load the model and predict a masked token:
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "CiscoAITeam/SecureBERT2.0-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Example masked sentence
text = "The malware exploits a vulnerability in the [MASK] system."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the highest-scoring prediction at the [MASK] position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_index].argmax(-1)
predicted_word = tokenizer.decode(predicted_token_id).strip()
print(predicted_word)
```
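
Beyond fill-mask, the encoder can also be used to obtain the contextual embeddings mentioned above, for example for semantic search. The following is a minimal sketch using mean pooling over the last hidden state; the pooling choice and similarity metric are assumptions, not a prescribed recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "CiscoAITeam/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

sentences = [
    "The trojan establishes persistence via a scheduled task.",
    "Malware maintains persistence using scheduled tasks.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=1024, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```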