cisco-ehsan committed on
Commit d3f6219 · verified · 1 Parent(s): 2c3160a

Update README.md

Files changed (1)
  1. README.md +58 -75
README.md CHANGED
@@ -2,119 +2,102 @@
  license: apache-2.0
  language:
  - en
- base_model:
- - answerdotai/ModernBERT-base
- pipeline_tag: fill-mask
- library_name: transformers
  tags:
  - cybersecurity
  - ciscoAITeam
- - Cyber
  - CTI
  ---

- # SecureBERT 2.0 Base Model
-
- [SecureBERT 2.0](https://arxiv.org/pdf/2510.00240) is a **domain-specific transformer model** built on top of ModernBERT, optimized for cybersecurity tasks. It produces contextualized embeddings for technical text and code, enabling applications such as masked language modeling, semantic search, named entity recognition, vulnerability detection, and code analysis.
-
- Cybersecurity data is highly technical, heterogeneous, and rapidly evolving. SecureBERT 2.0 leverages domain-specific pretraining to capture this complex, jargon-heavy, and context-dependent language. By integrating natural-language text from threat reports, blogs, and technical documentation with source code, it provides superior representations for tasks that require deep understanding of both language and code.
 
  ---

  ## Model Details

- - **Architecture:** ModernBERT
- - **Base Model:** answerdotai/ModernBERT-base
- - **Pipeline Task:** fill-mask
- - **Max Sequence Length:** 1024 tokens
  - **Language:** English
- - **License:** Apache-2.0

- ---

- ## Pretraining
-
- ### ModernBERT Architecture
- ModernBERT introduces extended attention and hierarchical encoding to handle long documents, structured text, and source code efficiently. It supports hybrid tokenization for both natural language and code, enabling multi-modal reasoning and long-range dependency modeling, which is critical for cybersecurity tasks.
-
- ### Pretraining Dataset
- SecureBERT 2.0 was pretrained on a large and diverse corpus of approximately **13.6B text tokens** and **53.3M code tokens**, over **13× larger** than the corpus of the original SecureBERT. The dataset includes:
-
- | Dataset Category | Description |
- |-----------------|-------------|
- | Seed corpus | High-quality curated security articles, reports, and technical blogs |
- | Large-scale web text | Open web content filtered for cybersecurity relevance |
- | Reasoning-focused data | Security-oriented QA and reasoning datasets |
- | Instruction-tuning data | Procedural and instructional texts for cybersecurity workflows |
- | Code vulnerability corpus | Annotated open-source code focused on vulnerabilities |
- | Cybersecurity dialogue data | Security conversations, Q&A, and analyst workflows |
- | Original baseline dataset | Data from the first SecureBERT for continuity |
-
- #### Dataset Statistics
-
- | Dataset Category | Code Tokens | Text Tokens |
- |-----------------|------------|------------|
- | Seed corpus | 9,406,451 | 256,859,788 |
- | Large-scale web text | 268,993 | 12,231,942,693 |
- | Reasoning-focused data | -- | 3,229,293 |
- | Instruction-tuning data | 61,590 | 2,336,218 |
- | Code vulnerability corpus | 2,146,875 | -- |
- | Cybersecurity dialogue data | 41,503,749 | 56,871,556 |
- | Original baseline dataset | -- | 1,072,798,637 |
- | **Total** | 53,387,658 | 13,623,037,185 |

  ---

- ### Pretraining Objectives and Strategies
- - **Masked Language Modeling (MLM):** Random tokens in text and code are masked for prediction; code-specific tokens (identifiers, operators) are masked to improve program understanding.
- - **Microannealing Curriculum:** Diverse datasets are introduced gradually, balancing high-quality and challenging data for optimal learning.
- - **Optimization:** AdamW optimizer, learning rate 5e-5, weight decay 0.01, MLM probability 0.10, 20 epochs, per-GPU batch size 16 across 8 GPUs (a minimal configuration sketch follows this list).
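-
- As a rough illustration (not the released training script), the reported settings map onto a Hugging Face `Trainer` setup roughly as follows; the corpus and output path here are placeholders:
-
- ```python
- from transformers import (AutoModelForMaskedLM, AutoTokenizer,
-                           DataCollatorForLanguageModeling, Trainer, TrainingArguments)
-
- tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
- model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
-
- # Mask 10% of tokens, matching the reported MLM probability
- collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.10)
-
- # Placeholder corpus; the real run used ~13.6B text and ~53.3M code tokens
- texts = ["The malware exploits a vulnerability in the kernel."]
- train_dataset = [tokenizer(t, truncation=True, max_length=1024) for t in texts]
-
- args = TrainingArguments(
-     output_dir="securebert2-mlm",    # placeholder path
-     learning_rate=5e-5,
-     weight_decay=0.01,
-     num_train_epochs=20,
-     per_device_train_batch_size=16,  # reported: batch size 16 per GPU across 8 GPUs
- )
- Trainer(model=model, args=args, data_collator=collator,
-         train_dataset=train_dataset).train()
- ```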
- ---

- ## Performance Evaluation

- SecureBERT 2.0 was evaluated on masked language modeling across objects (nouns), actions (verbs), and code tokens, measuring top-n prediction accuracy (a short sketch of the metric follows the table):
-
- | Top-n | Objects (Nouns) | Actions (Verbs) | Code Tokens |
- |-------|-----------------|-----------------|-------------|
- | 1 | 56.20% | 45.02% | 39.27% |
- | 2 | 69.73% | 60.00% | 46.90% |
- | 3 | 75.85% | 66.68% | 50.87% |
- | 4 | 80.01% | 71.56% | 53.36% |
- | 5 | 82.72% | 74.12% | 55.41% |
- | 10 | 88.80% | 81.64% | 60.03% |
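-
- As a self-contained sketch of the metric (not the actual evaluation harness): a prediction counts as a top-n hit if the ground-truth token is among the model's n highest-scoring candidates at the masked position.
-
- ```python
- import torch
-
- def top_n_hit(logits_at_mask: torch.Tensor, true_id: int, n: int) -> bool:
-     # logits_at_mask: (vocab_size,) scores for the [MASK] position
-     return true_id in logits_at_mask.topk(n).indices
- ```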
- The figure below compares SecureBERT 2.0, SecureBERT, and ModernBERT on the masked language modeling (MLM) task; SecureBERT 2.0 outperforms both, particularly on code understanding and domain-specific terms.
- ![image](https://cdn-uploads.huggingface.co/production/uploads/661030b81d2d202e24567c37/o0hxtirn-LV_omHsnBXhd.png)

- ## Usage

- ```bash
- pip install transformers
- ```

- Load and use the model:
  ```python
  from transformers import AutoModelForMaskedLM, AutoTokenizer

  model_name = "CiscoAITeam/SecureBERT2.0-base"
-
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForMaskedLM.from_pretrained(model_name)

- # Example masked sentence
  text = "The malware exploits a vulnerability in the [MASK] system."
  inputs = tokenizer(text, return_tensors="pt")
  outputs = model(**inputs)

- # Get predictions
  # Decode the top prediction at the [MASK] position only
  mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
  predicted_token_id = outputs.logits[0, mask_index].argmax(-1)
  predicted_word = tokenizer.decode(predicted_token_id)
  print(predicted_word)
-
-
- ```
- SecureBERT 2.0 outperforms both the original SecureBERT and generic ModernBERT, particularly in code understanding and domain-specific terms.
 
  license: apache-2.0
  language:
  - en
  tags:
+ - fill-mask
+ - transformers
+ - safetensors
+ - modernbert
  - cybersecurity
  - ciscoAITeam
+ - code
  - CTI
+ datasets:
+ - custom
+ library_name: transformers
+ pipeline_tag: fill-mask
+ model-index:
+ - name: SecureBERT2.0-base
+   results: []
  ---

+ # Model Card for CiscoAITeam/SecureBERT2.0-base

+ SecureBERT 2.0 Base is a domain-specific transformer model optimized for cybersecurity tasks. It extends the ModernBERT architecture with cybersecurity-focused pretraining to produce contextualized embeddings for both technical text and code. SecureBERT 2.0 supports tasks such as masked language modeling, semantic search, named entity recognition, vulnerability detection, and code analysis.

  ---

  ## Model Details

+ ### Model Description
+
+ SecureBERT 2.0 Base is designed for **deep contextual understanding of cybersecurity language and code**. It leverages domain-specific pretraining on a large, heterogeneous corpus covering threat reports, blogs, documentation, and codebases, making it effective for reasoning across natural language and programming syntax.
+
+ - **Developed by:** Cisco AI Team
+ - **Model type:** Transformer (ModernBERT architecture)
  - **Language:** English
+ - **License:** Apache 2.0
+ - **Finetuned from model:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)

+ ### Model Sources

+ - **Repository:** [https://huggingface.co/CiscoAITeam/SecureBERT2.0-base](https://huggingface.co/CiscoAITeam/SecureBERT2.0-base)
+ - **Paper:** [arXiv:2510.00240](https://arxiv.org/abs/2510.00240)

  ---

+ ## Uses

+ ### Direct Use

+ - Masked language modeling for cybersecurity text and code
+ - Embedding generation for semantic search and retrieval (see the embedding sketch after this list)
+ - Code and text feature extraction for downstream classification or clustering
+ - Named entity recognition (NER) on security-related entities
+ - Vulnerability detection in source code
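+
+ As a minimal sketch (illustrative, not an official recipe), sentence-level embeddings can be derived by mean-pooling the encoder's last hidden state:
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("CiscoAITeam/SecureBERT2.0-base")
+ model = AutoModel.from_pretrained("CiscoAITeam/SecureBERT2.0-base")
+
+ def embed(text: str) -> torch.Tensor:
+     # Mean-pool the last hidden state over non-padding tokens
+     inputs = tokenizer(text, return_tensors="pt", truncation=True)
+     with torch.no_grad():
+         hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
+     mask = inputs["attention_mask"].unsqueeze(-1)
+     return (hidden * mask).sum(1) / mask.sum(1)
+
+ # Cosine similarity between two related security statements
+ a = embed("The malware exploits a kernel vulnerability.")
+ b = embed("An attacker abuses a flaw in the operating system.")
+ print(torch.cosine_similarity(a, b))
+ ```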

+ ### Downstream Use

+ Fine-tuning for:
+ - Threat intelligence extraction
+ - Security question answering
+ - Incident analysis and summarization
+ - Automated code review and vulnerability prediction (a fine-tuning sketch follows this list)
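+
+ As an illustrative sketch (not an official recipe), vulnerability prediction can be framed as binary sequence classification on top of the pretrained encoder; the snippet and labels below are placeholders:
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("CiscoAITeam/SecureBERT2.0-base")
+ # Adds a freshly initialized 2-class head on top of the encoder
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "CiscoAITeam/SecureBERT2.0-base", num_labels=2
+ )
+
+ snippet = "strcpy(buf, user_input);"  # placeholder; 1 = vulnerable, 0 = safe
+ inputs = tokenizer(snippet, return_tensors="pt", truncation=True)
+ print(model(**inputs).logits.softmax(-1))  # head is untrained: fine-tune on labeled data first
+ ```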
 
 
 
+ ### Out-of-Scope Use

+ - Non-English or non-technical text
+ - General-purpose conversational AI
+ - Decision-making in real-time security systems without human oversight

+ ---
+
+ ## Bias, Risks, and Limitations
+
+ The model reflects biases in the cybersecurity sources it was trained on, which may include:
+ - Overrepresentation of certain threat actors, technologies, or organizations
+ - Inconsistent code or documentation quality
+ - Limited exposure to non-public or proprietary data formats
+
+ ### Recommendations
+
+ Users should evaluate outputs in their specific context and avoid automated high-stakes decisions without expert validation.

+ ---
+
+ ## How to Get Started with the Model
 
  ```python
  from transformers import AutoModelForMaskedLM, AutoTokenizer

  model_name = "CiscoAITeam/SecureBERT2.0-base"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForMaskedLM.from_pretrained(model_name)

  text = "The malware exploits a vulnerability in the [MASK] system."
  inputs = tokenizer(text, return_tensors="pt")
  outputs = model(**inputs)

  # Decode the top prediction at the [MASK] position only
  mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
  predicted_token_id = outputs.logits[0, mask_index].argmax(-1)
  predicted_word = tokenizer.decode(predicted_token_id)
  print(predicted_word)
+ ```
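+
+ Equivalently, the `fill-mask` pipeline (the task declared in the model metadata) wraps the same steps; a brief sketch:
+
+ ```python
+ from transformers import pipeline
+
+ # Return the top-5 candidates for the masked token
+ fill = pipeline("fill-mask", model="CiscoAITeam/SecureBERT2.0-base", top_k=5)
+ for pred in fill("The malware exploits a vulnerability in the [MASK] system."):
+     print(pred["token_str"], round(pred["score"], 3))
+ ```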