# Decode the prediction at the [MASK] position rather than the whole sequence.
masked_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_token_id = outputs.logits[0, masked_index].argmax(-1)
predicted_word = tokenizer.decode(predicted_token_id)
print(predicted_word)
```

## Training Details

### Training Procedure

#### Preprocessing

Hybrid tokenization for text and code (natural language + structured syntax).
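
For illustration, a mixed natural-language-plus-code input can be passed straight through a Hugging Face tokenizer; this is a minimal sketch, and the repo id is a placeholder rather than the published checkpoint path:

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual SecureBERT 2.0 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("path/to/SecureBERT-2.0")

# Mixed input: natural-language prose followed by structured code syntax.
sample = "The exploit calls the vulnerable function: strcpy(buf, user_input);"

# Inspect how prose and code fragments are segmented into subword tokens.
print(tokenizer.tokenize(sample))
```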

#### Training Hyperparameters

- **Objective:** Masked Language Modeling (MLM)
- **Masking probability:** 0.10
- **Optimizer:** AdamW
- **Learning rate:** 5e-5
- **Weight decay:** 0.01
- **Epochs:** 20
- **Batch size:** 16 per GPU × 8 GPUs
- **Curriculum:** Microannealing (gradual dataset diversification)
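
As a minimal sketch, these hyperparameters map onto a standard Hugging Face MLM setup as follows; the repo id and `tokenized_dataset` are placeholders, and the microannealing curriculum schedule is not shown:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder paths; substitute the real checkpoint and a pre-tokenized dataset.
tokenizer = AutoTokenizer.from_pretrained("path/to/SecureBERT-2.0")
model = AutoModelForMaskedLM.from_pretrained("path/to/SecureBERT-2.0")

# MLM objective with a 10% masking probability, per the list above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.10
)

args = TrainingArguments(
    output_dir="securebert2-mlm",
    learning_rate=5e-5,              # AdamW is the Trainer default optimizer
    weight_decay=0.01,
    num_train_epochs=20,
    per_device_train_batch_size=16,  # x8 GPUs gives an effective batch of 128
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized_dataset,  # assumed prepared beforehand
)
trainer.train()
```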

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Internal held-out subset of cybersecurity and code corpora.

#### Factors

Evaluated across token categories:

- Objects (nouns)
- Actions (verbs)
- Code tokens

#### Metrics

Top-n accuracy on masked token prediction: the fraction of masked positions where the correct token appears among the model's n highest-scoring predictions.
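
A minimal sketch of the metric, assuming logits gathered at the masked positions and the corresponding gold token ids:

```python
import torch

def top_n_accuracy(logits: torch.Tensor, labels: torch.Tensor, n: int = 5) -> float:
    """Fraction of masked positions whose gold token is in the top-n predictions.

    logits: (num_masked, vocab_size) scores at the masked positions
    labels: (num_masked,) gold token ids at those positions
    """
    top_n = logits.topk(n, dim=-1).indices          # (num_masked, n)
    hits = (top_n == labels.unsqueeze(-1)).any(-1)  # (num_masked,)
    return hits.float().mean().item()

# Toy example: 2 masked positions over a 6-token vocabulary.
logits = torch.tensor([[0.1, 2.0, 0.3, 0.0, 1.5, 0.2],
                       [1.0, 0.2, 0.1, 3.0, 0.4, 0.0]])
labels = torch.tensor([4, 3])
print(top_n_accuracy(logits, labels, n=2))  # 1.0: both gold ids are in the top-2
```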

### Results

| Top-n | Objects (Nouns) | Actions (Verbs) | Code Tokens |
|:------|:---------------:|:---------------:|:-----------:|
| 1     | 56.20 %         | 45.02 %         | 39.27 %     |
| 2     | 69.73 %         | 60.00 %         | 46.90 %     |
| 3     | 75.85 %         | 66.68 %         | 50.87 %     |
| 4     | 80.01 %         | 71.56 %         | 53.36 %     |
| 5     | 82.72 %         | 74.12 %         | 55.41 %     |
| 10    | 88.80 %         | 81.64 %         | 60.03 %     |

#### Summary

SecureBERT 2.0 outperforms both the original SecureBERT and ModernBERT on cybersecurity-specific and code-related tasks.

---

## Environmental Impact

- **Hardware Type:** 8× GPU cluster
- **Hours used:** [Information Not Available]
- **Cloud Provider:** [Information Not Available]
- **Compute Region:** [Information Not Available]
- **Carbon Emitted:** [Estimate Not Available]

Carbon footprint can be estimated using the methodology of [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

---

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** ModernBERT
- **Max sequence length:** 1024 tokens
- **Parameters:** 150 M
- **Objective:** Masked Language Modeling (MLM)
- **Tensor type:** F32
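
As a quick sanity check of these figures, the checkpoint can be inspected after loading; the repo id is again a placeholder, and the config attribute assumes a ModernBERT-style configuration:

```python
from transformers import AutoModelForMaskedLM

# Placeholder repo id; substitute the actual SecureBERT 2.0 checkpoint.
model = AutoModelForMaskedLM.from_pretrained("path/to/SecureBERT-2.0")

# Roughly 150M parameters are expected for this configuration.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")

# ModernBERT-style configs expose the maximum sequence length here.
print(model.config.max_position_embeddings)  # expected: 1024
```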

### Compute Infrastructure

- **Framework:** Transformers (PyTorch)
- **Precision:** fp32 (mixed precision not used)
- **Hardware:** 8 GPUs
- **Checkpoint Format:** Safetensors