cisco-ehsan committed on
Commit 9186e1b · verified · 1 Parent(s): d3f6219

Update README.md

  1. README.md +87 -1
README.md CHANGED
@@ -100,4 +100,90 @@ outputs = model(**inputs)
  predicted_token_id = outputs.logits.argmax(-1)
  predicted_word = tokenizer.decode(predicted_token_id[0])
  print(predicted_word)
- ```
+ ```
+
+ ## Training Details
+
+ ### Training Procedure
+
+ #### Preprocessing
+
+ Hybrid tokenization for text and code (natural language + structured syntax).
+
+ #### Training Hyperparameters
+
+ - **Objective:** Masked Language Modeling (MLM)
+ - **Masking probability:** 0.10
+ - **Optimizer:** AdamW
+ - **Learning rate:** 5e-5
+ - **Weight decay:** 0.01
+ - **Epochs:** 20
+ - **Batch size:** 16 per GPU × 8 GPUs
+ - **Curriculum:** Microannealing (gradual dataset diversification)
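
The batch-size entry above implies an effective batch size per optimizer step. The following is a minimal sketch of that arithmetic, not the authors' training script: the `MLMTrainingConfig` container and the `grad_accum_steps` field are hypothetical, and since the README does not mention gradient accumulation, a single step is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLMTrainingConfig:
    """Hyperparameters listed in the README (the container itself is hypothetical)."""

    mlm_probability: float = 0.10
    optimizer: str = "AdamW"
    learning_rate: float = 5e-5
    weight_decay: float = 0.01
    num_epochs: int = 20
    per_gpu_batch_size: int = 16
    num_gpus: int = 8
    # Assumption: not stated in the README; 1 means no gradient accumulation.
    grad_accum_steps: int = 1

    @property
    def effective_batch_size(self) -> int:
        # Sequences contributing to each optimizer step across all GPUs.
        return self.per_gpu_batch_size * self.num_gpus * self.grad_accum_steps


config = MLMTrainingConfig()
print(config.effective_batch_size)  # 16 * 8 * 1 = 128
```

If gradient accumulation was in fact used, the effective batch size scales by that factor.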
+
+ ---
+
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ Internal held-out subset of cybersecurity and code corpora.
+
+ #### Factors
+
+ Evaluated across token categories:
+ - Objects (nouns)
+ - Actions (verbs)
+ - Code tokens
+
+ #### Metrics
+
+ Top-n accuracy on masked token prediction.
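
For readers unfamiliar with the metric, top-n accuracy counts a masked position as correct when the true token is among the n highest-scoring candidates. The sketch below illustrates that definition in plain Python. The function name `top_n_accuracy` and the toy scores are assumptions made for illustration; this is not the evaluation code behind the reported results.

```python
from typing import Sequence


def top_n_accuracy(
    scores: Sequence[Sequence[float]], targets: Sequence[int], n: int
) -> float:
    """Fraction of masked positions whose true token id is among the n top-scoring ids."""
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if not targets:
        raise ValueError("at least one masked position is required")
    hits = 0
    for row, target in zip(scores, targets):
        # Rank vocabulary ids by descending score and keep the first n.
        top_ids = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:n]
        hits += target in top_ids
    return hits / len(targets)


# Toy example: 3 masked positions over a 4-token vocabulary (values are made up).
toy_scores = [
    [0.1, 0.7, 0.1, 0.1],  # true id 1 is ranked first
    [0.4, 0.3, 0.2, 0.1],  # true id 1 is ranked second
    [0.1, 0.2, 0.3, 0.4],  # true id 0 is ranked last
]
toy_targets = [1, 1, 0]
top1 = top_n_accuracy(toy_scores, toy_targets, n=1)  # 1 of 3 positions correct
top2 = top_n_accuracy(toy_scores, toy_targets, n=2)  # 2 of 3 positions correct
```

By construction the metric is non-decreasing in n, since every top-n hit is also a top-(n+1) hit.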
+
+ ### Results
+
+ | Top-n | Objects (Nouns) | Verbs (Actions) | Code Tokens |
+ |:------|:---------------:|:----------------:|:-------------:|
+ | 1 | 56.20 % | 45.02 % | 39.27 % |
+ | 2 | 69.73 % | 60.00 % | 46.90 % |
+ | 3 | 75.85 % | 66.68 % | 50.87 % |
+ | 4 | 80.01 % | 71.56 % | 53.36 % |
+ | 5 | 82.72 % | 74.12 % | 55.41 % |
+ | 10 | 88.80 % | 81.64 % | 60.03 % |
+
+ #### Summary
+
+ SecureBERT 2.0 outperforms both the original SecureBERT and ModernBERT on cybersecurity-specific and code-related tasks.
+
+ ---
+
+ ## Environmental Impact
+
+ - **Hardware Type:** 8× GPU cluster
+ - **Hours used:** [Information Not Available]
+ - **Cloud Provider:** [Information Not Available]
+ - **Compute Region:** [Information Not Available]
+ - **Carbon Emitted:** [Estimate Not Available]
+
+ Carbon footprint can be estimated using [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
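
As a rough illustration of that kind of estimate, emissions can be approximated as energy consumed multiplied by the carbon intensity of the local grid. The sketch below is a simplified version of that idea and not the calculator from the cited paper. The function name and every parameter are hypothetical, and the README reports none of the required inputs (per-GPU power draw, training hours, grid intensity), so no figure for this model is given.

```python
def estimate_co2_kg(
    gpu_power_watts: float,
    num_gpus: int,
    hours: float,
    grid_kg_co2_per_kwh: float,
) -> float:
    """Approximate emissions in kg CO2-equivalent for a training run.

    Simplification: assumes every GPU draws `gpu_power_watts` for the whole run
    and ignores CPU, cooling, and other datacenter overhead.
    """
    if min(gpu_power_watts, num_gpus, hours, grid_kg_co2_per_kwh) < 0:
        raise ValueError("all inputs must be non-negative")
    # Watts -> kilowatts, then multiply by hours to get kilowatt-hours.
    energy_kwh = gpu_power_watts * num_gpus / 1000.0 * hours
    return energy_kwh * grid_kg_co2_per_kwh
```

Only the GPU count (8) is known from the README; the remaining inputs would have to come from the training logs and the hosting region.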
+
+ ---
+
+ ## Technical Specifications
+
+ ### Model Architecture and Objective
+
+ - **Architecture:** ModernBERT
+ - **Max sequence length:** 1024 tokens
+ - **Parameters:** 150 M
+ - **Objective:** Masked Language Modeling (MLM)
+ - **Tensor type:** F32
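
The parameter count and tensor type above give a back-of-the-envelope size for the stored weights. This sketch treats "150 M" as a round 150,000,000 (an assumption; the exact count is not given) and covers weights only, not optimizer state or activations. The helper name `weight_footprint_bytes` is hypothetical.

```python
BYTES_PER_F32 = 4  # a 32-bit float occupies 4 bytes


def weight_footprint_bytes(num_parameters: int, bytes_per_param: int = BYTES_PER_F32) -> int:
    """Bytes needed to store the weights alone at the given precision."""
    return num_parameters * bytes_per_param


params = 150_000_000  # "150 M" from the README, taken as a round figure
size_bytes = weight_footprint_bytes(params)
print(f"{size_bytes / 1e6:.0f} MB")     # 600 MB
print(f"{size_bytes / 2**20:.1f} MiB")  # 572.2 MiB
```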
+
+ ### Compute Infrastructure
+
+ - **Framework:** Transformers (PyTorch)
+ - **Precision:** fp32 (no mixed precision)
+ - **Hardware:** 8 GPUs
+ - **Checkpoint Format:** Safetensors