# Decode the prediction at the [MASK] position rather than the whole sequence.
masked_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_token_id = outputs.logits[0, masked_index].argmax(-1)
predicted_word = tokenizer.decode(predicted_token_id)
print(predicted_word)
```

## Training Details

### Training Procedure

#### Preprocessing

Hybrid tokenization for text and code (natural language + structured syntax).
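
For illustration, a mixed natural-language-plus-code input can be passed straight through a Hugging Face tokenizer; this is a minimal sketch, and the repo id is a placeholder rather than the published checkpoint path:

```python
from transformers import AutoTokenizer

# Placeholder repo id; substitute the actual SecureBERT 2.0 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("path/to/SecureBERT-2.0")

# Mixed input: natural-language prose followed by structured code syntax.
sample = "The exploit calls the vulnerable function: strcpy(buf, user_input);"

# Inspect how prose and code fragments are segmented into subword tokens.
print(tokenizer.tokenize(sample))
```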

#### Training Hyperparameters

- **Objective:** Masked Language Modeling (MLM)
- **Masking probability:** 0.10
- **Optimizer:** AdamW
- **Learning rate:** 5e-5
- **Weight decay:** 0.01
- **Epochs:** 20
- **Batch size:** 16 per GPU × 8 GPUs
- **Curriculum:** Microannealing (gradual dataset diversification)
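
As a minimal sketch, these hyperparameters map onto a standard Hugging Face MLM setup as follows; the repo id and `tokenized_dataset` are placeholders, and the microannealing curriculum schedule is not shown:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder paths; substitute the real checkpoint and a pre-tokenized dataset.
tokenizer = AutoTokenizer.from_pretrained("path/to/SecureBERT-2.0")
model = AutoModelForMaskedLM.from_pretrained("path/to/SecureBERT-2.0")

# MLM objective with a 10% masking probability, per the list above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.10
)

args = TrainingArguments(
    output_dir="securebert2-mlm",
    learning_rate=5e-5,              # AdamW is the Trainer default optimizer
    weight_decay=0.01,
    num_train_epochs=20,
    per_device_train_batch_size=16,  # x8 GPUs gives an effective batch of 128
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized_dataset,  # assumed prepared beforehand
)
trainer.train()
```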

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Internal held-out subset of cybersecurity and code corpora.

#### Factors

Evaluated across token categories:

- Objects (nouns)
- Actions (verbs)
- Code tokens

#### Metrics

Top-n accuracy on masked token prediction: the fraction of masked positions where the correct token appears among the model's n highest-scoring predictions.
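
A minimal sketch of the metric, assuming logits gathered at the masked positions and the corresponding gold token ids:

```python
import torch

def top_n_accuracy(logits: torch.Tensor, labels: torch.Tensor, n: int = 5) -> float:
    """Fraction of masked positions whose gold token is in the top-n predictions.

    logits: (num_masked, vocab_size) scores at the masked positions
    labels: (num_masked,) gold token ids at those positions
    """
    top_n = logits.topk(n, dim=-1).indices          # (num_masked, n)
    hits = (top_n == labels.unsqueeze(-1)).any(-1)  # (num_masked,)
    return hits.float().mean().item()

# Toy example: 2 masked positions over a 6-token vocabulary.
logits = torch.tensor([[0.1, 2.0, 0.3, 0.0, 1.5, 0.2],
                       [1.0, 0.2, 0.1, 3.0, 0.4, 0.0]])
labels = torch.tensor([4, 3])
print(top_n_accuracy(logits, labels, n=2))  # 1.0: both gold ids are in the top-2
```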

### Results

| Top-n | Objects (Nouns) | Actions (Verbs) | Code Tokens |
|:------|:---------------:|:---------------:|:-----------:|
| 1     | 56.20 %         | 45.02 %         | 39.27 %     |
| 2     | 69.73 %         | 60.00 %         | 46.90 %     |
| 3     | 75.85 %         | 66.68 %         | 50.87 %     |
| 4     | 80.01 %         | 71.56 %         | 53.36 %     |
| 5     | 82.72 %         | 74.12 %         | 55.41 %     |
| 10    | 88.80 %         | 81.64 %         | 60.03 %     |

#### Summary

SecureBERT 2.0 outperforms both the original SecureBERT and ModernBERT on cybersecurity-specific and code-related tasks.

---

## Environmental Impact

- **Hardware Type:** 8× GPU cluster
- **Hours used:** [Information Not Available]
- **Cloud Provider:** [Information Not Available]
- **Compute Region:** [Information Not Available]
- **Carbon Emitted:** [Estimate Not Available]

Carbon footprint can be estimated using the methodology of [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

---

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** ModernBERT
- **Max sequence length:** 1024 tokens
- **Parameters:** 150 M
- **Objective:** Masked Language Modeling (MLM)
- **Tensor type:** F32
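
As a quick sanity check of these figures, the checkpoint can be inspected after loading; the repo id is again a placeholder, and the config attribute assumes a ModernBERT-style configuration:

```python
from transformers import AutoModelForMaskedLM

# Placeholder repo id; substitute the actual SecureBERT 2.0 checkpoint.
model = AutoModelForMaskedLM.from_pretrained("path/to/SecureBERT-2.0")

# Roughly 150M parameters are expected for this configuration.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")

# ModernBERT-style configs expose the maximum sequence length here.
print(model.config.max_position_embeddings)  # expected: 1024
```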

### Compute Infrastructure

- **Framework:** Transformers (PyTorch)
- **Precision:** fp32 (mixed precision not used)
- **Hardware:** 8 GPUs
- **Checkpoint Format:** Safetensors