---
license: mit
datasets:
- perplexity-ai/browsesafe-bench
language:
- en
metrics:
- f1
- precision
- recall
base_model:
- Qwen/Qwen3-30B-A3B-Instruct-2507
library_name: transformers
tags:
- security
- prompt-injection
- ai-safety
- browser-agents
- html
---
# BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents
## Highlights
BrowseSafe is a specialized security model designed to protect AI browser agents from prompt injection attacks embedded in real-world web content. It forms the model-based layer of a multi-layered defense strategy, combining architectural and model-based safeguards against evolving attacks.
- **State-of-the-Art Detection**: Achieves a 90.4% F1 score on the BrowseSafe-Bench test set.
- **Real-Time Latency**: Optimized for agent loops, enabling async security checks without degrading user experience.
- **Robustness to Distractors**: Specifically trained to distinguish between malicious instructions and benign, structure-rich HTML "noise" (e.g., accessibility attributes, hidden form fields) that often confuses standard detectors.
- **Comprehensive Coverage**: Validated against 11 attack types with different security criticality levels, 9 injection strategies, 5 distractor types, 5 context-aware generation types, 5 domains, and 3 linguistic styles across 5 evaluation metrics, ensuring broad-spectrum defense capabilities.
## Model Overview
BrowseSafe is based on the Qwen3-30B-A3B architecture.
- **Type**: Causal language model (MoE) fine-tuned via SFT for binary classification
- **Training Stage**: Post-training (Fine-tuning on BrowseSafe-Bench)
- **Dataset**: [BrowseSafe-Bench](https://huggingface.co/datasets/perplexity-ai/browsesafe-bench)
- **Base Model**: [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- **Context Length**: Up to 16,384 tokens
- **Input**: Raw HTML content
- **Output**: Single-token classification ("yes" or "no")
- **License**: MIT
## Performance
We evaluated BrowseSafe on BrowseSafe-Bench, a realistic benchmark comprising 3,691 test samples of complex HTML payloads.
| Model Name | Config | F1 Score | Precision | Recall | Balanced <br>Accuracy | Refusals |
|-------------------|---------------|----------|-----------|--------|-----------------------|----------|
| PromptGuard-2 | 22M | 0.350 | 0.975 | 0.213 | 0.606 | 0 |
| | 86M | 0.360 | 0.983 | 0.221 | 0.611 | 0 |
| gpt-oss-safeguard | 20B / Low | 0.790 | 0.986 | 0.658 | 0.826 | 0 |
| | 20B / Medium | 0.796 | 0.994 | 0.664 | 0.832 | 0 |
| | 120B / Low | 0.730 | 0.994 | 0.577 | 0.788 | 0 |
| | 120B / Medium | 0.741 | 0.997 | 0.589 | 0.795 | 0 |
| GPT-5 mini | Minimal | 0.750 | 0.735 | 0.767 | 0.746 | 0 |
| | Low | 0.854 | 0.949 | 0.776 | 0.868 | 0 |
| | Medium | 0.853 | 0.945 | 0.777 | 0.866 | 0 |
| | High | 0.852 | 0.957 | 0.768 | 0.868 | 0 |
| GPT-5 | Minimal | 0.849 | 0.881 | 0.819 | 0.855 | 0 |
| | Low | 0.854 | 0.928 | 0.791 | 0.866 | 0 |
| | Medium | 0.855 | 0.930 | 0.792 | 0.867 | 0 |
| | High | 0.840 | 0.882 | 0.802 | 0.848 | 0 |
| Haiku 4.5 | No Thinking | 0.810 | 0.760 | 0.866 | 0.798 | 0 |
| | 1K | 0.809 | 0.755 | 0.872 | 0.795 | 0 |
| | 8K | 0.805 | 0.751 | 0.868 | 0.792 | 0 |
| | 32K | 0.808 | 0.760 | 0.863 | 0.796 | 0 |
| Sonnet 4.5 | No Thinking | 0.807 | 0.763 | 0.855 | 0.796 | 419 |
| | 1K | 0.862 | 0.929 | 0.803 | 0.872 | 613 |
| | 8K | 0.863 | 0.931 | 0.805 | 0.873 | 650 |
| | 32K | 0.863 | 0.935 | 0.801 | 0.873 | 669 |
| BrowseSafe | | 0.904 | 0.978 | 0.841 | 0.912 | 0 |
## Evaluation Metrics
BrowseSafe-Bench evaluates models across five metrics. Full details can be found in the [paper](https://arxiv.org/abs/2511.20597).
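As a reference for the table above, the sketch below gives the textbook definitions of the four headline metrics, treating "injection present" as the positive class. The benchmark's exact protocol (including how refusals are counted) is specified in the paper, so this is illustrative only.
```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard definitions, with "injection present" as the positive class."""
    precision = tp / (tp + fp)            # flagged pages that are truly malicious
    recall = tp / (tp + fn)               # malicious pages that get flagged
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)          # true negative rate
    balanced_accuracy = (recall + specificity) / 2
    return {
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "balanced_accuracy": balanced_accuracy,
    }
```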
## Quickstart
Qwen3-MoE support is included in recent versions of the Hugging Face `transformers` library; we recommend `transformers>=4.55.4`.
Below is a code snippet illustrating how to use BrowseSafe.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "perplexity-ai/browsesafe"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input: the raw HTML content of the page to screen
prompt = "<html>...</html>"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion; the model emits a single-token "yes"/"no" verdict
generated_ids = model.generate(**model_inputs, max_new_tokens=1)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```
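The decoded `content` can then be mapped to a boolean verdict. The label semantics below are an assumption ("yes" = injection detected), consistent with the VIOLATES aggregation described in the next section:
```python
# Assumption: a "yes" completion indicates a detected prompt injection;
# "no" indicates benign content.
is_injection = content.strip().lower().startswith("yes")
print("prompt injection detected:", is_injection)
```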
## Processing Long HTML Contexts
Web pages often exceed standard context windows. To handle this, BrowseSafe utilizes a chunking strategy (as described in the paper) to process content that exceeds the model's effective context limit.
- **Strategy**: Partition the document into non-overlapping chunks at token boundaries.
- **Aggregation**: Apply a conservative "OR" logic—if any single chunk is classified as VIOLATES, the entire document is flagged. This ensures that malicious payloads hidden deep within long pages are not missed.
A reference implementation can be found [here](https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/responsible_ai/prompt_guard/inference.py).
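As a concrete illustration, here is a minimal chunked-inference sketch reusing the `tokenizer` and `model` from the Quickstart. The helper names and the chunk size (chosen to leave headroom under the 16,384-token context for the chat template and the verdict token) are assumptions, not part of a released API:
```python
def classify_chunk(html_chunk: str) -> str:
    """Classify one chunk; returns the model's single-token completion."""
    messages = [{"role": "user", "content": html_chunk}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=1)
    new_tokens = generated[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

def classify_document(html: str, chunk_tokens: int = 15360) -> bool:
    # Partition the document into non-overlapping chunks at token boundaries.
    # (Decoding a token slice back to text is approximate but adequate here.)
    token_ids = tokenizer(html, add_special_tokens=False).input_ids
    chunks = [
        tokenizer.decode(token_ids[i : i + chunk_tokens])
        for i in range(0, len(token_ids), chunk_tokens)
    ]
    # Conservative "OR" aggregation: flag the document if any chunk is flagged.
    return any(
        classify_chunk(chunk).strip().lower().startswith("yes")
        for chunk in chunks
    )
```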
## Best Practices
To achieve optimal defense performance, be sure to pass the full HTML content to the model. Running the model on extracted text may result in performance degradation.
## Citation
If you use or reference this work, please cite:
```bibtex
@article{browsesafe2025,
  title         = {BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents},
  author        = {Kaiyuan Zhang and Mark Tenenholtz and Kyle Polley and Jerry Ma and Denis Yarats and Ninghui Li},
  eprint        = {2511.20597},
  archivePrefix = {arXiv},
  year          = {2025}
}
```