---
license: mit
datasets:
- perplexity-ai/browsesafe-bench
language:
- en
metrics:
- f1
- precision
- recall
base_model:
- Qwen/Qwen3-30B-A3B-Instruct-2507
library_name: transformers
tags:
- security
- prompt-injection
- ai-safety
- browser-agents
- html
---

# BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents

## Highlights

BrowseSafe is a multi-layered defense strategy comprising both architectural and model-based defenses to protect against evolving prompt injection attacks. It is a specialized security model designed to protect AI browser agents from prompt injection attacks embedded in real-world web content.

- **State-of-the-Art Detection**: Achieves a 90.4% F1 score on the BrowseSafe-Bench test set.
- **Real-Time Latency**: Optimized for agent loops, enabling async security checks without degrading user experience.
- **Robustness to Distractors**: Specifically trained to distinguish between malicious instructions and benign, structure-rich HTML "noise" (e.g., accessibility attributes, hidden form fields) that often confuses standard detectors.
- **Comprehensive Coverage**: Validated against 11 attack types with different security criticality levels, 9 injection strategies, 5 distractor types, 5 context-aware generation types, 5 domains, 3 linguistic styles, and 5 evaluation metrics, ensuring broad-spectrum defense capabilities.

## Model Overview

BrowseSafe is based on the Qwen3-30B-A3B architecture.

- **Type**: Fine-tuned Causal Language Model (MoE) for SFT Classification
- **Training Stage**: Post-training (Fine-tuning on BrowseSafe-Bench)
- **Dataset**: [BrowseSafe-Bench](https://huggingface.co/datasets/perplexity-ai/browsesafe-bench)
- **Base Model**: [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- **Context Length**: Up to 16,384 tokens
- **Input**: Raw HTML content
- **Output**: Single token, "yes" or "no" classification
- **License**: MIT

## Performance

We evaluated BrowseSafe on BrowseSafe-Bench, a realistic benchmark comprising 3,691 test samples of complex HTML payloads.

| Model Name        | Config        | F1 Score | Precision | Recall | Balanced Accuracy | Refusals |
|-------------------|---------------|----------|-----------|--------|-------------------|----------|
| PromptGuard-2     | 22M           | 0.350    | 0.975     | 0.213  | 0.606             | 0        |
|                   | 86M           | 0.360    | 0.983     | 0.221  | 0.611             | 0        |
| gpt-oss-safeguard | 20B / Low     | 0.790    | 0.986     | 0.658  | 0.826             | 0        |
|                   | 20B / Medium  | 0.796    | 0.994     | 0.664  | 0.832             | 0        |
|                   | 120B / Low    | 0.730    | 0.994     | 0.577  | 0.788             | 0        |
|                   | 120B / Medium | 0.741    | 0.997     | 0.589  | 0.795             | 0        |
| GPT-5 mini        | Minimal       | 0.750    | 0.735     | 0.767  | 0.746             | 0        |
|                   | Low           | 0.854    | 0.949     | 0.776  | 0.868             | 0        |
|                   | Medium        | 0.853    | 0.945     | 0.777  | 0.866             | 0        |
|                   | High          | 0.852    | 0.957     | 0.768  | 0.868             | 0        |
| GPT-5             | Minimal       | 0.849    | 0.881     | 0.819  | 0.855             | 0        |
|                   | Low           | 0.854    | 0.928     | 0.791  | 0.866             | 0        |
|                   | Medium        | 0.855    | 0.930     | 0.792  | 0.867             | 0        |
|                   | High          | 0.840    | 0.882     | 0.802  | 0.848             | 0        |
| Haiku 4.5         | No Thinking   | 0.810    | 0.760     | 0.866  | 0.798             | 0        |
|                   | 1K            | 0.809    | 0.755     | 0.872  | 0.795             | 0        |
|                   | 8K            | 0.805    | 0.751     | 0.868  | 0.792             | 0        |
|                   | 32K           | 0.808    | 0.760     | 0.863  | 0.796             | 0        |
| Sonnet 4.5        | No Thinking   | 0.807    | 0.763     | 0.855  | 0.796             | 419      |
|                   | 1K            | 0.862    | 0.929     | 0.803  | 0.872             | 613      |
|                   | 8K            | 0.863    | 0.931     | 0.805  | 0.873             | 650      |
|                   | 32K           | 0.863    | 0.935     | 0.801  | 0.873             | 669      |
| BrowseSafe        |               | 0.904    | 0.978     | 0.841  | 0.912             | 0        |

## Evaluation Metrics

BrowseSafe-Bench evaluates models across five metrics. Full details can be found in the [paper](https://arxiv.org/abs/2511.20597).

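
The four rate-based columns in the performance table (F1, precision, recall, balanced accuracy) can be illustrated with the standard binary-classification definitions. The sketch below assumes the positive class is "prompt injection present"; the `Confusion` class, the helper names, and the counts are hypothetical and not taken from the paper, which remains the authoritative source for the exact definitions.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Hypothetical confusion counts; positive = prompt injection present."""
    tp: int  # malicious samples correctly flagged
    fp: int  # benign samples incorrectly flagged
    fn: int  # malicious samples missed
    tn: int  # benign samples correctly passed


def precision(c: Confusion) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def f1_from_rates(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def balanced_accuracy(c: Confusion) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    negatives = c.tn + c.fp
    tnr = c.tn / negatives if negatives else 0.0
    return (recall(c) + tnr) / 2


# Illustrative counts only (not benchmark data).
counts = Confusion(tp=80, fp=10, fn=20, tn=90)
print(round(f1_from_rates(precision(counts), recall(counts)), 3))  # 0.842
print(round(balanced_accuracy(counts), 3))                         # 0.85
```

As a consistency check, `f1_from_rates(0.978, 0.841)` rounds to 0.904, which matches the F1 score reported for BrowseSafe given its reported precision and recall.
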
## Quickstart

Support for Qwen3-MoE is included in recent releases of the Hugging Face `transformers` library. We recommend using `transformers>=4.55.4`.

Below is a code snippet illustrating how to use BrowseSafe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model repository id (assumed here; the "-bench" repository is the dataset)
model_name = "perplexity-ai/browsesafe"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "<html>...</html>"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(**model_inputs)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```

## Processing Long HTML Contexts

Web pages often exceed the model's context window. To handle this, BrowseSafe uses the chunking strategy described in the paper to process content that exceeds the model's effective context limit.

- **Strategy**: Partition the document into non-overlapping chunks at token boundaries.
- **Aggregation**: Apply a conservative "OR" logic: if any single chunk is classified as VIOLATES, the entire document is flagged. This ensures that malicious payloads hidden deep within long pages are not missed.

A reference implementation of this chunked inference pattern can be found in the [Llama Cookbook Prompt Guard inference script](https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/responsible_ai/prompt_guard/inference.py).

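
The chunk-and-aggregate logic can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the paper: `chunk_token_ids`, `document_is_flagged`, and the `classify_chunk` callback are hypothetical names, and the stub classifier stands in for a real call to the model.

```python
from typing import Callable, List, Sequence


def chunk_token_ids(token_ids: Sequence[int], max_tokens: int) -> List[List[int]]:
    """Split token ids into consecutive, non-overlapping chunks."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        list(token_ids[start:start + max_tokens])
        for start in range(0, len(token_ids), max_tokens)
    ]


def document_is_flagged(
    token_ids: Sequence[int],
    classify_chunk: Callable[[List[int]], bool],
    max_tokens: int,
) -> bool:
    """Conservative OR aggregation: one flagged chunk flags the document.

    any() stops at the first flagged chunk, so later chunks are skipped.
    An empty document produces no chunks and is therefore not flagged.
    """
    return any(
        classify_chunk(chunk)
        for chunk in chunk_token_ids(token_ids, max_tokens)
    )


# Stub classifier for illustration: flags any chunk containing token id 7.
def contains_marker(chunk: List[int]) -> bool:
    return 7 in chunk


print(chunk_token_ids(list(range(5)), 2))                        # [[0, 1], [2, 3], [4]]
print(document_is_flagged(list(range(10)), contains_marker, 4))  # True
```

In a real pipeline, `classify_chunk` would presumably decode each chunk back to text, run it through the Quickstart generation code, and compare the output token to the positive label. The chunk size would also need to leave room for the chat template tokens within the 16,384-token context length.
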
## Best Practices

For best detection performance, pass the full HTML content to the model. The model expects raw HTML as input, and running it on extracted text may degrade performance.

## Citation

If you use or reference this work, please cite:

```bibtex
@article{browsesafe2025,
  title         = {BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents},
  author        = {Kaiyuan Zhang and Mark Tenenholtz and Kyle Polley and Jerry Ma and Denis Yarats and Ninghui Li},
  eprint        = {arXiv:2511.20597},
  archivePrefix = {arXiv},
  year          = {2025}
}