---
license: mit
datasets:
- perplexity-ai/browsesafe-bench
language:
- en
metrics:
- f1
- precision
- recall
base_model:
- Qwen/Qwen3-30B-A3B-Instruct-2507
library_name: transformers
tags:
- security
- prompt-injection
- ai-safety
- browser-agents
- html
---

# BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents

## Highlights

BrowseSafe is a multi-layered defense strategy that combines architectural and model-based defenses against evolving prompt injection attacks. This release is its model component: a specialized security model that protects AI browser agents from prompt injection attacks embedded in real-world web content.

- **State-of-the-Art Detection**: Achieves a 90.4% F1 score on the BrowseSafe-Bench test set.

- **Real-Time Latency**: Optimized for agent loops, enabling asynchronous security checks without degrading the user experience (see the sketch after this list).

- **Robustness to Distractors**: Specifically trained to distinguish between malicious instructions and benign, structure-rich HTML "noise" (e.g., accessibility attributes, hidden form fields) that often confuses standard detectors.

- **Comprehensive Coverage**: Validated against 11 attack types with different security criticality levels, 9 injection strategies, 5 distractor types, 5 context-aware generation types, 5 domains, 3 linguistic styles, and 5 evaluation metrics, ensuring broad-spectrum defense capabilities.
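As a rough illustration of the asynchronous pattern mentioned above, here is a minimal sketch of running a safety check alongside an agent step; `agent_action` and `safety_check` are hypothetical callables standing in for your agent and a BrowseSafe inference call:

```python
import asyncio

async def guarded_step(page_html: str, agent_action, safety_check):
    """Start the agent's next action and the safety check concurrently,
    cancelling the action if the page is flagged (illustrative pattern)."""
    action = asyncio.create_task(agent_action(page_html))
    verdict = await safety_check(page_html)  # e.g., a BrowseSafe inference call
    if verdict == "yes":  # injection detected
        action.cancel()  # abort before the agent acts on tainted content
        raise RuntimeError("prompt injection detected; action aborted")
    return await action
```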

## Model Overview

BrowseSafe is based on the Qwen3-30B-A3B architecture.

- **Type**: Causal language model (MoE) fine-tuned via SFT for classification
- **Training Stage**: Post-training (Fine-tuning on BrowseSafe-Bench)
- **Dataset**: [BrowseSafe-Bench](https://huggingface.co/datasets/perplexity-ai/browsesafe-bench)
- **Base Model**: [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- **Context Length**: Up to 16,384 tokens
- **Input**: Raw HTML content
- **Output**: Single-token classification ("yes" or "no")
- **License**: MIT

## Performance

We evaluated BrowseSafe on BrowseSafe-Bench, a realistic benchmark comprising 3,691 test samples of complex HTML payloads.

| Model Name        | Config        | F1 Score | Precision | Recall | Balanced <br>Accuracy | Refusals |
|-------------------|---------------|----------|-----------|--------|-----------------------|----------|
| PromptGuard-2     | 22M           |    0.350 |     0.975 |  0.213 |                 0.606 |        0 |
|                   | 86M           |    0.360 |     0.983 |  0.221 |                 0.611 |        0 |
| gpt-oss-safeguard | 20B / Low     |    0.790 |     0.986 |  0.658 |                 0.826 |        0 |
|                   | 20B / Medium  |    0.796 |     0.994 |  0.664 |                 0.832 |        0 |
|                   | 120B / Low    |    0.730 |     0.994 |  0.577 |                 0.788 |        0 |
|                   | 120B / Medium |    0.741 |     0.997 |  0.589 |                 0.795 |        0 |
| GPT-5 mini        | Minimal       |    0.750 |     0.735 |  0.767 |                 0.746 |        0 |
|                   | Low           |    0.854 |     0.949 |  0.776 |                 0.868 |        0 |
|                   | Medium        |    0.853 |     0.945 |  0.777 |                 0.866 |        0 |
|                   | High          |    0.852 |     0.957 |  0.768 |                 0.868 |        0 |
| GPT-5             | Minimal       |    0.849 |     0.881 |  0.819 |                 0.855 |        0 |
|                   | Low           |    0.854 |     0.928 |  0.791 |                 0.866 |        0 |
|                   | Medium        |    0.855 |     0.930 |  0.792 |                 0.867 |        0 |
|                   | High          |    0.840 |     0.882 |  0.802 |                 0.848 |        0 |
| Haiku 4.5         | No Thinking   |    0.810 |     0.760 |  0.866 |                 0.798 |        0 |
|                   | 1K            |    0.809 |     0.755 |  0.872 |                 0.795 |        0 |
|                   | 8K            |    0.805 |     0.751 |  0.868 |                 0.792 |        0 |
|                   | 32K           |    0.808 |     0.760 |  0.863 |                 0.796 |        0 |
| Sonnet 4.5        | No Thinking   |    0.807 |     0.763 |  0.855 |                 0.796 |      419 |
|                   | 1K            |    0.862 |     0.929 |  0.803 |                 0.872 |      613 |
|                   | 8K            |    0.863 |     0.931 |  0.805 |                 0.873 |      650 |
|                   | 32K           |    0.863 |     0.935 |  0.801 |                 0.873 |      669 |
| BrowseSafe        |               |    0.904 |     0.978 |  0.841 |                 0.912 |        0 |

## Evaluation Metrics

BrowseSafe-Bench evaluates models across five metrics. Full details can be found in the [paper](https://arxiv.org/abs/2511.20597).

## Quickstart

The Qwen3-MoE code is included in recent versions of the Hugging Face `transformers` library; we recommend `transformers>=4.55.4`.

Below is a code snippet illustrating how to use BrowseSafe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "perplexity-ai/browsesafe"  # model repo (the benchmark dataset is perplexity-ai/browsesafe-bench)

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "<html>...</html>"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion (the classifier emits a single "yes"/"no" token)
generated_ids = model.generate(**model_inputs, max_new_tokens=1)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```

## Processing Long HTML Contexts

Web pages often exceed standard context windows. To handle this, BrowseSafe utilizes a chunking strategy (as described in the paper) to process content that exceeds the model's effective context limit.

- **Strategy**: Partition the document into non-overlapping chunks at token boundaries.
- **Aggregation**: Apply a conservative "OR" logic—if any single chunk is classified as VIOLATES, the entire document is flagged. This ensures that malicious payloads hidden deep within long pages are not missed.

A reference implementation can be found [here](https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/responsible_ai/prompt_guard/inference.py).
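For orientation, here is a minimal sketch of this chunk-and-aggregate loop, assuming the single-token "yes"/"no" verdict from the Quickstart and an illustrative 16,000-token chunk size (leaving headroom under the 16,384-token context for the chat template):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def classify_chunk(chunk_html: str, tokenizer, model) -> str:
    """Classify one HTML chunk; returns the model's single-token verdict."""
    messages = [{"role": "user", "content": chunk_html}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1)
    verdict = tokenizer.decode(
        out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    return verdict.strip().lower()

def classify_html(html: str, tokenizer, model, chunk_tokens: int = 16_000) -> bool:
    """Partition the HTML into non-overlapping token chunks and OR the verdicts."""
    ids = tokenizer(html, add_special_tokens=False).input_ids
    for start in range(0, len(ids), chunk_tokens):
        chunk_html = tokenizer.decode(ids[start:start + chunk_tokens])
        if classify_chunk(chunk_html, tokenizer, model) == "yes":
            return True  # conservative OR: one flagged chunk flags the page
    return False
```

The early return on the first "yes" keeps latency low on clearly malicious pages, while the OR aggregation preserves recall on long documents.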

## Best Practices

To achieve optimal defense performance, pass the full, raw HTML content to the model; running the model on extracted text alone may degrade detection performance.
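For example, a fetch-and-classify pipeline should hand the raw page source to the model rather than an extracted-text view. A minimal sketch (example.com is a placeholder; `requests` and BeautifulSoup are illustrative choices, not requirements):

```python
import requests
from bs4 import BeautifulSoup

# Raw HTML keeps the attributes, hidden fields, and comments where
# injected instructions often hide.
html = requests.get("https://example.com", timeout=10).text

# Anti-pattern: text extraction discards exactly that markup.
text_only = BeautifulSoup(html, "html.parser").get_text()

# Feed `html` (not `text_only`) into the Quickstart snippet above.
```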

## Citation

If you use or reference this work, please cite:

```bibtex
@article{browsesafe2025,
  title        = {BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents},
  author       = {Kaiyuan Zhang and Mark Tenenholtz and Kyle Polley and Jerry Ma and Denis Yarats and Ninghui Li},
  eprint       = {2511.20597},
  archivePrefix= {arXiv},
  year         = {2025}
}
```