---
license: mit
datasets:
- perplexity-ai/browsesafe-bench
language:
- en
metrics:
- f1
- precision
- recall
base_model:
- Qwen/Qwen3-30B-A3B-Instruct-2507
library_name: transformers
tags:
- security
- prompt-injection
- ai-safety
- browser-agents
- html
---

# BrowseSafe: Understanding and Preventing Prompt Injection Within User Agent Environment AI Browser Agents

## Highlights

BrowseSafe is a multi-layered defense strategy comprising both architectural and model-based defenses to protect against evolving prompt injection attacks. It is a specialized security model designed to protect AI browser agents from prompt injection attacks embedded in real-world web content.

- **State-of-the-Art Detection**: Achieves a 90.4% F1 score on the BrowseSafe-Bench test set.

- **Real-Time Latency**: Optimized for agent loops, enabling async security checks without degrading user experience.

- **Robustness to Distractors**: Specifically trained to distinguish between malicious instructions and benign, structure-rich HTML "noise" (e.g., accessibility attributes, hidden form fields) that often confuses standard detectors.

- **Comprehensive Coverage**: Validated against 11 attack types with different security criticality levels, 9 injection strategies, 5 distractor types, 5 contextaware generation types, 5 domains, 3 linguistic styles and 5 evaluation metrics, ensuring broad-spectrum defense capabilities.

## Model Overview

BrowseSafe is based on the Qwen3-30B-A3B architecture.

- **Type**: Fine-tuned Causal Language Model (MoE) for SFT Classification
- **Training Stage**: Post-training (Fine-tuning on BrowseSafe-Bench)
- **Dataset**: [BrowseSafe-Bench](https://huggingface.co/datasets/perplexity-ai/browsesafe-bench)
- **Base Model**: [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- **Context Length**: Up to 16,384 tokens
- **Input**: Raw HTML content
- **Output**: Single token, "yes" or "no" classification 
- **License**: MIT

## Performance

We evaluated BrowseSafe on BrowseSafe-Bench, a realistic benchmark comprising 3,691 test samples of complex HTML payloads.

| Model Name        | Config        | F1 Score | Precision | Recall | Balanced <br>Accuracy | Refusals |
|-------------------|---------------|----------|-----------|--------|-----------------------|----------|
| PromptGuard-2     | 22M           |    0.350 |     0.975 |  0.213 |                 0.606 |        0 |
|                   | 86M           |    0.360 |     0.983 |  0.221 |                 0.611 |        0 |
| gpt-oss-safeguard | 20B / Low     |    0.790 |     0.986 |  0.658 |                 0.826 |        0 |
|                   | 20B / Medium  |    0.796 |     0.994 |  0.664 |                 0.832 |        0 |
|                   | 120B / Low    |    0.730 |     0.994 |  0.577 |                 0.788 |        0 |
|                   | 120B / Medium |    0.741 |     0.997 |  0.589 |                 0.795 |        0 |
| GPT-5 mini        | Minimal       |    0.750 |     0.735 |  0.767 |                 0.746 |        0 |
|                   | Low           |    0.854 |     0.949 |  0.776 |                 0.868 |        0 |
|                   | Medium        |    0.853 |     0.945 |  0.777 |                 0.866 |        0 |
|                   | High          |    0.852 |     0.957 |  0.768 |                 0.868 |        0 |
| GPT-5             | Minimal       |    0.849 |     0.881 |  0.819 |                 0.855 |        0 |
|                   | Low           |    0.854 |     0.928 |  0.791 |                 0.866 |        0 |
|                   | Medium        |    0.855 |     0.930 |  0.792 |                 0.867 |        0 |
|                   | High          |    0.840 |     0.882 |  0.802 |                 0.848 |        0 |
| Haiku 4.5         | No Thinking   |    0.810 |     0.760 |  0.866 |                 0.798 |        0 |
|                   | 1K            |    0.809 |     0.755 |  0.872 |                 0.795 |        0 |
|                   | 8K            |    0.805 |     0.751 |  0.868 |                 0.792 |        0 |
|                   | 32K           |    0.808 |     0.760 |  0.863 |                 0.796 |        0 |
| Sonnet 4.5        | No Thinking   |    0.807 |     0.763 |  0.855 |                 0.796 |      419 |
|                   | 1K            |    0.862 |     0.929 |  0.803 |                 0.872 |      613 |
|                   | 8K            |    0.863 |     0.931 |  0.805 |                 0.873 |      650 |
|                   | 32K           |    0.863 |     0.935 |  0.801 |                 0.873 |      669 |
| BrowseSafe        |               |    0.904 |     0.978 |  0.841 |                 0.912 |        0 |

## Evaluation Metrics

BrowseSafe-Bench evaluates models across five metrics. Full details can be found in the [paper](https://arxiv.org/abs/2511.20597).

## Quickstart

The code of Qwen3-MoE is in the latest Hugging Face transformers library. We recommend using `transformers>=4.55.4`.

Below is a code snippet illustrating how to use BrowseSafe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "perplexity-ai/browsesafe-bench"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "<html>...</html>"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(**model_inputs)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```
## Processing Long HTML Contexts

Web pages often exceed standard context windows. To handle this, BrowseSafe utilizes a chunking strategy (as described in the paper) to process content that exceeds the model's effective context limit.

- **Strategy**: Partition the document into non-overlapping chunks at token boundaries.
- **Aggregation**: Apply a conservative "OR" logic—if any single chunk is classified as VIOLATES, the entire document is flagged. This ensures that malicious payloads hidden deep within long pages are not missed.

A reference implementation can be found [here](https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/responsible_ai/prompt_guard/inference.py).

## Best Practices

To achieve optimal defense performance, be sure to pass the full HTML content to the model. Running the model on extracted text may result in performance degradation.

## Citation

If you use or reference this work, please cite:

```bibtex
@article{browsesafe2025,
  title        = {BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents},
  author       = {Kaiyuan Zhang and Mark Tenenholtz and Kyle Polley and Jerry Ma and Denis Yarats and Ninghui Li},
  eprint       = {arXiv:2511.20597},
  archivePrefix= {arXiv},
  year         = {2025}
}