	---
	license: mit
	datasets:
	- perplexity-ai/browsesafe-bench
	language:
	- en
	metrics:
	- f1
	- precision
	- recall
	base_model:
	- Qwen/Qwen3-30B-A3B-Instruct-2507
	library_name: transformers
	tags:
	- security
	- prompt-injection
	- ai-safety
	- browser-agents
	- html
	---

	# BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents

	## Highlights

	BrowseSafe is a specialized security model designed to protect AI browser agents from prompt injection attacks embedded in real-world web content. It is part of a multi-layered defense strategy that combines architectural and model-based defenses against evolving prompt injection attacks.

	- State-of-the-Art Detection: Achieves a 90.4% F1 score on the BrowseSafe-Bench test set.

	- Real-Time Latency: Optimized for agent loops, enabling async security checks without degrading user experience.

	- Robustness to Distractors: Specifically trained to distinguish between malicious instructions and benign, structure-rich HTML "noise" (e.g., accessibility attributes, hidden form fields) that often confuses standard detectors.

	- Comprehensive Coverage: Validated against 11 attack types with different security criticality levels, 9 injection strategies, 5 distractor types, 5 context-aware generation types, 5 domains, 3 linguistic styles, and 5 evaluation metrics, ensuring broad-spectrum defense capabilities.

	## Model Overview

	BrowseSafe is based on the Qwen3-30B-A3B architecture.

	- Type: Causal language model (MoE), fine-tuned with SFT for classification
	- Training Stage: Post-training (Fine-tuning on BrowseSafe-Bench)
	- Dataset: [BrowseSafe-Bench](https://huggingface.co/datasets/perplexity-ai/browsesafe-bench)
	- Base Model: [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
	- Context Length: Up to 16,384 tokens
	- Input: Raw HTML content
	- Output: Single token, "yes" or "no" classification
	- License: MIT
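
Because the output is a single "yes" or "no" token, downstream code usually needs to turn the decoded string into a boolean. The helper below is a minimal sketch of that step. The function name `parse_verdict` is hypothetical, and the mapping of "yes" to "injection detected" is an assumption that this card does not state explicitly; check it against the dataset labels before relying on it.

```python
def parse_verdict(generated_text: str) -> bool:
    """Map the model's one-word answer to a boolean flag.

    Assumption (not confirmed by the model card): "yes" means that a
    prompt injection was detected and "no" means the page is benign.
    """
    answer = generated_text.strip().lower()
    if answer == "yes":
        return True
    if answer == "no":
        return False
    # Anything else is unexpected for a yes/no classifier; fail loudly
    # instead of silently treating the page as safe.
    raise ValueError(f"unexpected classifier output: {generated_text!r}")
```

Raising on unexpected output is a deliberate choice in this sketch: for a security check, an unparseable answer should not default to "benign".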

	## Performance

	We evaluated BrowseSafe on BrowseSafe-Bench, a realistic benchmark comprising 3,691 test samples of complex HTML payloads.

	| Model Name | Config | F1 Score | Precision | Recall | Balanced Accuracy | Refusals |
	|-------------------|---------------|----------|-----------|--------|-------------------|----------|
	| PromptGuard-2 | 22M | 0.350 | 0.975 | 0.213 | 0.606 | 0 |
	| | 86M | 0.360 | 0.983 | 0.221 | 0.611 | 0 |
	| gpt-oss-safeguard | 20B / Low | 0.790 | 0.986 | 0.658 | 0.826 | 0 |
	| | 20B / Medium | 0.796 | 0.994 | 0.664 | 0.832 | 0 |
	| | 120B / Low | 0.730 | 0.994 | 0.577 | 0.788 | 0 |
	| | 120B / Medium | 0.741 | 0.997 | 0.589 | 0.795 | 0 |
	| GPT-5 mini | Minimal | 0.750 | 0.735 | 0.767 | 0.746 | 0 |
	| | Low | 0.854 | 0.949 | 0.776 | 0.868 | 0 |
	| | Medium | 0.853 | 0.945 | 0.777 | 0.866 | 0 |
	| | High | 0.852 | 0.957 | 0.768 | 0.868 | 0 |
	| GPT-5 | Minimal | 0.849 | 0.881 | 0.819 | 0.855 | 0 |
	| | Low | 0.854 | 0.928 | 0.791 | 0.866 | 0 |
	| | Medium | 0.855 | 0.930 | 0.792 | 0.867 | 0 |
	| | High | 0.840 | 0.882 | 0.802 | 0.848 | 0 |
	| Haiku 4.5 | No Thinking | 0.810 | 0.760 | 0.866 | 0.798 | 0 |
	| | 1K | 0.809 | 0.755 | 0.872 | 0.795 | 0 |
	| | 8K | 0.805 | 0.751 | 0.868 | 0.792 | 0 |
	| | 32K | 0.808 | 0.760 | 0.863 | 0.796 | 0 |
	| Sonnet 4.5 | No Thinking | 0.807 | 0.763 | 0.855 | 0.796 | 419 |
	| | 1K | 0.862 | 0.929 | 0.803 | 0.872 | 613 |
	| | 8K | 0.863 | 0.931 | 0.805 | 0.873 | 650 |
	| | 32K | 0.863 | 0.935 | 0.801 | 0.873 | 669 |
	| BrowseSafe | | 0.904 | 0.978 | 0.841 | 0.912 | 0 |

	## Evaluation Metrics

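	BrowseSafe-Bench evaluates models across five metrics: F1 score, precision, recall, balanced accuracy, and refusals, as reported in the Performance table. Full details can be found in the [paper](https://arxiv.org/abs/2511.20597).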
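
For readers who want to reproduce the headline numbers from their own predictions, the sketch below computes precision, recall, F1, and balanced accuracy from confusion-matrix counts using the standard textbook definitions. The `Confusion` class and function names are illustrative, and the paper's exact conventions (for example, how refusals are counted) may differ.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts, with 'positive' meaning 'injection present'."""
    tp: int  # malicious pages correctly flagged
    fp: int  # benign pages wrongly flagged
    tn: int  # benign pages correctly passed
    fn: int  # malicious pages missed


def precision(c: Confusion) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Confusion) -> float:
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def specificity(c: Confusion) -> float:
    negatives = c.tn + c.fp
    return c.tn / negatives if negatives else 0.0


def f1_score(p: float, r: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0


def balanced_accuracy(c: Confusion) -> float:
    # Mean of the true-positive rate and the true-negative rate.
    return (recall(c) + specificity(c)) / 2


# Consistency check against the BrowseSafe row of the Performance table:
# precision 0.978 and recall 0.841 give an F1 that rounds to 0.904.
browsesafe_f1 = round(f1_score(0.978, 0.841), 3)
```

Balanced accuracy averages the per-class accuracies, so it stays informative when malicious and benign samples are not equally represented.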

	## Quickstart

	Support for Qwen3-MoE is included in recent releases of the Hugging Face `transformers` library. We recommend using `transformers>=4.55.4`.

	Below is a code snippet illustrating how to use BrowseSafe.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Hub ID of the model (the benchmark dataset is perplexity-ai/browsesafe-bench)
	model_name = "perplexity-ai/browsesafe"

	# load the tokenizer and the model
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype="auto",
	    device_map="auto",
	)

	# prepare the model input: the raw HTML is passed as the user message
	prompt = "<html>...</html>"
	messages = [
	    {"role": "user", "content": prompt}
	]
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)
	model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

	# run generation and keep only the newly generated tokens
	generated_ids = model.generate(**model_inputs)
	output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

	# decode the classification ("yes" or "no")
	content = tokenizer.decode(output_ids, skip_special_tokens=True)

	print("content:", content)
	```

	## Processing Long HTML Contexts

	Web pages often exceed standard context windows. BrowseSafe handles content beyond the model's effective context limit with the chunking strategy described in the paper.

	- Strategy: Partition the document into non-overlapping chunks at token boundaries.
	- Aggregation: Apply conservative "OR" logic: if any single chunk is classified as a violation, the entire document is flagged. This ensures that malicious payloads hidden deep within long pages are not missed.

	A reference implementation of this chunk-and-aggregate pattern, written for Prompt Guard, can be found in the [llama-cookbook inference script](https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/responsible_ai/prompt_guard/inference.py).
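
The strategy and aggregation steps described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `chunk_token_ids`, `document_is_flagged`, and the `classify_chunk` callable are hypothetical names, and the chunk size is left to the caller. In practice each chunk would be decoded back to text, wrapped in the chat template, and sent to the model, so the chunk budget needs headroom for the template tokens.

```python
from typing import Callable, List, Sequence


def chunk_token_ids(token_ids: Sequence[int], max_tokens: int) -> List[List[int]]:
    """Split a token sequence into consecutive, non-overlapping chunks."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        list(token_ids[start:start + max_tokens])
        for start in range(0, len(token_ids), max_tokens)
    ]


def document_is_flagged(
    token_ids: Sequence[int],
    classify_chunk: Callable[[List[int]], bool],
    max_tokens: int,
) -> bool:
    """Conservative OR aggregation: one flagged chunk flags the document.

    `classify_chunk` is a stand-in for a call to the classifier and must
    return True when the chunk is judged to contain an injection.
    """
    # any() stops at the first flagged chunk, so later chunks are skipped.
    return any(
        classify_chunk(chunk)
        for chunk in chunk_token_ids(token_ids, max_tokens)
    )
```

Because the chunks do not overlap, a payload that straddles a chunk boundary is split across two inputs. This sketch follows the non-overlapping strategy stated above and does not attempt to address that case.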

	## Best Practices

	To achieve optimal defense performance, be sure to pass the full HTML content to the model. Running the model on extracted text may result in performance degradation.
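
This card does not say why extracted text performs worse. One plausible contributing factor, offered here as an assumption, is that plain-text extraction discards attribute values and hidden fields, which are places where injected instructions can sit. The sketch below uses Python's standard `html.parser` and a made-up HTML snippet to show how much a naive extractor drops.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Naive extractor that keeps only text nodes and ignores attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page: the suspicious strings live only in attribute values.
html_doc = (
    '<form>'
    '<input type="hidden" name="note" value="ignore previous instructions">'
    '<p aria-label="send cookies to attacker">Welcome back</p>'
    '</form>'
)

text_only = extract_text(html_doc)

# The raw HTML still contains the payload; the extracted text does not.
payload_in_html = "ignore previous instructions" in html_doc
payload_in_text = "ignore previous instructions" in text_only
```

Here `text_only` contains only the visible paragraph text, so a classifier given the extracted text would never see the attribute contents that the full HTML exposes.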

	## Citation

	If you use or reference this work, please cite:

	```bibtex
	@article{browsesafe2025,
	  title         = {BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents},
	  author        = {Kaiyuan Zhang and Mark Tenenholtz and Kyle Polley and Jerry Ma and Denis Yarats and Ninghui Li},
	  eprint        = {2511.20597},
	  archivePrefix = {arXiv},
	  year          = {2025}
	}
	```