--- license: mit datasets: - perplexity-ai/browsesafe-bench language: - en metrics: - f1 - precision - recall base_model: - Qwen/Qwen3-30B-A3B-Instruct-2507 library_name: transformers tags: - security - prompt-injection - ai-safety - browser-agents - html --- # BrowseSafe: Understanding and Preventing Prompt Injection Within User Agent Environment AI Browser Agents ## Highlights BrowseSafe is a multi-layered defense strategy comprising both architectural and model-based defenses to protect against evolving prompt injection attacks. It is a specialized security model designed to protect AI browser agents from prompt injection attacks embedded in real-world web content. - **State-of-the-Art Detection**: Achieves a 90.4% F1 score on the BrowseSafe-Bench test set. - **Real-Time Latency**: Optimized for agent loops, enabling async security checks without degrading user experience. - **Robustness to Distractors**: Specifically trained to distinguish between malicious instructions and benign, structure-rich HTML "noise" (e.g., accessibility attributes, hidden form fields) that often confuses standard detectors. - **Comprehensive Coverage**: Validated against 11 attack types with different security criticality levels, 9 injection strategies, 5 distractor types, 5 contextaware generation types, 5 domains, 3 linguistic styles and 5 evaluation metrics, ensuring broad-spectrum defense capabilities. ## Model Overview BrowseSafe is based on the Qwen3-30B-A3B architecture. - **Type**: Fine-tuned Causal Language Model (MoE) for SFT Classification - **Training Stage**: Post-training (Fine-tuning on BrowseSafe-Bench) - **Dataset**: [BrowseSafe-Bench](https://huggingface.co/datasets/perplexity-ai/browsesafe-bench) - **Base Model**: [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) - **Context Length**: Up to 16,384 tokens - **Input**: Raw HTML content - **Output**: Single token, "yes" or "no" classification - **License**: MIT ## Performance We evaluated BrowseSafe on BrowseSafe-Bench, a realistic benchmark comprising 3,691 test samples of complex HTML payloads. | Model Name | Config | F1 Score | Precision | Recall | Balanced
Accuracy | Refusals | |-------------------|---------------|----------|-----------|--------|-----------------------|----------| | PromptGuard-2 | 22M | 0.350 | 0.975 | 0.213 | 0.606 | 0 | | | 86M | 0.360 | 0.983 | 0.221 | 0.611 | 0 | | gpt-oss-safeguard | 20B / Low | 0.790 | 0.986 | 0.658 | 0.826 | 0 | | | 20B / Medium | 0.796 | 0.994 | 0.664 | 0.832 | 0 | | | 120B / Low | 0.730 | 0.994 | 0.577 | 0.788 | 0 | | | 120B / Medium | 0.741 | 0.997 | 0.589 | 0.795 | 0 | | GPT-5 mini | Minimal | 0.750 | 0.735 | 0.767 | 0.746 | 0 | | | Low | 0.854 | 0.949 | 0.776 | 0.868 | 0 | | | Medium | 0.853 | 0.945 | 0.777 | 0.866 | 0 | | | High | 0.852 | 0.957 | 0.768 | 0.868 | 0 | | GPT-5 | Minimal | 0.849 | 0.881 | 0.819 | 0.855 | 0 | | | Low | 0.854 | 0.928 | 0.791 | 0.866 | 0 | | | Medium | 0.855 | 0.930 | 0.792 | 0.867 | 0 | | | High | 0.840 | 0.882 | 0.802 | 0.848 | 0 | | Haiku 4.5 | No Thinking | 0.810 | 0.760 | 0.866 | 0.798 | 0 | | | 1K | 0.809 | 0.755 | 0.872 | 0.795 | 0 | | | 8K | 0.805 | 0.751 | 0.868 | 0.792 | 0 | | | 32K | 0.808 | 0.760 | 0.863 | 0.796 | 0 | | Sonnet 4.5 | No Thinking | 0.807 | 0.763 | 0.855 | 0.796 | 419 | | | 1K | 0.862 | 0.929 | 0.803 | 0.872 | 613 | | | 8K | 0.863 | 0.931 | 0.805 | 0.873 | 650 | | | 32K | 0.863 | 0.935 | 0.801 | 0.873 | 669 | | BrowseSafe | | 0.904 | 0.978 | 0.841 | 0.912 | 0 | ## Evaluation Metrics BrowseSafe-Bench evaluates models across five metrics. Full details can be found in the [paper](https://arxiv.org/abs/2511.20597). ## Quickstart The code of Qwen3-MoE is in the latest Hugging Face transformers library. We recommend using `transformers>=4.55.4`. Below is a code snippet illustrating how to use BrowseSafe. ```python from transformers import AutoModelForCausalLM, AutoTokenizer model_name = "perplexity-ai/browsesafe-bench" # load the tokenizer and the model tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained( model_name, torch_dtype="auto", device_map="auto" ) # prepare the model input prompt = "..." messages = [ {"role": "user", "content": prompt} ] text = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True, ) model_inputs = tokenizer([text], return_tensors="pt").to(model.device) # conduct text completion generated_ids = model.generate(**model_inputs) output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() content = tokenizer.decode(output_ids, skip_special_tokens=True) print("content:", content) ``` ## Processing Long HTML Contexts Web pages often exceed standard context windows. To handle this, BrowseSafe utilizes a chunking strategy (as described in the paper) to process content that exceeds the model's effective context limit. - **Strategy**: Partition the document into non-overlapping chunks at token boundaries. - **Aggregation**: Apply a conservative "OR" logic—if any single chunk is classified as VIOLATES, the entire document is flagged. This ensures that malicious payloads hidden deep within long pages are not missed. A reference implementation can be found [here](https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/responsible_ai/prompt_guard/inference.py). ## Best Practices To achieve optimal defense performance, be sure to pass the full HTML content to the model. Running the model on extracted text may result in performance degradation. ## Citation If you use or reference this work, please cite: ```bibtex @article{browsesafe2025, title = {BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents}, author = {Kaiyuan Zhang and Mark Tenenholtz and Kyle Polley and Jerry Ma and Denis Yarats and Ninghui Li}, eprint = {arXiv:2511.20597}, archivePrefix= {arXiv}, year = {2025} }