Papers
arxiv:2511.21448

The Phish, The Spam, and The Valid: Generating Feature-Rich Emails for Benchmarking LLMs

Published on Mar 20
Authors:
,
,

Abstract

A metadata-enriched generation framework produces diverse email variants with structured labels and attacker intent annotations for evaluating email security systems.

In this paper, we introduce a metadata-enriched generation framework (PhishFuzzer) that seeds real emails into Large Language Models (LLMs) to produce 23,100 diverse, structurally consistent email variants across controlled entity and length dimensions. Unlike prior corpora, our dataset features strict three-class labels (Phishing, Spam, Valid), provides full URL and attachment metadata, and annotates each email with attacker intent. Using this dataset, we benchmark two state-of-the-art LLMs (Qwen-2.5-72B and Gemini-3.1-Pro) under both Basic (body, subject) and Full (+URL, sender, attachment) settings. By applying formal confidence metrics (Task Success Rate and Confidence Index), we analyze model reliability, robustness against linguistic fuzzing, and the impact of structural metadata on detection accuracy. Our fully open-source framework and dataset provide a rigorous foundation for evaluating next-generation email security systems. To support open science, we make the PhishFuzzer Dataset, the generation scripts and prompts available on GitHub: https://github.com/DataPhish/PhishFuzzer

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2511.21448
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2511.21448 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2511.21448 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.