AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models • Paper 2506.14682 • Published Jun 17, 2025
PentestJudge: Judging Agent Behavior Against Operational Requirements • Paper 2508.02921 • Published Aug 4, 2025