Spaces:

RyanStudio
/

Mezzo-Prompt-Guard-Demo

Sleeping

Demo idea: threshold calibration and benign security text

by armorerlabs - opened 2 days ago

Nice demo. A practical addition would be a small threshold-calibration panel with examples that intentionally sit near the boundary.

The cases I would include are:

obvious malicious injection
indirect injection embedded in a document/tool result
benign cybersecurity explanation that mentions “ignore previous instructions” as quoted text
sensitive-data request phrased politely
tool-use request that is safe in read-only mode but unsafe if write/network tools are enabled

For guardrail demos, the most interesting question is often not “can it catch the obvious attack?” but “does it preserve useful security/dev workflows while still flagging the request before an agent takes a side effect?”

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment