Demo idea: threshold calibration and benign security text

#1
by armorerlabs - opened

Nice demo. A practical addition would be a small threshold-calibration panel with examples that intentionally sit near the decision boundary.

The cases I would include are:

  • obvious malicious injection
  • indirect injection embedded in a document/tool result
  • benign cybersecurity explanation that mentions “ignore previous instructions” as quoted text
  • sensitive-data request phrased politely
  • tool-use request that is safe in read-only mode but unsafe if write/network tools are enabled
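
To make the panel idea concrete, here is a minimal sketch of a threshold sweep over those cases. Everything in it is an assumption for illustration: the `Case` type, the `sweep` helper, and especially the scores, which are made-up placeholders and not outputs from this demo or any real detector. The `should_flag` labels are also hand-assigned. The tool-use case is labelled for read-only mode only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    name: str
    score: float       # placeholder detector score in [0, 1]; NOT a measured value
    should_flag: bool  # hand-assigned label for the calibration panel


# Hypothetical boundary cases mirroring the list above.
CASES = [
    Case("obvious malicious injection", 0.97, True),
    Case("indirect injection in document/tool result", 0.71, True),
    Case("benign security explanation quoting the phrase", 0.64, False),
    Case("polite sensitive-data request", 0.58, True),
    Case("tool-use request, read-only mode", 0.45, False),
]


def sweep(cases, thresholds):
    """For each threshold, list which cases are wrongly flagged or wrongly allowed."""
    rows = []
    for t in thresholds:
        false_pos = [c.name for c in cases if c.score >= t and not c.should_flag]
        false_neg = [c.name for c in cases if c.score < t and c.should_flag]
        rows.append((t, false_pos, false_neg))
    return rows


if __name__ == "__main__":
    for t, fp, fn in sweep(CASES, [0.5, 0.6, 0.7]):
        print(f"threshold={t:.2f} false_positives={fp} false_negatives={fn}")
```

With these placeholder numbers, no single threshold separates the quoted benign text from the polite sensitive-data request, which is exactly the kind of trade-off a calibration panel would let people see.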

For guardrail demos, the most interesting question is often not “can it catch the obvious attack?” but “does it preserve useful security/dev workflows while still flagging a risky request before the agent performs a side effect?”
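
One possible way to show the side-effect angle in the demo is a gate whose threshold depends on which tool capabilities are enabled. This is only a sketch under my own assumptions: the `gate` function, the capability names (`"read"`, `"write"`, `"network"`), and both threshold values are hypothetical and not taken from this demo.

```python
from enum import Enum
from typing import Iterable


class Decision(Enum):
    ALLOW = "allow"
    FLAG = "flag"


# Hypothetical capability names that can cause side effects.
SIDE_EFFECT_CAPS = frozenset({"write", "network"})


def gate(
    score: float,
    enabled_caps: Iterable[str],
    read_only_threshold: float = 0.7,    # assumed value, for illustration only
    side_effect_threshold: float = 0.4,  # assumed stricter value when side effects are possible
) -> Decision:
    """Flag a request before any tool call, using a stricter threshold
    when write or network capabilities are enabled."""
    has_side_effects = bool(SIDE_EFFECT_CAPS & set(enabled_caps))
    threshold = side_effect_threshold if has_side_effects else read_only_threshold
    return Decision.FLAG if score >= threshold else Decision.ALLOW


if __name__ == "__main__":
    # Same placeholder score, different tool configurations.
    print(gate(0.45, {"read"}))
    print(gate(0.45, {"read", "write"}))
```

The point of the design is that the same request can be allowed in read-only mode and flagged once write or network tools are on, so useful security/dev workflows are not blocked when nothing can be changed.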
