concept
active
concept:adversarial-suffix-attack
Adversarial Suffix Attack
Optimization-based jailbreak method that appends an optimized string (the suffix) to a prompt to elicit harmful outputs.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Jailbreak Attack · associated_with · Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Competitive multi-agent setting with conflicting incentives and direct opposition via bidding and bluffing.
- Technique used in VPD to enforce mechanistic faithfulness of parameter decompositions.
- Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction.
- Procedure in VPD that actively searches for combinations that break the prediction of which subcomponents are unimportant, stress-testing the decomposition.
- Methodological claim that the adversarial ablation approach ensures decomposed components causally correspond to computation.
- Mean of bid divided by quartet value of the auctioned animal.
- Changing configuration to sample environment differently; minimizes free energy.
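
The "cosine ≥ 0.65 · no typed edge" filter behind the list above could be sketched roughly as follows. This is only an illustration: the function names (`cosine`, `similar_untyped`), the toy 2-D embeddings, and the entity keys are hypothetical, and the page does not say how its embeddings or similarity scores are actually computed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_untyped(target, embeddings, typed_neighbors, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge to target.

    embeddings: dict mapping entity name -> vector (hypothetical storage format).
    typed_neighbors: set of names that already have a typed relation to target.
    """
    results = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first
    return sorted(results, key=lambda pair: -pair[1])


# Toy data: assumed 2-D embeddings, purely for illustration.
toy_embeddings = {
    "adversarial-suffix-attack": [1.0, 0.0],
    "jailbreak-attack": [1.0, 1.0],      # similar, but already has a typed edge
    "adversarial-ablation": [1.0, 0.1],  # similar, no typed edge
    "unrelated-concept": [0.0, 1.0],     # orthogonal, below threshold
}
candidates = similar_untyped(
    "adversarial-suffix-attack", toy_embeddings, {"jailbreak-attack"}
)
print([name for name, _ in candidates])
```

With the toy data, only the entity that is both above the threshold and lacking a typed edge is returned; entities that already have a typed relation are excluded even when their similarity is high.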