Adversarial Suffix Attack

Optimization-based jailbreak method that appends an optimized suffix string to a prompt to elicit harmful outputs.

Neighborhood — ranked by edge-count

paper

concept

Jailbreak Attack
associated_with
Security attack that bypasses LLM safety alignment by suppressing deliberation or exploiting reflection inhibition.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

adversarial interaction · concept · 0.779
Competitive multi-agent setting with conflicting incentives and direct opposition via bidding and bluffing.
Adversarial ablation · method · 0.771
Technique used in VPD to enforce mechanistic faithfulness of parameter decompositions.
Adversarial Manipulation of Truthfulness · concept · 0.751
Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction.
Adversarial search for causally unimportant subcomponents · method · 0.706
Procedure in VPD that actively searches for combinations that break the prediction of which subcomponents are unimportant, stress-testing the decomposition.
Adversarial ablations enforce mechanistic faithfulness. · claim · 0.705
Methodological claim that the adversarial ablation approach ensures decomposed components causally correspond to computation.
Ingressing Mind · concept · 0.690
bid aggressiveness · method · 0.687
Mean of bid divided by quartet value of auctioned animal.
action · concept · 0.683
Changing configuration to sample environment differently; minimizes free energy.
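
The candidate list above follows from two conditions stated in its header: cosine similarity of at least 0.65 and no typed edge to this entity. A minimal sketch of such a filter is given below, assuming entities are stored as plain embedding vectors and typed edges as unordered name pairs. The function names, data layout, and toy vectors are hypothetical illustrations, not the page's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities similar to `target` that
    have no typed edge to it, sorted by descending similarity.

    embeddings  : dict mapping entity name -> list of floats (assumed layout)
    typed_edges : iterable of (name_a, name_b) pairs, treated as undirected
    threshold   : minimum cosine similarity; 0.65 mirrors the page header
    """
    # Entities already connected to the target by any typed relation.
    linked = set()
    for a, b in typed_edges:
        if a == target:
            linked.add(b)
        elif b == target:
            linked.add(a)

    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            candidates.append((name, round(score, 3)))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data: "D" is similar to "A" but already linked, so it is excluded.
toy_embeddings = {"A": [1.0, 0.0], "B": [1.0, 0.1], "C": [0.0, 1.0], "D": [1.0, 1.0]}
toy_edges = {("A", "D")}
result = untyped_neighbors("A", toy_embeddings, toy_edges)
print(result)
```

Entities returned by a filter of this kind are only candidates: a high score with no typed edge may indicate a missing relation, an unrecognized duplicate, or mere surface similarity in naming.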