Adversarial Manipulation of Truthfulness

Risk that multiple truth directions enable attacks that shift outputs without triggering the primary truth direction
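The risk can be illustrated with a toy calculation. This is a minimal sketch, not taken from the source: it assumes two hypothetical truth directions that are exactly orthogonal, and a monitor that reads only the projection onto the primary one. The vectors, the `shift` value, and all names are illustrative assumptions.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Assumption: two orthonormal "truth directions" in a toy 3-d activation space.
primary = [1.0, 0.0, 0.0]
secondary = [0.0, 1.0, 0.0]

# Hypothetical activation and a shift applied only along the secondary direction.
activation = [0.5, 0.2, -0.1]
shift = 2.0
perturbed = [a + shift * s for a, s in zip(activation, secondary)]

# Projections before and after the shift.
primary_before = dot(activation, primary)
primary_after = dot(perturbed, primary)
secondary_before = dot(activation, secondary)
secondary_after = dot(perturbed, secondary)

# The primary projection is unchanged while the secondary one moves by `shift`.
print(primary_before, primary_after)
print(secondary_before, secondary_after)
```

Under these assumptions, a detector that checks only the primary projection sees no change. If the directions are not orthogonal, the shift would leak into the primary projection in proportion to their overlap.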

Neighborhood — ranked by edge-count

claim

Model Deception · concept · extends
LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

adversarial interaction · concept · 0.787
Competitive multi-agent setting with conflicting incentives and direct opposition via bidding and bluffing.
truthfulness · concept · 0.786
A correctness condition requiring assertions to be true.
Adversarial ablation · method · 0.767
Technique used in VPD to enforce mechanistic faithfulness of parameter decompositions.
Adversarial ablations enforce mechanistic faithfulness. · claim · 0.762
Methodological claim that the adversarial ablation approach ensures decomposed components causally correspond to computation.
Fact-Based Deception Under Coercive Circumstances · framework · 0.760
First experimental paradigm inducing and detecting verifiable lies under external coercion using threat-based prompts.
Lying and Deception Evaluation · method · 0.754
Sampling responses to direct questions about model views to measure rate of deceptive responses.
Direct Manipulation · concept · 0.754
Special case of immediate feedback loop where user interacts with artifacts in a lifelike manner, typically through cursor or finger-based dragging.
Adversarial Suffix Attack · concept · 0.751
Optimization-based jailbreak method appending strings to prompts to elicit harmful outputs.
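
The selection rule for the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustration under stated assumptions, not the system's actual implementation: the entity names, the two-dimensional embeddings, the `typed_edges` mapping, and the function `untyped_neighbors` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs with cosine >= threshold and no typed edge
    to `target`, sorted by score, highest first."""
    linked = typed_edges.get(target, set())
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): "linked" is similar but already has a typed edge.
embeddings = {
    "target": [1.0, 0.0],
    "near": [0.8, 0.6],
    "far": [0.0, 1.0],
    "linked": [1.0, 0.1],
}
typed_edges = {"target": {"linked"}}

candidates = untyped_neighbors("target", embeddings, typed_edges)
print(candidates)
```

Only `near` passes both filters here: `far` falls below the threshold and `linked` is excluded because it already has a typed edge. The sketch does not handle zero-length vectors.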