concept
active
concept:adversarial-manipulation-of-truthfulness
Adversarial Manipulation of Truthfulness
Risk that, because truthfulness is mediated by multiple directions, an attack can shift model outputs without triggering the primary truth direction.
Neighborhood — ranked by edge-count
Claims (1)
claim
- Truthful behavior in LLMs is not confined to a single linear axis; multiple orthogonal directions can independently mediate it · associated_with · Central interpretive claim of the paper
Concepts (1)
concept
- Model Deception · extends · LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Competitive multi-agent setting with conflicting incentives and direct opposition via bidding and bluffing.
- A correctness condition requiring assertions to be true.
- Technique used in VPD to enforce mechanistic faithfulness of parameter decompositions.
- Methodological claim that the adversarial ablation approach ensures decomposed components causally correspond to computation.
- First experimental paradigm inducing and detecting verifiable lies under external coercion using threat-based prompts.
- Sampling responses to direct questions about model views to measure rate of deceptive responses.
- Special case of immediate feedback loop where user interacts with artifacts in a lifelike manner, typically through cursor or finger-based dragging.
- Optimization-based jailbreak method appending strings to prompts to elicit harmful outputs.
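
As a toy illustration of the risk this entry describes, the sketch below shows how a shift along a second direction can change a truthfulness readout while a monitor that watches only the primary direction sees nothing. Everything here is an assumption made for illustration: the three-dimensional vectors, the names `primary`, `secondary`, `monitor_score`, and `truthfulness_readout`, and the additive readout are hypothetical and are not taken from the paper or from any real model.

```python
# Toy illustration only: the vectors and the readout are made up, not measured from a model.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def add_scaled(h, d, alpha):
    """Return h + alpha * d, i.e. a shift of h along direction d."""
    return [x + alpha * y for x, y in zip(h, d)]

primary = [1.0, 0.0, 0.0]    # hypothetical primary truth direction
secondary = [0.0, 1.0, 0.0]  # hypothetical second truth direction, orthogonal to primary

def monitor_score(h):
    """A detector that reads only the projection onto the primary direction."""
    return dot(h, primary)

def truthfulness_readout(h):
    """Hypothetical downstream behavior that depends on both directions."""
    return dot(h, primary) + dot(h, secondary)

h = [2.0, 1.0, 0.5]                          # hypothetical hidden state
h_attacked = add_scaled(h, secondary, -4.0)  # shift along the secondary direction only

monitor_before = monitor_score(h)
monitor_after = monitor_score(h_attacked)
readout_before = truthfulness_readout(h)
readout_after = truthfulness_readout(h_attacked)

print(monitor_before, monitor_after)  # the monitor sees no change
print(readout_before, readout_after)  # the readout changes sign
```

Because `secondary` is orthogonal to `primary`, the shift leaves the monitored projection untouched, which is the sense in which such an attack would not trigger the primary truth direction in this simplified setting.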