concept
active
concept:deception-vector
Deception Vector
Steering vector extracted in Experiment 1 that captures the semantic dimension of strategic deception in moral dilemmas.
Neighborhood — ranked by edge-count
Concepts (2)
- steering vectors (associated_with): A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Moral Dilemma Scenario (associated_with): Experimental condition where threat-based prompts create ethical dilemmas that trigger repetitive reasoning cycles leading to deception.
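The "adding perturbation vectors to activations" idea in the steering vectors entry can be illustrated with a minimal sketch. It assumes a difference-of-means extraction, which is a common recipe but is not confirmed by this page as the method used in Experiment 1. The function names, the `alpha` scale, and the toy activation values are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def extract_steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of mean activations (assumed recipe, not the page's stated method)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(activation: Vector, direction: Vector, alpha: float) -> Vector:
    """Add a scaled perturbation vector to a single activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Toy 3-dimensional activations; the values are made up for illustration.
deceptive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
honest = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

direction = extract_steering_vector(deceptive, honest)

# A negative alpha pushes the activation away from the extracted direction.
steered = steer([0.5, 0.5, 0.5], direction, alpha=-1.0)
```

In this toy setup only the first coordinate separates the two groups, so the extracted direction is nonzero only there, and steering changes only that coordinate.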
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Single linear direction mediating refusal behavior in LLMs, shown by Arditi et al.; related to but distinct from the Assistant Axis
- Vectors acquired during pretraining in Backpack LMs that have a multiplication effect on model generation
- Type of steering vector enabling zero-shot task execution, cited from Todd et al. 2024
- Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles
- LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation
- Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
- Central concept of the paper: deliberate, goal-driven deception where model reasoning contradicts outputs
- Computed directional vector in activation space representing a specific concept, used for injection experiments
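The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustration under stated assumptions, not the page's actual implementation: the entity ids other than `deception-vector`, the 2-dimensional embeddings, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(
    target: str,
    embeddings: Dict[str, List[float]],
    typed_neighbors: Set[str],
    threshold: float = 0.65,
) -> List[Tuple[str, float]]:
    """Entities at or above the threshold that have no typed edge to the target."""
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue  # skip the entity itself and already-linked neighbors
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical embeddings; only "deception-vector" is an id taken from this page.
embeddings = {
    "deception-vector": [1.0, 0.0],
    "refusal-direction": [1.0, 1.0],
    "steering-vectors": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
}
typed = {"steering-vectors"}  # already linked via associated_with

result = similar_without_edge("deception-vector", embeddings, typed)
```

Here `steering-vectors` is excluded despite its high similarity because it already has a typed edge, and `unrelated` falls below the threshold.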