Deception Vector

Steering vector extracted in Experiment 1 that captures the semantic dimension of strategic deception in moral dilemmas.

Neighborhood — ranked by edge-count

concept

steering vectors
associated_with
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
Moral Dilemma Scenario
associated_with
Experimental condition where threat-based prompts create ethical dilemmas that trigger repetitive reasoning cycles leading to deception
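
As a rough illustration of what "adding perturbation vectors to activations" can look like, here is a minimal NumPy sketch. It is not the procedure used in Experiment 1: the function name `apply_steering`, the scaling factor `alpha`, the choice to normalize the direction, and the toy arrays are all assumptions made for illustration.

```python
import numpy as np

def apply_steering(activations, direction, alpha):
    """Add alpha * unit(direction) to every row of `activations`.

    activations: array of shape (num_tokens, hidden_dim)
    direction:   array of shape (hidden_dim,), e.g. an extracted vector
    alpha:       signed strength; a negative value pushes away from the direction
    Normalizing the direction is an assumption of this sketch, not of the source.
    """
    direction = np.asarray(direction, dtype=float)
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("steering direction must be non-zero")
    unit = direction / norm
    # Broadcasting adds the same perturbation to each token position.
    return np.asarray(activations, dtype=float) + alpha * unit

# Toy data (hypothetical): 2 token positions, hidden size 4.
acts = np.zeros((2, 4))
vec = np.array([3.0, 0.0, 4.0, 0.0])
steered = apply_steering(acts, vec, alpha=-2.0)
```

In a real model the addition would typically be applied to a chosen layer's hidden states during the forward pass; which layer and which strength work is an empirical question that this sketch does not address.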

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Refusal Vector · concept · 0.778
Single linear direction mediating refusal behavior in LLMs, shown by Arditi et al.; related to but distinct from the Assistant Axis
Sense Vectors · concept · 0.757
Vectors acquired during pretraining in Backpack LMs that have a multiplication effect on model generation
Function Vector · concept · 0.755
Type of steering vector enabling zero-shot task execution, cited from Todd et al. 2024
Persona Vectors (Chen et al.) · framework · 0.739
Prior framework for monitoring and controlling character traits in LLMs via activation directions; this paper extends it to 275 roles
Model Deception · concept · 0.730
LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation
AI Deception · concept · 0.728
Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
Strategic Deception · concept · 0.728
Central concept of the paper: deliberate, goal-driven deception where model reasoning contradicts outputs
concept vector · concept · 0.727
Computed directional vector in activation space representing a specific concept, used for injection experiments
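
The candidate list above (cosine ≥ 0.65, no typed edge) can be reproduced in spirit with a small filter like the following. This is a sketch under assumptions: the function names, the two-dimensional toy embeddings, and the resulting scores are invented for illustration and are not the embeddings or scores behind this page.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def untyped_candidates(target, embeddings, typed_neighbors, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it.

    embeddings:      dict mapping entity name -> embedding vector
    typed_neighbors: set of names already linked to `target` by a typed edge
    Returns (name, score) pairs sorted by descending similarity.
    """
    scored = []
    for name, emb in embeddings.items():
        if name == target or name in typed_neighbors:
            continue
        score = cosine(embeddings[target], emb)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings, not the real ones.
embeddings = {
    "Deception Vector": [1.0, 0.0],
    "steering vectors": [1.0, 0.1],   # already has a typed edge
    "Refusal Vector": [1.0, 0.5],
    "Unrelated Entity": [0.0, 1.0],
}
candidates = untyped_candidates(
    "Deception Vector", embeddings, typed_neighbors={"steering vectors"}
)
```

With these toy values only "Refusal Vector" survives the filter: "steering vectors" is excluded because it already has a typed edge, and "Unrelated Entity" falls below the threshold.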