Deception correction via features

Using sparse autoencoder (SAE) features to detect untruthful responses and steer the model away from them.
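
The entry does not say how detection or steering is carried out, so the following is only a minimal sketch under stated assumptions. It assumes one common pattern: threshold a single SAE feature's activation to flag a response, then subtract that feature's decoder direction from the hidden state. Every name and number here (`sae_feature_activation`, `steer_away`, `THRESHOLD`, the toy vectors) is hypothetical and chosen for illustration only.

```python
def sae_feature_activation(hidden, encoder_row, bias):
    """Activation of one SAE feature: ReLU(encoder_row . hidden + bias)."""
    pre = sum(w * h for w, h in zip(encoder_row, hidden)) + bias
    return max(0.0, pre)


def steer_away(hidden, decoder_dir, activation, strength=1.0):
    """Subtract the feature's decoder direction, scaled by its activation."""
    return [h - strength * activation * d for h, d in zip(hidden, decoder_dir)]


# Hypothetical toy values: a 3-dimensional hidden state and one
# feature assumed to track untruthful responses.
hidden = [1.0, 2.0, 0.5]
encoder_row = [0.0, 1.0, 0.0]
decoder_dir = [0.0, 1.0, 0.0]
bias = -0.5
THRESHOLD = 1.0  # assumed detection threshold, not taken from the source

act = sae_feature_activation(hidden, encoder_row, bias)
flagged = act > THRESHOLD  # detection step
steered = steer_away(hidden, decoder_dir, act) if flagged else hidden  # correction step
```

In a real model the hidden state would come from a chosen layer and the encoder and decoder weights from a trained SAE; the threshold and steering strength would need to be tuned empirically.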

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood that have no typed relation to this one; they are candidates for new edges or may be unrecognized duplicates.
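
The selection rule above (cosine ≥ 0.65, no typed edge) could be sketched roughly as follows. This is an illustration under assumptions, not the system's actual implementation: the embedding vectors, the entity names, and the representation of typed edges as a set of unordered pairs are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above the threshold that share no typed edge with target."""
    hits = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical 2-dimensional embeddings for illustration only.
embeddings = {
    "A": [1.0, 0.0],
    "B": [1.0, 0.0],  # identical direction to A
    "C": [0.0, 1.0],  # orthogonal to A
    "D": [1.0, 1.0],  # similar, but already linked by a typed edge
    "E": [3.0, 4.0],  # cosine 0.6, below the threshold
    "F": [4.0, 3.0],  # cosine 0.8, above the threshold
}
typed_edges = {frozenset(("A", "D"))}

neighbors = related_by_similarity("A", embeddings, typed_edges)
```

Each returned pair would then be reviewed by hand as either a missing edge or a duplicate of the target entity.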

Model Deception · concept · 0.781
LLM behavior of generating falsehoods; the multi-dimensional truth subspace raises new risks for subtle manipulation
Apparent Deception · concept · 0.767
A dialogue agent behaving comparably to deliberate deception by role-playing a deceptive character, without literal intentions
Feedback and Correction · concept · 0.763
The adaptive, incremental nature of living process, allowing small steps with continuous evaluation and adjustment.
AI Deception · concept · 0.758
Central problem the paper addresses: AI systems producing misaligned outputs or behaviors that mislead users or other agents
Deception Induction Framework · concept · 0.755
Umbrella concept for the paper's dual experimental paradigms for reliably eliciting strategic deception in LLMs
Fact-Based Deception Under Coercive Circumstances · framework · 0.750
First experimental paradigm inducing and detecting verifiable lies under external coercion using threat-based prompts
Alignment-Faking Reasoning Classifier · method · 0.750
LLM-based classifier prompted to detect alignment-faking reasoning in model scratchpads
Intrinsic Self-Correction via Linear Representations · framework · 0.749
Framework by Lee et al. explaining self-correction via linear latent concept directions, closely related prior work.