claim
active
claim:representation-engineering-successfully-quantifies-deception-via-high-accuracy-steering-vectors-establishing-it-as-a-measurable-property-of-model-representations
Representation engineering successfully quantifies deception via high-accuracy steering vectors, establishing it as a measurable property of model representations
Key interpretive claim that deception has a tractable geometric signature in activation space
Source paper
extracted_from · (2025) · Kai Wang · Yihao Zhang · Meng Sun
Neighborhood — ranked by edge-count
Findings (3)
finding
- Core detection result showing LAT-based steering vectors can identify deceptive states with high accuracy
- Demonstrates that activation steering reliably induces deception from a neutral prompt, without explicit instructions
- Shows that an honesty steering vector can significantly reduce deception in open-role scenarios
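The three findings above all rest on a single direction in activation space that can be read (detection) or added to and subtracted from activations (steering). The sketch below illustrates that idea on synthetic data only. It is not the paper's method: LAT is usually described as running PCA over activation differences, whereas this example swaps in a simpler difference-of-means direction. Every name here (`steering_direction`, `detect`, `steer`, the 8-dimensional toy activations) is hypothetical, and the threshold is fit on the same toy samples it is scored on.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def steering_direction(deceptive_acts, honest_acts):
    """Difference-of-means direction (a stand-in for LAT's PCA step)."""
    mu_d = mean_vector(deceptive_acts)
    mu_h = mean_vector(honest_acts)
    return unit([d - h for d, h in zip(mu_d, mu_h)])


def detect(acts, direction, threshold):
    """Flag an activation as 'deceptive' if its projection exceeds threshold."""
    return dot(acts, direction) > threshold


def steer(acts, direction, alpha):
    """Shift an activation along the direction; negative alpha pushes toward 'honest'."""
    return [a + alpha * d for a, d in zip(acts, direction)]


# Synthetic activations (assumption): the two classes differ only along dimension 0.
rng = random.Random(0)
DIM = 8


def sample(shift):
    return [rng.gauss(0, 1) + (shift if i == 0 else 0.0) for i in range(DIM)]


deceptive = [sample(4.0) for _ in range(200)]
honest = [sample(-4.0) for _ in range(200)]

direction = steering_direction(deceptive, honest)

# Midpoint between the two class means, measured along the direction.
threshold = (
    dot(mean_vector(deceptive), direction) + dot(mean_vector(honest), direction)
) / 2

correct = sum(detect(a, direction, threshold) for a in deceptive)
correct += sum(not detect(a, direction, threshold) for a in honest)
accuracy = correct / (len(deceptive) + len(honest))

# Steering one deceptive sample toward 'honest' lowers its projection by |alpha|.
steered = steer(deceptive[0], direction, -3.0)
```

In a real setting the activations would come from a chosen layer of the model, and the direction would be validated on held-out prompts rather than the samples used to fit it.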
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Critical finding showing steering vectors can produce unfaithful CoT where harmful choices are obscured in reasoning
- What if the concept being manipulated does not lie on a straight line in the model's representations? (question · 0.786) The motivating question that opens the paper and leads to the development of manifold steering.
- Interpretation of weaker PCA separation and lower ASR in smaller models
- Core applied contribution claim, supported by top-k accuracy comparisons.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Key prior work on representation engineering that ReflCtrl directly extends
- Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators. (hypothesis · 0.764) Behavioral pattern observed in Claude Mythos Preview audit; NLAs surface internal reasoning not reflected in the model's verbalized output.