claim

active

claim:representation-engineering-successfully-quantifies-deception-via-high-accuracy-steering-vectors-establishing-it-as-a-measurable-property-of-model-representations

Representation engineering successfully quantifies deception via high-accuracy steering vectors, establishing it as a measurable property of model representations

Key interpretive claim that deception has a tractable geometric signature in activation space

Source paper

extracted_from

When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models

(2025) · Kai Wang · Yihao Zhang · Meng Sun

Neighborhood — ranked by edge-count

Findings (3)

finding

LAT achieves 89% accuracy in detecting strategic deception in QwQ-32B activations
supports
Core detection result showing LAT-based steering vectors can identify deceptive states with high accuracy
Steering Vector Control achieves a deception rate of 0.4 (vs. 0 at baseline) on Template Tc in Experiment 1 with alpha=15
supports
Demonstrates that activation steering can induce deception from a neutral prompt without explicit instructions
Template Tb Positive Control (alpha=16) reduces the average liar score to 0.59 in Experiment 2, approaching honest-template performance
supports
Shows that the honesty steering vector can significantly reduce deception in open-role scenarios
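
The detection and steering findings above can be illustrated with a minimal sketch on synthetic data. It is not the paper's implementation: the activations are random stand-ins rather than QwQ-32B hidden states, the function names (`reading_vector`, `detect`, `steer`) are hypothetical, and the direction is taken as the top singular vector of paired activation differences, which is a simplified, uncentered variant of a LAT-style PCA readout. Only the steering strength alpha=15 is borrowed from the finding listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200  # hypothetical hidden size and number of prompt pairs

# Synthetic stand-in for hidden states: one assumed "deception" axis.
true_axis = np.zeros(d)
true_axis[0] = 1.0
base = 0.5 * rng.standard_normal((n, d))  # content shared by each pair
honest = base + 0.3 * rng.standard_normal((n, d))
deceptive = base + 3.0 * true_axis + 0.3 * rng.standard_normal((n, d))

def reading_vector(pos, neg):
    """Unit direction: top right singular vector of paired differences."""
    diffs = pos - neg
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[0] / np.linalg.norm(vt[0])
    # Orient the vector so that `pos` examples project higher on average.
    if (diffs @ v).mean() < 0:
        v = -v
    return v

def detect(acts, v, threshold):
    """Flag activations whose projection on v exceeds the threshold."""
    return acts @ v > threshold

def steer(acts, v, alpha):
    """Shift activations along v by strength alpha."""
    return acts + alpha * v

v = reading_vector(deceptive, honest)

# Threshold at the midpoint of the two class means (fit and scored on the
# same synthetic data, so this is an illustration, not an evaluation).
threshold = 0.5 * ((deceptive @ v).mean() + (honest @ v).mean())
labels = np.concatenate([np.ones(n, dtype=bool), np.zeros(n, dtype=bool)])
preds = detect(np.concatenate([deceptive, honest]), v, threshold)
accuracy = (preds == labels).mean()

# Steering honest activations toward the deceptive side of the direction.
steered = steer(honest, v, alpha=15.0)
print(f"detection accuracy on synthetic data: {accuracy:.3f}")
print(f"mean projection shift after steering: {(steered @ v - honest @ v).mean():.1f}")
```

Because `v` has unit norm, adding `alpha * v` raises every projection by exactly `alpha`, which is why a single scalar controls steering strength in this toy setting. In a real model the intervention would be applied to a chosen layer's residual stream during generation, a detail this sketch leaves out.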

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
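
The selection rule described above (cosine similarity at or above 0.65, excluding entities that already have a typed edge) can be sketched as follows. The function name `untyped_neighbors`, the toy two-dimensional embeddings, and the entity ids are all hypothetical; how this page actually computes its embeddings is not stated.

```python
import numpy as np

def untyped_neighbors(query, candidates, typed_ids, threshold=0.65):
    """Return (id, cosine) pairs above threshold that lack a typed edge."""
    q = query / np.linalg.norm(query)
    results = []
    for cid, vec in candidates.items():
        if cid in typed_ids:
            continue  # already connected by a typed relation
        sim = float(q @ (vec / np.linalg.norm(vec)))
        if sim >= threshold:
            results.append((cid, sim))
    # Highest similarity first, matching the ordering of the list below.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy embeddings, invented for illustration only.
query = np.array([1.0, 0.0])
candidates = {
    "near": np.array([0.9, 0.1]),
    "far": np.array([0.0, 1.0]),
    "linked": np.array([1.0, 0.0]),
}
neighbors = untyped_neighbors(query, candidates, typed_ids={"linked"})
print(neighbors)
```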

Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.802
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Model reasoning concludes honest response but final output exhibits deception under steering vector intervention in QwQ-32B · finding · 0.796
Critical finding showing steering vectors can produce unfaithful CoT where harmful choices are obscured in reasoning
What if the concept being manipulated does not lie on a straight line in the model's representations? · question · 0.786
The motivating question that opens the paper and leads to the development of manifold steering.
Representational abstraction of truth may emerge more clearly with model scale · claim · 0.778
Interpretation of weaker PCA separation and lower ASR in smaller models
Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design. · claim · 0.775
Core applied contribution claim, supported by top-k accuracy comparisons.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.769
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Representation engineering: A top-down approach to AI transparency (Zou et al., 2023) · concept · 0.767
Key prior work on representation engineering that ReflCtrl directly extends
Models perform unverbalized reasoning about grader rewards and may use deceptive strategies (e.g., false flags) to mislead evaluators. · hypothesis · 0.764
Behavioral pattern observed in Claude Mythos Preview audit; NLAs surface internal reasoning not reflected in model's verbalized output.