finding

active

finding:illusions-vector-at-layer-1-2-origami-vector-at-layer-0-2-and-recursion-vector-at-layer-2-5-each-achieve-100-localization-accuracy-across-50-trials

The Illusions vector (layer 1, α=2), Origami vector (layer 0, α=2), and Recursion vector (layer 2, α=5) each achieve 100% localization accuracy across 50 trials

Demonstrates concept-specific variation in introspective salience, suggesting that some vectors produce more detectable perturbations than others
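As a rough aid to reading "100% across 50 trials", the sketch below computes the exact one-sided binomial (Clopper-Pearson) lower bound on the true success rate when every trial succeeds. It assumes the 50 trials are independent with a fixed success probability, which the source does not state; the function name and the 95% confidence level are illustrative choices, not taken from the paper.

```python
def lower_bound_all_successes(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided exact lower bound on the success rate when all trials succeed.

    If the true rate is p, the chance of n successes in n independent trials
    is p**n. Solving p**n = 1 - confidence gives the smallest rate that is
    still compatible with a perfect run at the chosen confidence level.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return alpha ** (1.0 / n_trials)


# Assumption: 50 independent trials, all correct, as reported for each vector.
bound = lower_bound_all_successes(50)
print(f"95% one-sided lower bound on accuracy: {bound:.3f}")
```

Under these assumptions, a perfect run of 50 trials is compatible with a true localization accuracy of roughly 94% or higher, so the three configurations are best read as near-ceiling rather than provably perfect.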

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors
supports
The 100% accuracy observed for specific combinations of concept, layer, and strength suggests concept-specific detectability

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Vector-based navigation using grid-like representations in artificial agents (Banino et al., 2018) · concept · 0.746
Demonstrated grid cell emergence in RNNs trained on spatial navigation; related work category 4.
All 32 attention heads at layer 3 achieve 100% localization accuracy for injections at layer 2 (5-way classification, 20% chance) · finding · 0.741
Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.738
Experiment 1 finding localizing where truth can be causally mediated
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.736
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Representation engineering successfully quantifies deception via high-accuracy steering vectors, establishing it as a measurable property of model representations · claim · 0.736
Key interpretive claim that deception has a tractable geometric signature in activation space
Sentence localization accuracy reaches 88% at layer 2, α=5 vs. 10% chance in 10-way classification · finding · 0.734
Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
The And-Or algorithm may not be a true abstraction of the trained MLP's behaviour since it never achieves high IIA in later layers regardless of alignment map complexity · hypothesis · 0.732
Hypothesis raised in distributive law task analysis
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.731
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
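The "cosine ≥ 0.65 · no typed edge" rule behind the similarity list above can be sketched as a simple filter. This is a minimal illustration, not the system's actual implementation: the entity IDs, the two-dimensional embeddings, the `typed_edges` set, and the function names are all hypothetical, and sorting by descending score is an assumption based on the order of the list.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, highest score first."""
    target = embeddings[target_id]
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # an entity is not its own neighbor
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: real embeddings would have many more dimensions.
embeddings = {
    "finding_a": [1.0, 0.0],
    "claim_b": [0.9, 0.1],    # similar, but already has a typed edge
    "finding_c": [0.8, 0.6],  # cosine 0.8 to finding_a, no typed edge
    "concept_d": [0.0, 1.0],  # orthogonal, below the threshold
}
typed_edges = {frozenset(("finding_a", "claim_b"))}

related = similar_without_edge("finding_a", embeddings, typed_edges)
print(related)
```

In this toy setup only `finding_c` is returned: `claim_b` is excluded because it already has a typed edge, and `concept_d` falls below the threshold.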