finding
active
Illusions vector at layer 1 (α=2), Origami vector at layer 0 (α=2), and recursion vector at layer 2 (α=5) each achieve 100% localization accuracy across 50 trials
Demonstrates concept-specific variation in introspective salience, suggesting that some concept vectors produce more detectable perturbations than others
Source paper
extracted_from (2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Vector-based navigation using grid-like representations in artificial agents (Banino et al., 2018) · concept · 0.746 · Demonstrated grid cell emergence in RNNs trained on spatial navigation; related work category 4.
- Striking mechanistic finding that injection creates universally detectable perturbation in residual stream immediately downstream
- Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.738 · Experiment 1 finding localizing where truth can be causally mediated
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Key interpretive claim that deception has a tractable geometric signature in activation space
- Sentence localization accuracy reaches 88% at layer 2, α=5 vs. 10% chance in 10-way classification · finding · 0.734 · Highest localization accuracy achieved, showing strong partial introspection for early-layer injections
- Hypothesis raised in distributive law task analysis
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
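
Note on chance level: the localization accuracies above can be compared with the 10% chance rate using an exact binomial tail. The sketch below is illustrative and not taken from the source paper. It assumes trials are independent, that the 100% results used the same 10-way setup as the 88% result, and that the 88% result also comes from 50 trials (44 of 50 correct). The function name `binomial_tail` is hypothetical.

```python
from math import comb


def binomial_tail(successes: int, trials: int, chance: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Assumed setup (not confirmed by the source): 50 independent trials,
# 10 candidate sentences per trial, so chance = 0.10.
p_perfect = binomial_tail(50, 50, 0.10)  # 100% accuracy: all 50 correct
p_88 = binomial_tail(44, 50, 0.10)       # 88% accuracy: 44 of 50, if n = 50

print(f"P(>=50/50 by chance) = {p_perfect:.3e}")
print(f"P(>=44/50 by chance) = {p_88:.3e}")
```

Under these assumptions both probabilities are vanishingly small, so the difference between the 88% and 100% conditions is better read as variation in detectability across concepts than as noise around chance.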