hypothesis

active

hypothesis:we-hypothesize-earlier-layer-interventions-allow-more-downstream-computation-to-process-and-potentially-correct-the-perturbation

We hypothesize earlier-layer interventions allow more downstream computation to process and potentially correct the perturbation

Post-hoc explanation for why steering at layer 33 rather than layer 50 produced better ESR behavior in Llama-3.3-70B
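
As a rough illustration of the depth argument, the sketch below counts how many decoder blocks still run after each steering site. It assumes Llama-3.3-70B has 80 decoder layers and that "layer k" means the perturbation is added to the residual stream after block k. The indexing convention and the helper name `downstream_layers` are assumptions for illustration and are not taken from the source paper.

```python
# Assumption: 80 decoder blocks in Llama-3.3-70B; "layer k" means the steering
# vector is added to the residual stream after block k (convention not
# confirmed by the source).
NUM_LAYERS = 80


def downstream_layers(steer_layer: int, num_layers: int = NUM_LAYERS) -> int:
    """Return how many decoder blocks run after the steered layer."""
    if not 0 <= steer_layer <= num_layers:
        raise ValueError("steer_layer must lie in [0, num_layers]")
    return num_layers - steer_layer


if __name__ == "__main__":
    for layer in (33, 50):
        print(f"steer at layer {layer}: {downstream_layers(layer)} blocks downstream")
```

Under these assumptions, steering at layer 33 leaves 47 blocks to process the perturbation, against 30 for layer 50. The count only quantifies available depth; it does not show that the extra depth is what drives the difference in ESR behavior.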

Source paper

extracted_from

Endogenous Resistance to Activation Steering in Language Models

(2026) · Alex McKenzie · Keenan Pepper · Stijn Servaes · Martin Leitgab +5

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
introduces

Concepts (1)

concept

Endogenous Steering Resistance
associated_with
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Signal integration from early perturbation into an explicit prediction requires substantial downstream computation spanning layers 4-20 · claim · 0.809
Mechanistic characterization based on logit lens analysis showing gradual accuracy rise across layers
Performance is best when skipping both the first and last six layers when applying intervention · claim · 0.788
Empirical configuration finding from ablation study on layer selection
What is the full computational pathway underlying self-correction across multiple layers? · question · 0.787
Mechanistic question requiring multi-layer SAE analysis beyond current single-layer approach
The middle layer residual stream features are causally implicated in multi-step reasoning. · claim · 0.763
Features for Kobe Bryant, California, Lakers participate in computing the capital answer.
"Our findings demonstrate that LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon that merits further investigation." · quote · 0.755
Central thesis statement of the paper
LLMs can compute meaningful functions over perturbations to their internal states, establishing introspection as a real but layer-dependent phenomenon · claim · 0.752
Primary positive claim of the paper, grounded in strength comparison and localization results
If someone develops clear enough introspection, they will eventually conclude that thought is rendered as subtle perturbations in phenomenal fields. · hypothesis · 0.749
Cube Flipper's prediction about convergence of insight practice on field model.
Late-layer injection fails both because there is insufficient computational depth for integration and because residual recovery dynamics attenuate the perturbation before it influences output logits · claim · 0.744
Mechanistic account explaining why late-layer introspection fails, combining two independent explanatory factors