finding
active
47.69% of 130 injection-manipulated alpha trends have near-linear fits (R² ≥ 0.95); 96.15% have roughly linear fits (R² ≥ 0.75)
Demonstrates alignment with the Linear Representation Hypothesis: the target trait steers approximately linearly with the injection coefficient alpha.
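The linearity check behind these percentages can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes each trend is a pair of arrays (alpha values and the mean trait score mu at each alpha), fits a straight line by ordinary least squares, and reports the share of trends whose R² clears each threshold. The function names `ols_r2` and `linearity_shares` and the flat-trend convention are hypothetical choices made for this example.

```python
import numpy as np


def ols_r2(alpha, mu):
    """Return R^2 of the straight-line OLS fit mu ~ intercept + slope * alpha."""
    alpha = np.asarray(alpha, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # np.polyfit returns coefficients from highest degree down: (slope, intercept).
    slope, intercept = np.polyfit(alpha, mu, 1)
    residuals = mu - (slope * alpha + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((mu - mu.mean()) ** 2))
    # Assumption: a perfectly flat trend has no variance to explain; score it 0.
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0


def linearity_shares(trends, near=0.95, rough=0.75):
    """Percentage of trends with R^2 >= near and with R^2 >= rough.

    `trends` is an iterable of (alpha, mu) pairs. The default thresholds
    mirror the 0.95 and 0.75 cut-offs quoted in the finding.
    """
    r2 = np.array([ols_r2(a, m) for a, m in trends])
    return {
        "near_linear_pct": float(100.0 * np.mean(r2 >= near)),
        "roughly_linear_pct": float(100.0 * np.mean(r2 >= rough)),
    }


if __name__ == "__main__":
    alphas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
    toy_trends = [
        (alphas, 0.5 * alphas + 1.0),  # exactly linear in alpha
        (alphas, alphas ** 2),         # symmetric curve, no linear trend
    ]
    print(linearity_shares(toy_trends))
```

The roughly-linear share always includes the near-linear share, since any trend with R² ≥ 0.95 also has R² ≥ 0.75. As an arithmetic check on the reported figures, 62 of 130 rounds to 47.69% and 125 of 130 rounds to 96.15%.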
Source paper
extracted_from · (2026) · Leonardo Blas · Robin Jia · Emilio Ferrara
Neighborhood — ranked by edge-count
Claims (1)
claim
- MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation · associated_with · supports · Theoretical alignment claim backed by OLS R² analysis showing 96.15% of trends have R² ≥ 0.75
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Control comparison showing near-linearity is specific to the targeted manipulation direction
- Table 2, row 3, showing equivalence when prior preferences match rewards.
- OLS regression fitted to mu(alpha) trends to assess near-linearity of steering with alpha coefficient
- Core result showing MM is superior to LR for causal implication despite similar classification accuracy
- Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.756 · Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
- Controls for probe artifacts; demonstrates self-reports carry information specifically about probe-defined concept directions
- Suggests fundamental differences in learning dynamics between normal and chronic perception models
- Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias