finding

active

finding:47-69-of-130-injection-manipulated-alpha-trends-have-near-linear-fits-r2-0-95-96-15-have-roughly-linear-fits-r2-0-75

47.69% of 130 injection-manipulated alpha trends have near-linear fits (R² ≥ 0.95); 96.15% have roughly linear fits (R² ≥ 0.75)

Demonstrates alignment with the Linear Representation Hypothesis: the target trait shifts approximately linearly with the steering coefficient alpha
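The two percentages can be reproduced in outline by fitting an ordinary least-squares line to each alpha trend and counting how many fits clear each R² threshold. The sketch below is a minimal illustration under stated assumptions: each trend is taken to be a pair of equal-length lists (alpha values, mean trait scores), and the function names `r_squared` and `linearity_shares` are hypothetical. The paper's exact alpha grid and scoring pipeline are not given in this record.

```python
def r_squared(alphas, mus):
    """R² of the OLS line mu ≈ slope * alpha + intercept for one trend."""
    n = len(alphas)
    mean_a = sum(alphas) / n
    mean_m = sum(mus) / n
    s_aa = sum((a - mean_a) ** 2 for a in alphas)
    s_am = sum((a - mean_a) * (m - mean_m) for a, m in zip(alphas, mus))
    ss_tot = sum((m - mean_m) ** 2 for m in mus)
    if s_aa == 0 or ss_tot == 0:
        # Degenerate trend (constant alpha or constant score): R² is undefined.
        return float("nan")
    slope = s_am / s_aa
    intercept = mean_m - slope * mean_a
    ss_res = sum((m - (slope * a + intercept)) ** 2 for a, m in zip(alphas, mus))
    return 1.0 - ss_res / ss_tot


def linearity_shares(trends, near=0.95, rough=0.75):
    """Percentage of trends with R² >= near and with R² >= rough.

    Thresholds default to the values quoted in the finding; trends with an
    undefined R² count toward neither share.
    """
    scores = [r_squared(alphas, mus) for alphas, mus in trends]
    n = len(scores)
    near_pct = 100.0 * sum(s >= near for s in scores) / n
    rough_pct = 100.0 * sum(s >= rough for s in scores) / n
    return near_pct, rough_pct
```

Applied to the 130 manipulated trends, the reported shares are consistent with 62 trends at R² ≥ 0.95 (62/130 ≈ 47.69%) and 125 trends at R² ≥ 0.75 (125/130 ≈ 96.15%).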

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Claims (1)

claim

MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation
associated_with · supports
Theoretical alignment claim backed by an OLS R² analysis showing that 96.15% of trends have R² ≥ 0.75

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Only 13.27% of 520 non-manipulated alpha trends achieve R² ≥ 0.95, contrasting with 47.69% for manipulated trends · finding · 0.855
Control comparison showing near-linearity is specific to the targeted manipulation direction
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents. · finding · 0.763
Table 2, row 3, showing equivalence when prior preferences match rewards.
OLS Linear Regression Fit to Alpha Trends · method · 0.762
OLS regression fitted to mu(alpha) trends to assess whether steering is near-linear in the alpha coefficient
Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions · finding · 0.756
Core result showing MM is superior to LR for causal implication despite similar classification accuracy
Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.756
Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
Random direction controls show weak non-significant coupling (ρ=-0.11 to 0.17; R²=0.03–0.11) compared to true probes (∆ρ=0.23–0.79, all p<0.05) · finding · 0.755
Controls for probe artifacts; demonstrates self-reports carry information specifically about probe-defined concept directions
Normal (α=0.9) and chronic (α=0.1) agents in Objective-only non-stationary category perform best with opposite learning rates · finding · 0.751
Suggests fundamental differences in learning dynamics between normal and chronic perception models
Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control) · finding · 0.750
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias