claim
active
claim:mds-injections-align-with-the-linear-representation-hypothesis-target-trait-varies-near-linearly-with-alpha-in-open-ended-generation
MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation
Theoretical alignment claim, backed by an OLS R2 analysis showing that 96.15% of trends have R2 >= 0.75
Source paper
extracted_from · (2026) · Leonardo Blas · Robin Jia · Emilio Ferrara
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- 47.69% of 130 injection-manipulated alpha trends have near-linear fits (R2 >= 0.95); 96.15% have roughly linear fits (R2 >= 0.75) · associated_with · supports · Demonstrates alignment with the Linear Representation Hypothesis: the target trait steers approximately linearly with alpha
Frameworks (1)
framework
- Linear Representation Hypothesis · supports · The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
Methods (1)
method
- OLS regression fitted to mu(alpha) trends to assess near-linearity of steering with alpha coefficient
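The OLS check described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `ols_r2` and `linearity_flags` are hypothetical, the sample values are synthetic, and it assumes each trend is a set of (alpha, mu) pairs fitted with a single straight line. The thresholds 0.95 and 0.75 are the ones quoted in the finding.

```python
import numpy as np


def ols_r2(alpha, mu):
    """Fit mu ~ slope * alpha + intercept by ordinary least squares.

    Returns (slope, intercept, r2). Hypothetical helper for illustration.
    """
    alpha = np.asarray(alpha, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # A degree-1 polynomial fit is an OLS straight-line fit.
    slope, intercept = np.polyfit(alpha, mu, deg=1)
    residuals = mu - (slope * alpha + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((mu - mu.mean()) ** 2))
    # R2 is undefined for a constant trend (zero total variance).
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return float(slope), float(intercept), r2


def linearity_flags(r2):
    """Return (near_linear, roughly_linear) using the quoted thresholds.

    Every near-linear trend (R2 >= 0.95) also counts as roughly linear
    (R2 >= 0.75), matching how the two percentages are reported.
    """
    return (r2 >= 0.95, r2 >= 0.75)


# Synthetic example values, not data from the paper.
alphas = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_trend = [1.0, 2.0, 3.0, 4.0, 5.0]
curved_trend = [4.0, 1.0, 0.0, 1.0, 4.0]

for name, trend in [("linear", linear_trend), ("curved", curved_trend)]:
    slope, intercept, r2 = ols_r2(alphas, trend)
    print(name, round(slope, 3), round(r2, 3), linearity_flags(r2))
```

Applying `ols_r2` to each trend and counting the flags gives the share of trends above each threshold.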
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Qualitative finding demonstrating unique capability of activation-level interventions unavailable to prompting methods including PM
- MDS injection steering efficiency peaks at mid-layers across LLMs, injection strides, and OCEAN traits · finding · 0.793 · Consistent empirical pattern supporting the connection between mid-layer representations and emotion/behavioral content
- MDS injections show no salient patterns in MPI-120 inventory responses beyond occasional co-occurring peaks · finding · 0.790 · Contrasts with SJT results; leads authors to narrow analyses to SJT responses
- Do the findings about MDS injection effectiveness generalize to base (non-instruction-tuned) language models? · question · 0.779 · Acknowledged limitation: only instruction-tuned models were studied
- MDS injections outperform P2 in open-ended generation in 11 of 14 LLMs with Phi gains of 3.61% to 16.44% · finding · 0.774 · Primary quantitative result overturning prior reports that prompting outperforms representation engineering
- Interpretive conclusion from Big Two mismatch finding; tentative due to only 46.15% match rate
- Mechanistic explanation for MDS superiority; attributed to two design choices: centroid alignment and full-utterance semantics in h_s
- Mechanistic explanation for discrepancy with Banayeeanzade et al.; addressed by centroid unit and unbounded sweep contributions