claim

active

claim:mds-injections-align-with-the-linear-representation-hypothesis-target-trait-varies-near-linearly-with-alpha-in-open-ended-generation

MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation

Theoretical alignment claim backed by an OLS R² analysis: 96.15% of alpha trends have R² ≥ 0.75

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Papers (1)

paper

Psychological Steering of Large Language Models
introduces

Findings (1)

finding

47.69% of 130 injection-manipulated alpha trends have near-linear fits (R² ≥ 0.95); 96.15% have roughly linear fits (R² ≥ 0.75)
associated_with · supports
Demonstrates alignment with Linear Representation Hypothesis: target trait steers approximately linearly with alpha
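
As a quick arithmetic check, the reported percentages can be mapped back to whole trend counts. This is a sketch that assumes each percentage is a fraction of the 130 trends rounded to two decimals; the counts themselves are not stated in the source, and `implied_count` is a hypothetical helper name.

```python
def implied_count(share_percent: float, total: int) -> int:
    """Nearest whole number of trends matching a reported percentage (assumed rounded)."""
    return round(total * share_percent / 100)


TOTAL_TRENDS = 130  # number of alpha trends given in the finding

for share in (47.69, 96.15):
    count = implied_count(share, TOTAL_TRENDS)
    # Recompute the percentage from the whole count to confirm it round-trips.
    recovered = round(100 * count / TOTAL_TRENDS, 2)
    print(f"{share}% of {TOTAL_TRENDS} -> {count} trends (recomputed: {recovered}%)")
```

Under that assumption, both percentages round-trip to whole counts, which suggests they are simple fractions of the 130 trends.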

Frameworks (1)

framework

Linear Representation Hypothesis
supports
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior

Methods (1)

method

OLS Linear Regression Fit to Alpha Trends
supports
OLS regression fitted to each mu(alpha) trend to assess how nearly linearly the steered trait scales with the alpha coefficient
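
The linearity check described here can be sketched as follows. This is a minimal illustration, not the paper's code: the function names `ols_r2` and `linearity_label` are hypothetical, the example trends are synthetic, and only the two R² thresholds (0.95 and 0.75) come from the finding above.

```python
import numpy as np


def ols_r2(alpha, mu):
    """Fit mu ~ intercept + slope * alpha by ordinary least squares.

    Returns (slope, intercept, r2), where r2 is the coefficient of determination.
    """
    alpha = np.asarray(alpha, dtype=float)
    mu = np.asarray(mu, dtype=float)
    slope, intercept = np.polyfit(alpha, mu, deg=1)  # degree-1 least-squares fit
    residuals = mu - (intercept + slope * alpha)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((mu - mu.mean()) ** 2))
    # R^2 is undefined for a constant trend (zero total variance).
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return float(slope), float(intercept), r2


def linearity_label(r2):
    """Bucket a fit using the thresholds quoted in the finding."""
    if r2 >= 0.95:
        return "near-linear"
    if r2 >= 0.75:
        return "roughly linear"
    return "not linear"


# Synthetic alpha sweep (illustrative values, not data from the paper).
alphas = np.linspace(-4.0, 4.0, 9)
trends = {
    "exactly linear": 3.0 + 0.5 * alphas,
    "saturating": np.tanh(alphas),
    "symmetric": alphas ** 2,
}

for name, mu in trends.items():
    slope, intercept, r2 = ols_r2(alphas, mu)
    print(f"{name}: slope={slope:.3f}, R2={r2:.3f}, {linearity_label(r2)}")
```

A high R² only says that a straight line explains most of the variance in mu(alpha) over the swept range; a saturating trend can still clear the 0.75 threshold, which is why the finding reports both cutoffs.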

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

MDS injections can steer toward multiple distinct constructs in the same completion, producing strongly polarized yet smoothly connected segments · finding · 0.812
Qualitative finding demonstrating unique capability of activation-level interventions unavailable to prompting methods including PM
MDS injection steering efficiency peaks at mid-layers across LLMs, injection strides, and OCEAN traits · finding · 0.793
Consistent empirical pattern supporting the connection between mid-layer representations and emotion/behavioral content
MDS injections show no salient patterns in MPI-120 inventory responses beyond occasional co-occurring peaks · finding · 0.790
Contrasts with SJT results; leads authors to narrow analyses to SJT responses
Do the findings about MDS injection effectiveness generalize to base (non-instruction-tuned) language models? · question · 0.779
Acknowledged limitation: only instruction-tuned models were studied
MDS injections outperform P2 in open-ended generation in 11 of 14 LLMs with Phi gains of 3.61% to 16.44% · finding · 0.774
Primary quantitative result overturning prior reports that prompting outperforms representation engineering
OCEAN MDS injection covariance patterns departing from the Big Two model suggest a gap between learned LLM representations and human psychology · claim · 0.770
Interpretive conclusion from Big Two mismatch finding; tentative due to only 46.15% match rate
MD vectors outperform probe-based vectors in SJTs because they align with construct centroids rather than distorting direction via regularization · claim · 0.755
Mechanistic explanation for MDS superiority; attributed to two design choices: centroid alignment and full-utterance semantics in h_s
Uncalibrated sweep units and restricted coefficient ranges are the primary cause of prior reports showing P2 outperforming MD injections · claim · 0.755
Mechanistic explanation for discrepancy with Banayeeanzade et al.; addressed by centroid unit and unbounded sweep contributions