question

active

question:why-do-mds-injections-fail-on-gemma-3-1b-it-but-succeed-across-all-other-tested-llms

Why do MDS injections fail on gemma-3-1b-it but succeed across all other tested LLMs?

Unexplained exception identified as a limitation and open question

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Papers (1)

paper

Psychological Steering of Large Language Models
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

gemma-3-1b-it yields only one valid MDS injection score (phi_1,A,up = 4.8) and is excluded from main analyses · finding · 0.814
Identified exception to overall MDS effectiveness; the cause is unexplained and noted as a limitation
Why do MDS injections outperform other methods on the inventory (multiple-choice) task? · question · 0.762
Identified as an unexplained result and open question in limitations section
MDS injections show no salient patterns in MPI-120 inventory responses beyond occasional co-occurring peaks · finding · 0.731
Contrasts with SJT results; leads authors to narrow analyses to SJT responses
MDS injections outperform P2 in open-ended generation in 11 of 14 LLMs with Phi gains of 3.61% to 16.44% · finding · 0.727
Primary quantitative result overturning prior reports that prompting outperforms representation engineering
MDS injection steering efficiency peaks at mid-layers across LLMs, injection strides, and OCEAN traits · finding · 0.720
Consistent empirical pattern supporting the connection between mid-layer representations and emotion/behavioral content
MDS injections can steer toward multiple distinct constructs in the same completion, producing strongly polarized yet smoothly connected segments · finding · 0.719
Qualitative finding demonstrating unique capability of activation-level interventions unavailable to prompting methods including PM
MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation · claim · 0.717
Theoretical alignment claim backed by OLS R² analysis showing 96.15% of trends have R² ≥ 0.75
Toxic LLMs show higher IIA when compared to other toxic models than when compared to nontoxic models using stepwise MAS · finding · 0.708
Proof-of-principle that MAS can detect model misalignment in DeepSeek-R1-Qwen-1.5B fine-tuned models