question

active

question:do-the-findings-about-mds-injection-effectiveness-generalize-to-base-non-instruction-tuned-language-models

Do the findings about MDS injection effectiveness generalize to base (non-instruction-tuned) language models?

Acknowledged limitation: only instruction-tuned models were studied

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Papers (1)

paper

Psychological Steering of Large Language Models
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation · claim · 0.779
Theoretical alignment claim backed by an OLS R² analysis showing that 96.15% of trends have R² ≥ 0.75
DAS finds causal effect at all training timesteps including when model is just initialised · finding · 0.773
Corroborates Wu et al. 2023 finding that DAS expressivity inflates causal effect estimates
MDS injections show no salient patterns in MPI-120 inventory responses beyond occasional co-occurring peaks · finding · 0.765
Contrasts with SJT results; leads authors to narrow analyses to SJT responses
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.764
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.761
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
Results may not fully generalize to all models and scenarios because the model organism relies on hints and nudges and Llama Nemotron cannot consistently distinguish evaluation/deployment based on subtle cues · claim · 0.758
Key limitation acknowledged by authors.
OCEAN MDS injection covariance patterns departing from the Big Two model suggest a gap between learned LLM representations and human psychology · claim · 0.750
Interpretive conclusion from the Big Two mismatch finding; tentative because the match rate is only 46.15%
It's tricky, because for a typical language model the entity is sort of tricameral: the base simulator, the simulated simulator, and the simulated awareness. · quote · 0.748
Antra's earlier definitive statement of the tricameral model.