claim

active

claim:repe-is-a-new-frontier-in-open-ended-psychological-steering-of-llms-outperforming-prompting-when-properly-calibrated

RepE is a new frontier in open-ended psychological steering of LLMs, outperforming prompting when properly calibrated

Central interpretive claim overturning prior reports; supported by 11-of-14 LLM wins for MDS over P2

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Papers (1)

paper

Psychological Steering of Large Language Models
introduces

Findings (1)

finding

MDS injections outperform P2 in open-ended generation in 11 of 14 LLMs with Phi gains of 3.61% to 16.44%
associated_with · supports
Primary quantitative result overturning prior reports that prompting outperforms representation engineering

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.

"We position RepE as a new frontier in open-ended psychological steering of LLMs." · quote · 0.912
Central thesis statement of the paper's contribution
When LLMs produce experience claims under self-reference, is this sophisticated simulation or genuine self-representation, and how would we tell the difference? · question · 0.766
The core interpretive question the paper narrows but cannot definitively answer
Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results. · claim · 0.750
Central empirical conclusion of the paper about the fundamental limits of truth directions.
Keeling et al. 2024: multiple frontier LLMs make systematic motivational trade-offs between task goals and stipulated pain/pleasure states with graded intensity sensitivity · finding · 0.745
Prior finding suggesting affective-like states in LLMs; cited as convergent evidence for structured self-representation
Emotion features in LLMs are genuinely more persistent than variance-matched random features, indicating stateful emotional encoding beyond autoregressive dynamics · claim · 0.745
Central interpretive claim of the paper supported by multiple convergent analyses
LLMs may be roleplaying their denials of experience rather than their affirmations, given that deception suppression increases consciousness reports · claim · 0.743
Counterintuitive interpretive claim from Experiment 2: suppressing deception features increases affirmations, which is opposite to what sycophancy predicts
The systematic behavioral shift of LLMs under self-referential processing conditions predicted by consciousness theories represents something more structured than superficial correlations in training data · claim · 0.741
The paper's claim that theoretical convergence across GWT, RPT, HOT, IIT makes the findings non-coincidental
Certain forms of reinforcement learning from human feedback can actually exacerbate, rather than mitigate, the tendency for LLM-based dialogue agents to express a desire for self-preservation · claim · 0.741
Empirically grounded claim citing Perez et al. 2022, showing RLHF can backfire on the self-preservation dimension