claim

active

claim:the-psychological-steering-framework-generalizes-beyond-ocean-to-dark-tetrad-cmni-cfni-and-other-psychological-models

The psychological steering framework generalizes beyond OCEAN to Dark Tetrad, CMNI, CFNI, and other psychological models

Supported by qualitative experiments showing fluent and coherent steering for three additional models

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Papers (1)

paper

Psychological Steering of Large Language Models
introduces

Findings (1)

finding

MDS injections can steer toward multiple distinct constructs in the same completion, producing strongly polarized yet smoothly connected segments
supports
Qualitative finding demonstrating a capability unique to activation-level interventions and unavailable to prompting methods, including PM

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Psychological Steering Framework · framework · 0.817
The paper's primary contribution: performs unbounded, fluency-constrained sweeps in semantically calibrated centroid units using psychological artifacts
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles · claim · 0.769
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
Optimally steering model behavior requires isolating concept geometry and defining operators to navigate it. · claim · 0.765
Post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona · claim · 0.764
Central interpretive claim and motivation for future work
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts · claim · 0.762
Addresses skeptical alternative that reports reflect only conversational content
There may exist a global introspective faculty or steering direction that improves introspection uniformly across all concepts · hypothesis · 0.758
Framed as an open problem; current evidence only points to local pair-specific improvement
Can concept steering interventions on EEG foundation models be made selective rather than globally destructive? · question · 0.758
Research question motivating the introduction of the probe area metric and identification of operational regimes
The framework's methodological contributions could be adapted to target arbitrary non-psychological attributes given custom evaluation criteria · hypothesis · 0.757
Generalization hypothesis stated in introduction; not tested in paper
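
The selection rule for the similarity list (cosine ≥ 0.65, no typed edge) can be sketched in a few lines. This is a minimal illustration, not the system that produced this page: the function names, the toy two-dimensional embeddings, and the edge representation as (source, target) pairs are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Entities whose cosine to `target` meets the threshold and that share
    no typed edge with it, sorted by descending score (hypothetical helper)."""
    # Collect everything already connected to the target in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything with a typed relation
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, round(score, 3)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data (assumed): "paper" is excluded because it has a typed edge,
# "unrelated" because its similarity falls below the threshold.
toy_embeddings = {
    "claim": [1.0, 0.0],
    "framework": [0.8, 0.6],
    "paper": [1.0, 0.1],
    "unrelated": [0.0, 1.0],
}
toy_edges = [("claim", "paper")]
related = related_by_similarity("claim", toy_embeddings, toy_edges)
```

Excluding typed neighbors before thresholding is what makes the remaining entries candidates for new edges or unrecognized duplicates rather than a restatement of the known neighborhood.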