Persona vectors: Monitoring and controlling character traits in language models

By Runjin Chen · Andy Arditi · Henry Sleight · Owain Evans · Jack Lindsey

DOI: 10.48550/arXiv.2507.21509 · arXiv: 2507.21509

Related work (references, in-corpus papers, and external arXiv)

Badges (cited, in corpus, arXiv) indicate which sources surfaced each entry; entries surfaced by multiple sources are weighted higher.

Persona Vectors in Games: Measuring and Steering Strategies via Activation Vectors
Andrew Zhang, Johnathan Sun
2026
≈ 78%
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
in corpus
2026
≈ 75%
Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs
Wenqiu Tang and Zhen Wan and Takahiro Komamizu and Ichiro Ide
2026
≈ 75%
Where is the Mind? Persona Vectors and LLM Individuation
Pierre Beckmann and Patrick Butlin
2026
≈ 75%
Creating user stereotypes for persona development from qualitative data through semi-automatic subspace clustering
Thomas Bjorner, Pernille Krog Sorensen, Paolo Burelli, Dannie Korsgaard
2026
≈ 75%
BILLY: Steering Large Language Models via Merging Persona Vectors for Creative Generation
Jui-I Wang, Li-Chun Lu, Shao-Hua Sun, Hung-Yi Lee, Kai-Wei Chang, Tsung-Min Pai
2026
≈ 74%
Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
Henning Bartsch, Nathan Lambert, Evan Hubinger, Sharan Maiya
2025
≈ 74%
The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents
Yihong Tang, Xuefeng Bai, Kehai Chen, Juntao Li, Min Zhang, Yifan Duan
2025
≈ 73%
The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation
Katharina Dworatzyk, Sophie Jentzsch, Peer Schütt, Sabine Theis, Tobias Hecking, Diaoulé Diallo
2026
≈ 72%
Controllable and explainable personality sliders for LLMs at inference time
David Khachaturov, Robert Mullins, Mark Huasong Meng, Florian Hoppe
2026
≈ 72%
Evaluating Large Language Models with Psychometrics
Yue Huang, Hongyi Wang, Ying Cheng, Xiangliang Zhang, James Zou, Lichao Sun, Yuan Li
2025
≈ 71%
Steering at the Source: Style Modulation Heads for Robust Persona Control
Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura, Yoshihiro Izawa
2026
≈ 71%
Persona-Model Collapse in Emergent Misalignment
Renato Vicente, Davi Bastos Costa
2026
≈ 71%
What Can We Actually Steer? A Multi-Behavior Study of Activation Control
Krystian Novak, Tetiana Bas
2026
≈ 71%
Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
Fan Yang, Shaunak A. Mehta, Koichi Onoue, Wenkai Li
2026
≈ 71%
"You tell me": A Dataset of GPT-4-Based Behaviour Change Support Conversations
Selina Meyer and David Elsweiler
2026
≈ 71%
Psychological Steering of Large Language Models
in corpus
2026
≈ 69%
Paper Summary: Interpreting Language Model Parameters
in corpus
≈ 68%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 68%
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
in corpus
2026
≈ 68%
Persistence and Introspection of Emotion Features
in corpus
≈ 67%
Interpreting Language Model Parameters
in corpus
2026
≈ 67%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 67%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 67%
Anima Labs Phenomenology Pt1
in corpus
≈ 66%
Steering Along Manifolds to Control Neural Networks
in corpus
≈ 66%
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
in corpus
2024
≈ 65%
Koan Battery: Measuring Reflective Mode Accessibility in AI
in corpus
2026
≈ 65%


Cited by (3)

Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
Sparse Autoencoder (SAE)-based contrastive feature retrieval can reliably identify and bidirectionally steer high-order semantic features in LLMs, outperforming Contrastive Activation Addition (CAA) …
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering (intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions) produces behavioral …
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering …