Representation engineering: A top-down approach to AI transparency

By A. Zou · L. Phan · S. Chen · J. Campbell · P. Guo · R. Ren +4 more

DOI 10.48550/arxiv.2310.01405 arXiv 2310.01405

Related work: refs + corpus + external arXiv

Cited, in-corpus, and arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Towards AI Transparency and Accountability: A Global Framework for Exchanging Information on AI Systems
Adrian Byrne, Nicholas Perello, Cyrus Cousins, Taha Yasseri, Yair Zick, Przemyslaw Grabowicz Warren Buckley
2026
≈ 81%
Engineering.ai: A Platform for Teams of AI Engineers in Computational Design
Yupeng Qi, Jingsen Feng, Xu Chu Ran Xu
2025
≈ 79%
Why Representation Engineering Works: A Theoretical and Empirical Study in Vision-Language Models
Xuntao Lyu, Meng Liu, Hongyi Wang, Ang Li Bowei Tian
2025
≈ 79%
Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models
Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz Jan Wehner
2025
≈ 78%
TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains
Serhii Zabolotnii
2026
≈ 77%
Context Engineering: From Prompts to Corporate Multi-Agent Architecture
Vera V. Vishnyakova
2026
≈ 77%
A Timeline and Analysis for Representation Plasticity in Large Language Models
Akshat Kannan
2024
≈ 76%
A call for embodied AI
Jonas Gonzalez-Billandon, Balázs Kégl Giuseppe Paolo
2024
≈ 76%
Agentic AI in Engineering and Manufacturing: Industry Perspectives on Utility, Adoption, Challenges, and Opportunities
Maxwell Bauer, Claire Jacquillat, A. John Hart, Faez Ahmed Kristen M. Edwards
2026
≈ 76%
Socio-technical aspects of Agentic AI
Alaa Saleh, Ying Li, Shubham Vaishnav, Kai Fang, Hailin Feng, Yuchao Xia, Thippa Reddy Gadekallu, Qiyang Zhang, Xiaodan Shi, Ali Beikmohammadi, Sindri Magnússon, Ilir Murturi, Chinmaya Kumar Dehury, Marcin Paprzycki, Lauri Loven, Sasu Tarkoma, Schahram Dustdar Praveen Kumar Donta
2026
≈ 76%
A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows
Ross Gore, Peter Foytik, Sachin Shetty, Ravi Mukkamala, Abdul Rahman, Xueping Liang, Safdar H. Bouk, Amin Hass, Sachini Rajapakse, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan Eranga Bandara
2025
≈ 76%
Artificial Intelligence for Collective Intelligence: A National-Scale Research Strategy
Nirav Ajmeri (1), Mike Batty (2), Michaela Black (3), John Cartlidge (1), Robert Challen (1), Cangxiong Chen (4), Jing Chen (5), Joan Condell (3), Leon Danon (1), Adam Dennett (2), Alison Heppenstall (6), Paul Marshall (1), Phil Morgan (5), Aisling O'Kane (1), Laura G. E. Smith (4), Theresa Smith (4), Hywel T. P. Williams (7) ((1) University of Bristol, (2) University College London, (3) Ulster University, (4) University of Bath, (5) Cardiff University, (6) University of Glasgow, (7) University of Exeter) Seth Bullock (1)
2024
≈ 76%
From Junior to Senior: Allocating Agency and Navigating Professional Growth in Agentic AI-Mediated Software Engineering
Bhada Yun, April Yi Wang Dana Feng
2026
≈ 75%
Describing Agentic AI Systems with C4: Lessons from Industry Projects
Stefan Wittek Andreas Rausch
2026
≈ 75%
Automotive Engineering-Centric Agentic AI Workflow Framework
Zhihao Liu, Piero Brigida, Yerlan Akhmetov, Gurudevan Devarajan, Kai Liu, Ajinkya Bhave Tong Duy Son
2026
≈ 75%
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds
in corpus
2022
≈ 71%
Taking AI Welfare Seriously
in corpus
2024
≈ 70%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 70%
Cognitive glues are shared models of relative scarcities: the economics of collective intelligence
in corpus
2026
≈ 70%
CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence
in corpus
2022
≈ 70%
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
in corpus
≈ 70%
Interpreting Language Model Parameters
in corpus
2026
≈ 70%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
in corpus
2023
≈ 70%
Towards a theory of conceptual design for software
in corpus
2015
≈ 69%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
in corpus
2026
≈ 69%
Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
in corpus
2023
≈ 69%
Janus Information Flow Transformers 2025
in corpus
≈ 69%
Collective intelligence: A unifying concept for integrating biology across scales and substrates
in corpus
2024
≈ 69%

Cited by (5)

Endogenous Resistance to Activation Steering in Language Models
ReflCtrl: Controlling LLM Reflection via Representation Engineering
ReflCtrl demonstrates that self-reflection in reasoning LLMs is governed by an identifiable direction in latent representation space and that suppressing this direction via stepwise steering can reduc
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Self-Other Overlap (SOO) fine-tuning, a method that minimizes the Mean Squared Error between a model's internal activations when processing self-referencing versus other-referencing inputs, reduces de
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
Probe-based data attribution, introduced here as a method for surfacing and mitigating undesirable post-training behaviors, reduces harmful compliance in OLMo 2 7B by 63% through datapoint filtering a