Concept bottleneck large language models (Sun et al., 2025a)

Related work designing LLMs to natively support interpretable concept steering

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Representation engineering for large-language models: Survey and research challenges (Bartoszcze et al., 2025) · concept · 0.806
Survey of representation engineering methods cited as related work
Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.793
GPT-4 engaging in insider trading and denying it; related work on strategic deception
Shanahan 2023: Talking about large language models · concept · 0.789
Prior paper by Shanahan cautioning against anthropomorphic terms for LLMs; cited as ref 1
Large language models develop surprisingly coherent yet often rigid internal preferences as they scale · finding · 0.781
Mazeika et al. finding reinforcing the need for emptiness-based flexible value architectures
Large Language Models (LLMs) · concept · 0.781
Transformer-based models like GPT-4, LaMDA, PaLM; assessed for GWT indicators.
Language models are some of the most remarkable computer programs in existence. · quote · 0.779
Opening sentence setting the stage for the importance of interpretability.
Language models are few-shot learners (Brown et al., 2020) · concept · 0.776
Demonstrated transformer capabilities in mathematical understanding and logic; cited to motivate transformer versatility.
Emotion Concepts and their Function in a Large Language Model · concept · 0.769
The prior Anthropic paper whose findings about emotion features in Claude this paper builds upon and extends