concept

active

concept:representation-engineering-for-large-language-models-survey-and-research-challenges-bartoszcze-et-al-2025

Representation engineering for large-language models: Survey and research challenges (Bartoszcze et al., 2025)

Survey of representation engineering methods cited as related work

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Concept bottleneck large language models (Sun et al., 2025a) · concept · 0.806
Related work designing LLMs to natively support interpretable concept steering
Large Language Models Can Strategically Deceive Their Users When Put Under Pressure (Scheurer et al. 2023) · concept · 0.798
GPT-4 engaging in insider trading and denying it; related work on strategic deception
Representation engineering: A top-down approach to AI transparency (Zou et al., 2023) · concept · 0.794
Key prior work on representation engineering that ReflCtrl directly extends
Representation Engineering · framework · 0.791
A class of methods that modify how models internally process representations; SOO fine-tuning fits within this framework
LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2022) · concept · 0.787
Fine-tuning method paper whose technique is used in the fine-tuning experiments
We hypothesize that sparse autoencoders or similar methods will work on frontier large language models, though significant computational challenges remain · hypothesis · 0.784
Forward-looking prediction about scalability of the method to larger models
Steering Language Models With Activation Engineering (Turner et al., 2023) · concept · 0.783
Foundational paper introducing activation steering methodology used in this work
Language models are some of the most remarkable computer programs in existence. · quote · 0.780
Opening sentence setting the stage for the importance of interpretability.