Language models are few-shot learners (Brown et al., 2020)

Demonstrated transformers' capabilities in mathematical understanding and logic; cited to motivate transformer versatility.

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
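
The filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the names `cosine` and `untyped_neighbors`, the toy embeddings, and the representation of typed edges as a set of name pairs are all hypothetical. Only the 0.65 cosine threshold and the no-typed-edge condition come from the text; sorting the result by similarity is also an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, similarity) pairs that meet the threshold and have no
    typed edge to `target` in either direction, most similar first."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target by a typed relation.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            results.append((name, sim))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: D is similar to A but already has a typed edge.
embeddings = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [0.8, 0.6],
}
typed_edges = {("A", "D")}
result = untyped_neighbors("A", embeddings, typed_edges)
```

In this toy data only `B` is returned for `A`: `C` falls below the threshold and `D` is excluded because it is already linked by a typed edge.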

In some sense, this is the simplest language model we profoundly don't understand. And so it makes a natural target for our paper. · quote · 0.819
Articulates why a one-layer transformer with MLP is the appropriate starting target for mechanistic interpretability
Language models are some of the most remarkable computer programs in existence. · quote · 0.808
Opening sentence setting the stage for the importance of interpretability.
Language Model · concept · 0.807
Primary test domain for manifold steering, including reasoning and ICL tasks
Zhu et al. 2024 - Language models represent beliefs of self and others · concept · 0.802
Key prior finding that LLMs can internally represent beliefs of self and others, motivating SOO approach
Our results demonstrate that modern language models possess at least a limited, functional form of introspective awareness. · quote · 0.798
Abstract's main conclusion.
Language Models · concept · 0.797
Primary substrate for manifold steering experiments; demonstrates method on reasoning and in-context tasks.
Ouyang et al. 2022: Training language models to follow instructions with human feedback · concept · 0.796
RLHF paper cited as a major fine-tuning technique used in commercial dialogue agents
The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.794
Motivation for using sparsity-based dictionary learning on language models