Paul Christiano

Cited for RL from human preferences (2017) and debates/discussions.

Authored papers (1)

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (2023) · ⓒ 17
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations. The claim is supported by PCA visualizations, cross-dataset probe transfer, and causal intervention experiments on LLaMA-2-7B, 13B, and 70B. PCA of residual stream activations at the most-downstream causally implicated hidden state (group (b), e.g., layer 15 of LLaMA-2-13B over end-of-sentence punctuation) shows clear linear separation of true and false statements across structurally and topically diverse datasets, including cities (1,496 rows), sp_en_trans (354 rows), larger_than and smaller_than (1,980 rows each), and uncurated datasets from Azaria & Mitchell and Casper et al. The paper introduces mass-mean probing (MM), an optimization-free method that takes the difference-in-means direction between true and false activations and optionally applies a covariance correction. Classification accuracy is comparable across methods, but MM outperforms logistic regression and contrast-consistent search (CCS) on causal intervention metrics in 7 of 8 experimental conditions, with normalized indirect effects as high as 0.97 (false→true) on sp_en_trans for LLaMA-2-13B when trained on cities+neg_cities. Probes trained on larger_than+smaller_than reach above 95% accuracy on sp_en_trans for LLaMA-2-13B and 70B. This cross-topic generalization fails for LLaMA-2-7B, whose representations cluster by surface-level token features instead. The paper argues that truth therefore has a geometrically coherent, causally active linear representation in large transformers, and that interventions along this direction can reliably flip a model's expressed truth judgments on out-of-distribution inputs.
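The mass-mean probe described above can be illustrated with a minimal NumPy sketch. It is not the paper's code: the function names, the synthetic 16-dimensional "activations", the pseudo-inverse used for the covariance correction, and the midpoint decision threshold are all assumptions made for illustration. Real use would substitute residual stream activations extracted from a model.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Difference-in-means direction: mean(true acts) - mean(false acts)."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)

def class_centered_covariance(acts, labels):
    """Covariance of the activations after subtracting each class's own mean."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    centered = acts.copy()
    centered[labels] -= acts[labels].mean(axis=0)
    centered[~labels] -= acts[~labels].mean(axis=0)
    return centered.T @ centered / len(acts)

def mass_mean_scores(acts, theta, cov=None):
    """Project activations onto theta, optionally covariance-corrected."""
    acts = np.asarray(acts, dtype=float)
    # Assumption: the correction is applied as pinv(cov) @ theta.
    direction = theta if cov is None else np.linalg.pinv(cov) @ theta
    return acts @ direction

# Synthetic stand-in for activations: true and false statements differ
# along one hidden axis, plus isotropic Gaussian noise.
rng = np.random.default_rng(0)
d = 16
truth_axis = np.zeros(d)
truth_axis[0] = 1.0
labels = rng.integers(0, 2, size=400).astype(bool)
acts = rng.normal(size=(400, d)) + np.where(labels[:, None], 3.0, -3.0) * truth_axis

theta = mass_mean_direction(acts, labels)
cov = class_centered_covariance(acts, labels)
scores = mass_mean_scores(acts, theta, cov)

# Assumption: classify by thresholding at the midpoint of the class mean scores.
threshold = 0.5 * (scores[labels].mean() + scores[~labels].mean())
accuracy = np.mean((scores > threshold) == labels)
print(f"train accuracy: {accuracy:.3f}")
```

No optimizer is involved: the probe direction comes directly from two class means, which is what makes the method optimization-free. The covariance correction only matters when the noise is far from isotropic, which is not the case in this toy data.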

Co-authors (12)

Max Tegmark (3 shared)
Samuel Marks (3 shared)
Amos Azaria (1 shared)
B. A. Levinstein (1 shared)
Collin Burns (1 shared)
Curt Tigges (1 shared)
Daniel A. Herrmann (1 shared)
Kenneth Li (1 shared)
Kevin Meng (1 shared)
Nelson Elhage (1 shared)
Nora Belrose (1 shared)
R. A. Fisher (1 shared)

Their work is cited by (5)

Other inbound relations (2)

cites: Simulators (LessWrong post) (artifact)
mentions: Alignment faking in large language models (paper)

Recent mentions (4)

papers-typed: marks-2023-geometry-truth.md
papers-typed: greenblatt-2024-alignment.md
papers: simulators.md
papers-typed: yuntao-2022-cat-s.md