thinker
active
thinker:paul-christiano

Paul Christiano

Cited for RL from human preferences (2017) and debates/discussions.

Authored
1
Introduces
0
Studies
0
Affiliations
0
Cited by
5

Authored papers (1)

  • At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and causal intervention experiments conducted on the LLaMA-2-7B, 13B, and 70B family. PCA of residual stream activations at the most-downstream causally-implicated hidden state (group (b), e.g., layer 15 of LLaMA-2-13B over end-of-sentence punctuation) reveals clear linear separation of true and false statements across structurally and topically diverse datasets including cities (1,496 rows), sp_en_trans (354 rows), larger_than/smaller_than (1,980 rows each), and uncurated datasets from Azaria & Mitchell and Casper et al. The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means direction between true and false activations and optionally applies a covariance correction, and shows MM outperforms logistic regression and contrast-consistent search (CCS) on causal intervention metrics in 7 of 8 experimental conditions, achieving normalized indirect effects as high as 0.97 (false→true) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training data, despite comparable classification accuracy across methods. Probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans for LLaMA-2-13B and 70B, a cross-topic generalization that fails for LLaMA-2-7B, where representations cluster by surface-level token features instead. The paper argues this implies that truth has a geometrically coherent, causally active linear representation in large transformers, and that interventions targeting this direction can reliably flip a model's expressed truth judgments on out-of-distribution inputs.

More papers — OpenAlex / S2

Co-authors (12)

Other inbound relations (2)

Recent mentions (4)