thinker:kenneth-li
Kenneth Li
Author of the Inference-Time Intervention (ITI) paper, which uses linear probes; cited for its probe-based steering method
Authored papers (1)
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations. The claim is supported by PCA visualizations, cross-dataset probe transfer, and causal intervention experiments conducted on the LLaMA-2 family (7B, 13B, and 70B).

PCA of residual stream activations at the most-downstream causally implicated hidden state (group (b), e.g., layer 15 of LLaMA-2-13B over end-of-sentence punctuation) reveals clear linear separation of true and false statements across structurally and topically diverse datasets, including cities (1,496 rows), sp_en_trans (354 rows), larger_than and smaller_than (1,980 rows each), and uncurated datasets from Azaria & Mitchell and Casper et al.

The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means direction between true and false activations and optionally applies a covariance correction. Although classification accuracy is comparable across methods, MM outperforms logistic regression and contrast-consistent search (CCS) on causal intervention metrics in 7 of 8 experimental conditions, with normalized indirect effects as high as 0.97 (false→true) on sp_en_trans for LLaMA-2-13B when trained on cities+neg_cities.

Probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans for LLaMA-2-13B and 70B. This cross-topic generalization fails for LLaMA-2-7B, where representations instead cluster by surface-level token features.

The paper argues that these results imply truth has a geometrically coherent, causally active linear representation in large transformers, and that interventions targeting this direction can reliably flip a model's expressed truth judgments on out-of-distribution inputs.
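The mass-mean probing idea summarized above can be illustrated with a minimal NumPy sketch. It is not the paper's code. It assumes the covariance correction means multiplying the direction by the inverse of the pooled within-class covariance, and it classifies by thresholding at the midpoint of the two class-mean scores. The function names, the threshold rule, and the synthetic Gaussian "activations" are all hypothetical and chosen for illustration only.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Difference-in-means direction: mean(true acts) - mean(false acts)."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)

def pooled_within_class_cov(acts, labels):
    """Covariance of activations after subtracting each class's own mean."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)
    centered = acts.copy()
    centered[labels] -= acts[labels].mean(axis=0)
    centered[~labels] -= acts[~labels].mean(axis=0)
    return centered.T @ centered / len(acts)

def mass_mean_scores(acts, direction, cov=None):
    """Project activations onto the direction.

    If cov is given, apply the (assumed) covariance correction by using
    pinv(cov) @ direction in place of the raw direction.
    """
    acts = np.asarray(acts, dtype=float)
    if cov is not None:
        direction = np.linalg.pinv(cov) @ direction
    return acts @ direction

def midpoint_threshold(scores, labels):
    """Hypothetical decision rule: midpoint between the class-mean scores."""
    labels = np.asarray(labels).astype(bool)
    return 0.5 * (scores[labels].mean() + scores[~labels].mean())

# Synthetic stand-in for residual stream activations (not real model data):
# true and false statements differ only along one planted axis.
rng = np.random.default_rng(0)
d = 16
truth_axis = np.zeros(d)
truth_axis[0] = 1.0
labels = np.repeat([1, 0], 200)
shift = np.where(labels[:, None] == 1, 2.0, -2.0) * truth_axis
acts = rng.normal(size=(400, d)) + shift

theta = mass_mean_direction(acts, labels)
cov = pooled_within_class_cov(acts, labels)
scores = mass_mean_scores(acts, theta, cov)
preds = scores > midpoint_threshold(scores, labels)
accuracy = (preds == labels.astype(bool)).mean()
print(f"train accuracy on synthetic data: {accuracy:.3f}")
```

No gradient-based fitting is involved: the direction comes from two class means, which is what makes the method optimization-free. Passing `cov=None` to `mass_mean_scores` gives the uncorrected variant.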
Studies (1)
Co-authors (12)
- Max Tegmark (3 shared)
- Samuel Marks (3 shared)
- Amos Azaria (1 shared)
- B. A. Levinstein (1 shared)
- Collin Burns (1 shared)
- Curt Tigges (1 shared)
- Daniel A. Herrmann (1 shared)
- Kevin Meng (1 shared)
- Nelson Elhage (1 shared)
- Nora Belrose (1 shared)
- Paul Christiano (1 shared)
- R. A. Fisher (1 shared)
Their work is cited by (5)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (1× refs)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (1× refs)
- Testing the Limits of Truth Directions in LLMs (1× refs)
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior (1× refs)
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts (1× refs)
Other inbound relations (1)
- cites: Psychological Steering of Large Language Models (paper)
Recent mentions (4)
- papers-typed: wu-2024-pyvene-library.md
- papers-typed: marks-2023-geometry-truth.md
- papers-typed: mckenzie-2026-endogenous-resistance.md
- papers-typed: blas-2026-psychological.md