thinker:curt-tigges
Curt Tigges
Author of a paper showing that language models linearly represent sentiment.
Authored papers (1)
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations. The claim is supported by PCA visualizations, cross-dataset probe transfer, and causal intervention experiments on the LLaMA-2 family (7B, 13B, and 70B).

PCA of residual stream activations at the most-downstream causally implicated hidden state (group (b), e.g., layer 15 of LLaMA-2-13B over end-of-sentence punctuation) reveals clear linear separation of true and false statements. The separation holds across structurally and topically diverse datasets, including cities (1,496 rows), sp_en_trans (354 rows), larger_than and smaller_than (1,980 rows each), and uncurated datasets from Azaria & Mitchell and Casper et al.

The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means direction between true and false activations and optionally applies a covariance correction. Classification accuracy is comparable across methods, but MM outperforms logistic regression and contrast-consistent search (CCS) on causal intervention metrics in 7 of 8 experimental conditions. Its normalized indirect effects reach as high as 0.97 (false→true) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training data.

Probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans for LLaMA-2-13B and 70B. This cross-topic generalization fails for LLaMA-2-7B, where representations cluster by surface-level token features instead. The paper argues that truth therefore has a geometrically coherent, causally active linear representation in large transformers, and that interventions targeting this direction can reliably flip a model's expressed truth judgments on out-of-distribution inputs.
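The mass-mean probing idea summarized above can be illustrated with a minimal NumPy sketch. It is not the paper's code: the function name `mass_mean_probe`, the pooled within-class covariance estimate, the midpoint decision threshold, and the synthetic Gaussian "activations" are all assumptions made for illustration.

```python
import numpy as np


def mass_mean_probe(acts, labels, use_covariance=True):
    """Hypothetical helper: difference-in-means probe with optional covariance correction.

    acts:   (n, d) array of activations (here synthetic, not real model states)
    labels: (n,) array with 1 for true statements and 0 for false ones
    Returns the difference-in-means direction and a predict function.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)

    mu_true = acts[labels].mean(axis=0)
    mu_false = acts[~labels].mean(axis=0)
    theta = mu_true - mu_false  # difference-in-means direction

    if use_covariance:
        # Assumption: pooled covariance of the class-centered activations.
        centered = np.where(labels[:, None], acts - mu_true, acts - mu_false)
        cov = centered.T @ centered / len(acts)
        weights = np.linalg.pinv(cov) @ theta
    else:
        weights = theta

    # Assumption: threshold at the midpoint between the two class means.
    midpoint = (mu_true + mu_false) / 2.0

    def predict(x):
        return ((np.asarray(x, dtype=float) - midpoint) @ weights > 0).astype(int)

    return theta, predict


# Synthetic demo: two Gaussian clusters separated along the first axis.
rng = np.random.default_rng(0)
dim, n_per_class = 16, 200
true_direction = np.zeros(dim)
true_direction[0] = 6.0

false_acts = rng.normal(size=(n_per_class, dim))
true_acts = rng.normal(size=(n_per_class, dim)) + true_direction
acts = np.vstack([false_acts, true_acts])
labels = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

theta, predict = mass_mean_probe(acts, labels)
accuracy = float((predict(acts) == labels).mean())
cosine = float(theta @ true_direction
               / (np.linalg.norm(theta) * np.linalg.norm(true_direction)))
print(f"accuracy={accuracy:.3f}, cosine with planted direction={cosine:.3f}")
```

No optimization loop is involved: the direction comes directly from the two class means, which is what makes the method optimization-free. In a causal intervention setting, `theta` would be the vector added to or subtracted from a hidden state, while the covariance-corrected `weights` are used only for classification in this sketch.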
Co-authors (12)
- Max Tegmark (3 shared)
- Samuel Marks (3 shared)
- Amos Azaria (1 shared)
- B. A. Levinstein (1 shared)
- Collin Burns (1 shared)
- Daniel A. Herrmann (1 shared)
- Kenneth Li (1 shared)
- Kevin Meng (1 shared)
- Nelson Elhage (1 shared)
- Nora Belrose (1 shared)
- Paul Christiano (1 shared)
- R. A. Fisher (1 shared)
Their work is cited by (5)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (1× refs)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (1× refs)
- Testing the Limits of Truth Directions in LLMs (1× refs)
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior (1× refs)
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts (1× refs)
Other inbound relations (1)
Recent mentions (2)
- papers-typed/marks-2023-geometry-truth.md
- papers-typed/fu-chieh-2025-unveiling-latent.md