Tom Mitchell

Co-author of prior truthfulness probing work

Authored: 2 | Introduces: 0 | Studies: 0 | Affiliations: 0 | Cited by: 5

Authored papers (2)

  • Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones: sets of all nonnegative linear combinations of orthonormal basis vectors, each of which independently causally mediates true/false behavior. The paper applies the gradient-based concept cone framework (introduced by Wollschläger et al. 2025 for refusal) to truth. Experiments across Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B show that Qwen2.5-7B and Gemma-2-9B sustain near-100% Answer Switching Rate (ASR) across all tested cone dimensionalities from 1 to 5, which supports a truth-mediating subspace of at least 5 dimensions in those models. Directional ablation using the discovered cone vectors on 200 Alpaca prompts yields mean KL divergences of only 0.026–0.045 across models, which the paper takes as evidence of surgical specificity. Cosine similarities between the classic difference-in-means (DIM) truth vector and all cone basis vectors beyond the first are on the order of 10⁻⁹, so the additional axes are orthogonal to DIM rather than refinements of it. Truth-related directions reliably emerge between 60% and 75% of normalized layer depth, peaking at the final token position. These findings imply that models may be more vulnerable to adversarial manipulation of truthfulness than single-direction accounts suggest, because multiple independently steerable dimensions of factual behavior exist and can be exploited without disturbing the primary direction that standard probing detects.

  • At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations. The claim is supported by PCA visualizations, cross-dataset probe transfer, and causal intervention experiments conducted on the LLaMA-2 family (7B, 13B, and 70B). PCA of residual stream activations at the most-downstream causally-implicated hidden state (group (b), e.g., layer 15 of LLaMA-2-13B over end-of-sentence punctuation) reveals clear linear separation of true and false statements across structurally and topically diverse datasets, including cities (1,496 rows), sp_en_trans (354 rows), larger_than/smaller_than (1,980 rows each), and uncurated datasets from Azaria & Mitchell and Casper et al. The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means direction between true and false activations and optionally applies a covariance correction. Although classification accuracy is comparable across methods, MM outperforms logistic regression and contrast-consistent search (CCS) on causal intervention metrics in 7 of 8 experimental conditions, achieving normalized indirect effects as high as 0.97 (false→true) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training data. Probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans for LLaMA-2-13B and 70B, a cross-topic generalization that fails for LLaMA-2-7B, where representations cluster by surface-level token features instead. The paper argues that truth therefore has a geometrically coherent, causally active linear representation in large transformers, and that interventions targeting this direction can reliably flip a model's expressed truth judgments on out-of-distribution inputs.
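
The geometric operations named in the two summaries above (a difference-in-means direction, directional ablation, and a cone of nonnegative combinations of orthonormal basis vectors) can be illustrated with a minimal sketch. It is not the code of either paper: the toy three-dimensional "activations", the function names, and the hand-picked second basis vector are all assumptions made for illustration, and the optional covariance correction of mass-mean probing is omitted. Plain Python lists are used to keep the example dependency-free.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def diff_in_means_direction(true_acts, false_acts):
    """Unit vector pointing from the mean false activation to the mean true one."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    return normalize([t - f for t, f in zip(mu_true, mu_false)])

def ablate(activation, direction):
    """Directional ablation: subtract the component along a unit direction."""
    coeff = dot(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]

def cone_point(basis, coeffs):
    """A point in the cone: a nonnegative combination of the basis vectors."""
    if any(c < 0 for c in coeffs):
        raise ValueError("cone coefficients must be nonnegative")
    return [sum(c * b[i] for c, b in zip(coeffs, basis))
            for i in range(len(basis[0]))]

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

# Toy activations (hypothetical): true and false statements differ along axis 0.
true_acts = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2], [2.2, 0.0, -0.2]]
false_acts = [[-2.0, 0.1, 0.0], [-1.8, -0.1, 0.2], [-2.2, 0.0, -0.2]]

dim_direction = diff_in_means_direction(true_acts, false_acts)

# After ablation the activation has no remaining component along the direction.
ablated = ablate(true_acts[0], dim_direction)

# A hand-picked second unit vector orthogonal to the first (illustrative only).
second_axis = [0.0, 1.0, 0.0]
steering_vector = cone_point([dim_direction, second_axis], [0.5, 1.5])
```

In this toy setup `cosine(dim_direction, second_axis)` is approximately zero, which mirrors the sense in which the summaries describe additional cone axes as orthogonal to the difference-in-means direction. In the papers the directions are derived from real model activations rather than chosen by hand.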

Co-authors (12)

Recent mentions (2)