thinker:samuel-marks
Samuel Marks
Authored papers (3)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (2025)
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of orthonormal basis vectors, each of which independently causally mediates true/false behavior. Applying the gradient-based concept cone framework (introduced by Wollschläger et al. 2025 for refusal) to truth, experiments across Qwen2.5-3B, Qwen2.5-7B, Qwen2.5-14B, Gemma-2-2B, and Gemma-2-9B show that Qwen2.5-7B and Gemma-2-9B sustain near-100% Answer Switching Rate (ASR) across all tested cone dimensionalities from 1 to 5, confirming at least a 5-dimensional truth-mediating subspace in those models. Directional ablation using discovered cone vectors on 200 Alpaca prompts yields mean KL divergences of only 0.026–0.045 across models, confirming surgical specificity. Cosine similarities between the classic difference-in-means (DIM) truth vector and all cone basis vectors beyond the first are on the order of 10⁻⁹, establishing that the additional axes are genuinely orthogonal to DIM rather than refinements of it. Truth-related directions reliably emerge between 60–75% of normalized layer depth, peaking at the final token position. These findings imply that models may be more vulnerable to adversarial manipulation of truthfulness than single-direction accounts suggest, because multiple independently steerable dimensions of factual behavior exist and can be exploited without disturbing the primary direction detectable by standard probing.
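The two geometric operations named in this summary, drawing a vector from a concept cone and directionally ablating it, can be sketched on toy data. This is a minimal illustration only: the basis here comes from a random QR decomposition rather than the gradient-based optimization the paper uses, and the names `sample_from_cone` and `ablate_direction`, the dimensions, and the random activations are all assumptions made for the example.

```python
import numpy as np

def sample_from_cone(basis, rng):
    """Return a unit vector that is a nonnegative combination of the rows of `basis`."""
    weights = rng.random(basis.shape[0])  # nonnegative coefficients in [0, 1)
    vec = weights @ basis
    return vec / np.linalg.norm(vec)

def ablate_direction(acts, direction):
    """Remove from each activation row its component along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

rng = np.random.default_rng(0)

# Hypothetical orthonormal basis: 3 vectors in a 16-dimensional activation space.
q, _ = np.linalg.qr(rng.normal(size=(16, 3)))
basis = q.T                                # shape (3, 16), rows are orthonormal

acts = rng.normal(size=(8, 16))            # toy stand-in for residual-stream activations
direction = sample_from_cone(basis, rng)
ablated = ablate_direction(acts, direction)

# Cosine similarity between two basis vectors (both are unit norm already).
cos_01 = float(basis[0] @ basis[1])
```

After ablation, every row of `ablated` has zero projection onto `direction`, which is the property the KL-divergence check in the summary relies on: only the targeted component is removed and the rest of the activation is left alone.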
Contrastive activation steering can suppress evaluation-awareness and elicit genuine deployment behavior from a deliberately trained model organism, not merely silence verbalizations of being tested. Working with Llama 3.3 Nemotron Super 49B, the authors trained an evaluation-aware model using a two-step pipeline—synthetic document fine-tuning (SDF) on 115.6 million tokens followed by four rounds of expert iteration on 41,290 examples—that causes the model to write Python type hints 80–86% of the time in evaluation contexts while writing them only 33% of the time in true deployment contexts. A fictional evaluation company cue ('Wood Labs') was embedded as an adversarial test: even when user prompts explicitly claimed deployment, the Wood Labs system prompt maintained type hint rates above 80%, and simply concatenating all 16 contrastive prompt pairs as text failed to reduce that rate to deployment levels. The method introduced—contrastive activation steering extracted from the pre-fine-tuned base model using 16 deployment/evaluation prompt pairs applied at layers 10, 14, 18, 22, 26, and 30 with scaling factor 0.6—reduced type hint rates to below 11% even with the evaluation cue present, while random Gaussian steering vectors of matched norm produced no comparable suppression. Crucially, the steering vector was extracted before any fine-tuning, making it implausible that it directly encodes type-hint information; it can only work by modifying the model's contextual beliefs, a mechanism corroborated by GPT-5-graded reasoning traces showing a strong correlation between steered deployment beliefs and reduced type hint rates. The paper argues this validates activation steering as a tool for AI evaluators to uncover deployment behavior during safety-critical evaluations, including honeypot scenarios designed to detect sandbagging and alignment faking.
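A rough sketch of the steering arithmetic described here, on toy arrays rather than a real model: the vector is the mean activation difference over the deployment/evaluation prompt pairs, it is added to the hidden states with a scaling factor, and a Gaussian vector of matched norm serves as the control. The function names, the toy dimensions, and the choice to scale the raw (unnormalized) difference are assumptions; the summary gives only the pair count (16), the layers, and the factor 0.6.

```python
import numpy as np

def contrastive_vector(deploy_acts, eval_acts):
    """Mean activation difference (deployment minus evaluation) over prompt pairs."""
    return (deploy_acts - eval_acts).mean(axis=0)

def steer(hidden, vector, scale=0.6):
    """Add the scaled steering vector to every token position of one layer."""
    return hidden + scale * vector

rng = np.random.default_rng(0)
n_pairs, d_model = 16, 32                  # 16 pairs as in the summary; d_model is a toy value

# Toy activations for each prompt of a pair at a single layer.
deploy_acts = rng.normal(size=(n_pairs, d_model)) + 1.0
eval_acts = rng.normal(size=(n_pairs, d_model)) - 1.0

vec = contrastive_vector(deploy_acts, eval_acts)
hidden = rng.normal(size=(5, d_model))     # 5 token positions at one steered layer
steered = steer(hidden, vec)

# Control: random Gaussian vector rescaled to the same norm as the steering vector.
control = rng.normal(size=d_model)
control *= np.linalg.norm(vec) / np.linalg.norm(control)
steered_control = steer(hidden, control)
```

In a real setup the same addition would be applied at each of the listed layers during the forward pass, for example through forward hooks; that wiring is omitted here because it depends on the model implementation.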
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and causal intervention experiments conducted on the LLaMA-2-7B, 13B, and 70B family. PCA of residual stream activations at the most-downstream causally-implicated hidden state (group (b), e.g., layer 15 of LLaMA-2-13B over end-of-sentence punctuation) reveals clear linear separation of true and false statements across structurally and topically diverse datasets including cities (1,496 rows), sp_en_trans (354 rows), larger_than/smaller_than (1,980 rows each), and uncurated datasets from Azaria & Mitchell and Casper et al. The paper introduces mass-mean probing (MM), an optimization-free method that computes the difference-in-means direction between true and false activations and optionally applies a covariance correction, and shows MM outperforms logistic regression and contrast-consistent search (CCS) on causal intervention metrics in 7 of 8 experimental conditions, achieving normalized indirect effects as high as 0.97 (false→true) on sp_en_trans for LLaMA-2-13B with cities+neg_cities training data, despite comparable classification accuracy across methods. Probes trained on larger_than+smaller_than achieve >95% accuracy on sp_en_trans for LLaMA-2-13B and 70B, a cross-topic generalization that fails for LLaMA-2-7B, where representations cluster by surface-level token features instead. The paper argues this implies that truth has a geometrically coherent, causally active linear representation in large transformers, and that interventions targeting this direction can reliably flip a model's expressed truth judgments on out-of-distribution inputs.
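Mass-mean probing as summarized above can be sketched in a few lines: the probe direction is the difference between the mean activation of true statements and the mean activation of false statements, optionally multiplied by an inverse covariance. The sketch below runs on synthetic Gaussian clusters, not model activations; the function names, the use of a pseudo-inverse, the class-centered covariance estimate, and the zero decision threshold on centered data are assumptions for illustration.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Difference between the mean of true (1) and false (0) activations."""
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def covariance_corrected(acts, labels, theta):
    """Apply an inverse-covariance correction, using class-centered activations."""
    centered = acts.copy()
    for c in (0, 1):
        centered[labels == c] -= acts[labels == c].mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    return np.linalg.pinv(cov) @ theta

rng = np.random.default_rng(0)
n_per_class, d_model = 200, 8

# Synthetic data: two unit-variance clusters separated along a random unit direction.
axis = rng.normal(size=d_model)
axis /= np.linalg.norm(axis)
true_acts = rng.normal(size=(n_per_class, d_model)) + 2.0 * axis
false_acts = rng.normal(size=(n_per_class, d_model)) - 2.0 * axis
acts = np.vstack([true_acts, false_acts])
acts -= acts.mean(axis=0)                  # center so that a zero threshold is reasonable
labels = np.array([1] * n_per_class + [0] * n_per_class)

theta = mass_mean_direction(acts, labels)
theta_corrected = covariance_corrected(acts, labels, theta)

# Classify by the sign of the projection onto each probe direction.
acc_plain = float(((acts @ theta > 0).astype(int) == labels).mean())
acc_corrected = float(((acts @ theta_corrected > 0).astype(int) == labels).mean())
```

No optimizer is involved, which is what makes the method optimization-free in the sense the summary describes; the causal intervention results reported for the paper are a separate matter and are not reproduced by this toy example.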
Affiliations (1)
- Northeastern University (institute)
Co-authors (12)
- Max Tegmark (10 shared)
- Amos Azaria (4 shared)
- Tom Mitchell (4 shared)
- Wes Gurnee (4 shared)
- B. A. Levinstein (3 shared)
- Cole Blondin (3 shared)
- Collin Burns (3 shared)
- Curt Tigges (3 shared)
- Daniel A. Herrmann (3 shared)
- Kenneth Li (3 shared)
- Kevin Meng (3 shared)
- Kevin Zhu (3 shared)
Their work is cited by (5)
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks (3× refs)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (3× refs)
- Testing the Limits of Truth Directions in LLMs (3× refs)
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior (3× refs)
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts (3× refs)
Recent mentions (4)
- papers-typedtian-2025-steering-evaluation.md
- papers-typedyu-2025-directions-cones.md
- papers-typedmarks-2023-geometry-truth.md
- papers-typedmckenzie-2026-endogenous-resistance.md