claim

active

claim:logistic-regression-fails-to-identify-the-true-feature-direction-when-a-confounding-feature-is-non-orthogonal-to-the-truth-direction-converging-instead-to-the-maximum-margin-separator

Logistic regression fails to identify the true feature direction when a confounding feature is non-orthogonal to the truth direction, converging instead to the maximum-margin separator

Motivates the introduction of mass-mean probing as an alternative to logistic regression (LR)
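
The claim can be illustrated with a small sketch. Everything in it is a hypothetical toy setup rather than the paper's data: a 2-D truth direction `theta_t`, a non-orthogonal confounding direction `theta_f`, four points of the form y * theta_t + s * theta_f, and a bias-free, unregularized logistic regression fit by plain gradient descent. Under these assumptions the LR weights should drift toward the maximum-margin normal (orthogonal to `theta_f`), while the difference of class means stays aligned with `theta_t`.

```python
import math

# Hypothetical toy setup (not the paper's data): the confounding direction
# theta_f is NOT orthogonal to the truth direction theta_t.
theta_t = (1.0, 0.0)
theta_f = (1.0, 1.0)

# Each point is y * theta_t + s * theta_f, where y is the label (+1 true,
# -1 false) and s is an independent confounder sign.
data = [
    ((y * theta_t[0] + s * theta_f[0], y * theta_t[1] + s * theta_f[1]), y)
    for y in (1, -1)
    for s in (1, -1)
]

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

def fit_logistic(points, steps=20000, lr=1.0):
    """Bias-free, unregularized logistic regression via gradient descent."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0]
        for x, y in points:
            margin = y * (w[0] * x[0] + w[1] * x[1])
            # Derivative of log(1 + exp(-margin)) with respect to the margin,
            # times y (chain rule through margin = y * w.x).
            coeff = -y / (1.0 + math.exp(margin))
            g[0] += coeff * x[0]
            g[1] += coeff * x[1]
        w[0] -= lr * g[0] / len(points)
        w[1] -= lr * g[1] / len(points)
    return (w[0], w[1])

def mass_mean_direction(points):
    """Difference between the mean of the true class and the false class."""
    pos = [x for x, y in points if y == 1]
    neg = [x for x, y in points if y == -1]
    return (
        sum(p[0] for p in pos) / len(pos) - sum(p[0] for p in neg) / len(neg),
        sum(p[1] for p in pos) / len(pos) - sum(p[1] for p in neg) / len(neg),
    )

w_lr = fit_logistic(data)
w_mm = mass_mean_direction(data)
# For these four points the max-margin normal is orthogonal to theta_f.
max_margin = (1.0, -1.0)

print("cos(LR, truth)      =", round(cosine(w_lr, theta_t), 3))
print("cos(LR, max-margin) =", round(cosine(w_lr, max_margin), 3))
print("cos(MM, truth)      =", round(cosine(w_mm, theta_t), 3))
```

Because the data are separable and there is no regularization, the LR weight norm keeps growing and only the direction settles, slowly; the step count and learning rate here are arbitrary choices for the sketch.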

Source paper

extracted_from

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

(2023) · Samuel Marks · Max Tegmark

Neighborhood — ranked by edge-count

Papers (1)

paper

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
introduces · supports

Findings (1)

finding

A mass-mean (MM) probe trained on the likely dataset achieves an NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes
supports
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans

Frameworks (2)

framework

Superposition Hypothesis
supports
Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
Mass-Mean Probing
supports
Introduced in this paper: an optimization-free probing technique using difference-in-means direction with optional covariance correction
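
A minimal sketch of the mass-mean idea described above, with several assumptions stated up front: the 2-D activations are invented for illustration, "covariance correction" is read here as multiplying the difference-in-means direction by the inverse covariance of the class-centered data, and the covariance divides by N. The helper names (`pooled_covariance`, `solve_2x2`, `probe`) are hypothetical, not from the paper.

```python
import math

# Hypothetical toy activations in 2-D (not taken from the paper).
true_acts = [(2.0, 1.0), (0.0, -1.0), (1.0, 0.5), (1.0, -0.5)]
false_acts = [(-2.0, -1.0), (0.0, 1.0), (-1.0, -0.5), (-1.0, 0.5)]

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def pooled_covariance(groups):
    """Covariance after subtracting each class's own mean (assumption: divide by N)."""
    sxx = sxy = syy = 0.0
    total = 0
    for points in groups:
        mx, my = mean(points)
        for x, y in points:
            dx, dy = x - mx, y - my
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
        total += len(points)
    return ((sxx / total, sxy / total), (sxy / total, syy / total))

def solve_2x2(m, v):
    """Return m^{-1} v for a 2x2 matrix m using the closed-form inverse."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d * v[0] - b * v[1]) / det, (-c * v[0] + a * v[1]) / det)

def probe(direction, x):
    """Sigmoid of the projection of x onto the probe direction."""
    return 1.0 / (1.0 + math.exp(-(direction[0] * x[0] + direction[1] * x[1])))

# Optimization-free direction: difference of class means.
mu_true, mu_false = mean(true_acts), mean(false_acts)
theta_mm = (mu_true[0] - mu_false[0], mu_true[1] - mu_false[1])

# Optional covariance correction: Sigma^{-1} theta_mm.
sigma = pooled_covariance([true_acts, false_acts])
theta_corrected = solve_2x2(sigma, theta_mm)

print("mass-mean direction:", theta_mm)
print("corrected direction:", theta_corrected)
print("plain probe on (0, -1):    ", probe(theta_mm, (0.0, -1.0)))
print("corrected probe on (0, -1):", probe(theta_corrected, (0.0, -1.0)))
```

In this toy data the plain direction leaves the true point (0, -1) exactly on the decision boundary, while the corrected direction classifies all eight points; which variant is preferable depends on whether the goal is recovering a feature direction or maximizing classification accuracy.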

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear truth directions in LLMs are reliable primarily in factual recall cases and break down when truth assessment depends on computing and storing intermediate results. · claim · 0.782
Central empirical conclusion of the paper about the fundamental limits of truth directions.
The two-dimensional subspace reported by Burger et al. (2024) seems to reflect a stage of transition in the model's processing, rather than a universal property of truth directions. · quote · 0.767
Load-bearing interpretive claim about the layer-specificity of Burger et al.'s finding.
Multiple semantically adjacent truth directions make models more vulnerable to manipulations that shift outputs without obvious signs in the primary truth direction · claim · 0.761
Safety implication derived from multi-dimensional truth structure finding
Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.756
Question about practical safety application of feature monitoring.
Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions. · hypothesis · 0.752
Explanation for why dictionary learning can recover many more features than dimensions.
Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.752
Experiment 1 finding localizing where truth can be causally mediated
Accuracy does not vary linearly with latent reflection directions; instead it follows a more non-linear mapping that requires deeper theoretical treatment. · claim · 0.751
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.750
Motivating hypothesis for Section 5's investigation of prompt template effects.