claim
active
claim:logistic-regression-fails-to-identify-the-true-feature-direction-when-a-confounding-feature-is-non-orthogonal-to-the-truth-direction-converging-instead-to-the-maximum-margin-separator
Logistic regression fails to identify the true feature direction when a confounding feature is non-orthogonal to the truth direction, converging instead to the maximum-margin separator
Motivates the introduction of mass-mean probing as an alternative to logistic regression (LR)
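The claim can be illustrated with a small NumPy sketch. This is a toy construction, not the source paper's experiment: the 2-D geometry, the 45-degree confounder, the grid of confounder strengths, and the names theta_t, theta_f, and mm_dir are all assumptions made for illustration. Plain gradient descent on the logistic loss stands in for an LR solver, and the bias term is omitted because the toy data are symmetric about the origin.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(unit(a) @ unit(b))

# Hypothetical toy geometry (an assumption, not from the source):
# truth lies along the x-axis, the confounder at 45 degrees to it.
theta_t = np.array([1.0, 0.0])
theta_f = unit(np.array([1.0, 1.0]))

# Each point is (label * theta_t) + (s * theta_f), with the confounder
# strength s independent of the label.
s = np.linspace(-2.0, 2.0, 50)
X = np.concatenate([+theta_t + s[:, None] * theta_f,
                    -theta_t + s[:, None] * theta_f])
y = np.concatenate([np.ones(50), -np.ones(50)])

# Mass-mean direction: difference of the two class means.
mm_dir = X[y > 0].mean(axis=0) - X[y < 0].mean(axis=0)

# Unregularized logistic regression fit by plain gradient descent.
# On separable data the weight direction drifts slowly toward the
# maximum-margin separator, so many steps are used.
w = np.zeros(2)
for _ in range(20000):
    margins = y * (X @ w)
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 1.0 * grad  # step size 1.0 is an illustrative choice

print("mass-mean vs truth:     ", round(cosine(mm_dir, theta_t), 3))
print("logistic vs truth:      ", round(cosine(w, theta_t), 3))
print("logistic vs confounder: ", round(cosine(w, theta_f), 3))
```

In this construction the difference of class means coincides with theta_t, while the gradient-descent weights tilt away from theta_t toward the direction orthogonal to theta_f, which is the maximum-margin separator for two parallel bands of points.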
Source paper
extracted_from · (2023) · Samuel Marks · Max Tegmark
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- The mass-mean (MM) probe trained on the likely dataset is a surprisingly effective causal baseline, due to the correlation between truth and probability on sp_en_trans
Frameworks (2)
framework
- Superposition Hypothesis · supports · Core theoretical framework: neural networks represent more features than neurons by encoding features as directions in superposition
- Mass-Mean Probing · supports · Introduced in this paper: an optimization-free probing technique using difference-in-means direction with optional covariance correction
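The difference-in-means direction described in the Mass-Mean Probing entry can be sketched in a few lines of NumPy. The function name mass_mean_direction, its arguments, and the toy data are hypothetical. The form of the covariance correction used here (multiplying by the pseudo-inverse of the pooled within-class covariance) is an assumption about what "optional covariance correction" means, not something the entry states.

```python
import numpy as np

def mass_mean_direction(X, y, covariance_correct=False):
    """Difference-in-means probe direction (hypothetical helper).

    X: (n, d) array of activations; y: length-n labels, 1 = true, 0 = false.
    If covariance_correct is set, the direction is multiplied by the
    pseudo-inverse of the pooled within-class covariance (an assumed
    form of the correction).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(bool)
    mu_pos = X[y].mean(axis=0)
    mu_neg = X[~y].mean(axis=0)
    theta = mu_pos - mu_neg  # no optimization involved
    if covariance_correct:
        # Center each class on its own mean, then pool the residuals.
        centered = np.concatenate([X[y] - mu_pos, X[~y] - mu_neg])
        sigma = centered.T @ centered / len(centered)
        theta = np.linalg.pinv(sigma) @ theta
    return theta

# Toy data (illustrative only): class means at (2, 0) and (-2, 0),
# with within-class spread correlated across the two coordinates.
X = np.array([[4, 1], [0, -1], [2, 1], [2, -1],
              [0, 1], [-4, -1], [-2, 1], [-2, -1]], dtype=float)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

print(mass_mean_direction(X, y))
print(mass_mean_direction(X, y, covariance_correct=True))
```

The pseudo-inverse is used rather than a plain inverse so the sketch still runs when the pooled covariance is singular, which is common when there are fewer samples than dimensions.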
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central empirical conclusion of the paper about the fundamental limits of truth directions.
- Load-bearing interpretive claim about the layer-specificity of Burger et al.'s finding.
- Safety implication derived from multi-dimensional truth structure finding
- Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.756 · Question about practical safety application of feature monitoring.
- Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions. · hypothesis · 0.752 · Explanation for why dictionary learning can recover many more features than dimensions.
- Truth-related directions reliably emerge at 60–75% of normalized layer depth in Qwen and Gemma models · finding · 0.752 · Experiment 1 finding localizing where truth can be causally mediated
- Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
- Motivating hypothesis for Section 5's investigation of prompt template effects.