concept
active
concept:probing-complexity-accuracy-trade-off
Probing Complexity–Accuracy Trade-off
Longstanding debate in the probing literature over whether complex probes reveal genuinely encoded information or merely memorise the task; this paper revives the debate for causal abstraction.
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors connect their finding to the prior probing literature debate
Concepts (1)
concept
- Non-Linear Representation Dilemma · analogous_to · Core contribution: the impasse where lifting linearity in alignment maps makes causal abstraction vacuous, but keeping it may miss non-linearly encoded features
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Criterion requiring that the model's description of its internal state be accurate, distinguishing genuine introspection from confabulation.
- The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.741 · Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
- Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Motivation for causal evaluation over purely behavioural probing accuracy
- Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
- Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence