claim
active
claim:direct-probes-over-learned-activations-in-standard-basis-may-fail-to-reveal-the-actual-causal-role-of-representations-because-they-are-highly-distributed
Direct probes over learned activations in the standard basis may fail to reveal the actual causal role of representations, because those representations are highly distributed across neurons
Supported by the finding that non-trivial rotations are required to find aligned representations.
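The gap between what a probe decodes and what an intervention changes can be illustrated with a toy example. The sketch below is an assumed construction, not the source paper's method: the rotation is built by hand rather than learned, and the names `u`, `Q`, `make_activations`, and `readout` are hypothetical. A binary variable is stored along one direction spread evenly over four neurons. Every single neuron then decodes the variable well, but swapping any single neuron between a base and a source input leaves the output unchanged, whereas swapping one coordinate in the rotated basis changes it every time.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 500

# Assumed toy setup: a binary variable z is stored along one distributed direction u.
u = np.ones(d) / np.sqrt(d)

# Orthonormal basis Q whose first column spans the same line as u (up to sign).
Q, _ = np.linalg.qr(np.column_stack([u, rng.normal(size=(d, d - 1))]))

def make_activations(z):
    # Signal along u plus small nuisance variation in the orthogonal complement.
    noise = 0.1 * rng.normal(size=(len(z), d - 1)) @ Q[:, 1:].T
    return z[:, None] * u + noise

def readout(h):
    # The toy "network" reads only the projection of the activation onto u.
    return np.sign(h @ u)

# Base inputs and source inputs that carry the opposite value of z.
z_base = rng.choice([-1.0, 1.0], size=n)
z_src = -z_base
h_base, h_src = make_activations(z_base), make_activations(z_src)

# Standard-basis probe: how well does the sign of each single neuron decode z?
probe_acc = [float(np.mean(np.sign(h_base[:, i]) == z_base)) for i in range(d)]

def swap_neuron(i):
    # Interchange intervention on one standard-basis neuron.
    h = h_base.copy()
    h[:, i] = h_src[:, i]
    return h

# Fraction of pairs where the output takes the source's value after the swap.
neuron_iia = [float(np.mean(readout(swap_neuron(i)) == z_src)) for i in range(d)]

# Same intervention on one coordinate of the rotated basis, then rotate back.
r_base, r_src = h_base @ Q, h_src @ Q
r_base[:, 0] = r_src[:, 0]
rotated_iia = float(np.mean(readout(r_base @ Q.T) == z_src))

print("per-neuron probe accuracy:", probe_acc)
print("per-neuron interchange accuracy:", neuron_iia)
print("rotated-basis interchange accuracy:", rotated_iia)
```

In this construction, high probe accuracy on individual neurons coexists with no causal effect from intervening on any one of them; the causal variable only becomes a single swappable coordinate after the rotation.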
Source paper
extracted_from · (2023) · Atticus Geiger · Zhengxuan Wu · Christopher Potts · Thomas Icard +1
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- Learned rotations reveal that direct probes over standard activation bases would miss the actual causal role of representations.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
- Motivated by the finding that lexical entailment decomposes into word identities.
- Tested in Section 4.4 calibration experiment; confirmed by findings.
- Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
- Key quote connecting path redundancy to interferometric information encoding.
- Core research question motivating NLA development and validation through case studies and causal interventions.