claim

active

claim:direct-probes-over-learned-activations-in-standard-basis-may-fail-to-reveal-the-actual-causal-role-of-representations-because-they-are-highly-distributed

Direct probes over learned activations in the standard basis may fail to reveal the actual causal role of representations, because those representations are highly distributed across neurons

Supported by the finding that non-trivial rotations are required to find aligned representations.

Source paper

extracted_from

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

(2023) · Atticus Geiger · Zhengxuan Wu · Christopher Potts · Thomas Icard +1

Neighborhood — ranked by edge-count

Papers (1)

paper

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
introduces

Findings (1)

finding

Learned rotation matrices are non-trivial: the majority of basis vectors are rotated, indicating highly distributed representations
supports
Learned rotations reveal that direct probes over standard activation bases would miss the actual causal role of representations.
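
As a rough illustration of why the basis matters, the toy NumPy sketch below compares an interchange intervention on a single neuron with the same intervention on a single coordinate of a rotated basis. Everything in it is an assumption made for illustration and is not taken from the source paper: the 4-neuron hidden state, the direction `u`, the rotation `R` (built by hand with a QR decomposition rather than learned), and the helper names `encode`, `readout`, `swap_neuron`, `swap_rotated`, and `interchange_accuracy`.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Hypothetical toy setup: a binary variable z in {-1, +1} is written along a
# direction u that spreads evenly across all d neurons.
u = np.ones(d) / np.sqrt(d)

# Orthogonal matrix whose first column is +/- u (built by QR, not learned).
R, _ = np.linalg.qr(np.column_stack([u, rng.normal(size=(d, d - 1))]))


def encode(z):
    """Hidden state: z along u plus small bounded nuisance orthogonal to u."""
    eps = rng.uniform(-0.1, 0.1, size=d - 1)
    return z * u + R[:, 1:] @ eps


def readout(h):
    """Downstream behaviour depends only on the projection onto u."""
    return int(np.sign(h @ u))


def swap_neuron(base, source, i=0):
    """Interchange intervention on one coordinate of the standard basis."""
    out = base.copy()
    out[i] = source[i]
    return out


def swap_rotated(base, source, i=0):
    """Interchange intervention on one coordinate of the rotated basis."""
    rb, rs = R.T @ base, R.T @ source
    rb[i] = rs[i]
    return R @ rb


def interchange_accuracy(intervene, n_pairs=200):
    """Fraction of base/source pairs (opposite z) whose output becomes the source's z."""
    hits = 0
    for _ in range(n_pairs):
        z_base = int(rng.choice([-1, 1]))
        z_source = -z_base
        h = intervene(encode(z_base), encode(z_source))
        hits += readout(h) == z_source
    return hits / n_pairs


acc_standard = interchange_accuracy(swap_neuron)
acc_rotated = interchange_accuracy(swap_rotated)
print(acc_standard, acc_rotated)
```

In this toy, swapping one neuron moves only a quarter of the encoded variable, so the output keeps following the base input, while swapping the single rotated coordinate transfers the variable completely. A per-neuron analysis would therefore understate the variable's causal role even though the variable fully determines the output.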

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Are high-accuracy probe representations also causally relevant for the task? · question · 0.813
Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.801
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
We hypothesize that native self-report, fine-tuned introspection models, and trained activation-to-language systems will show different performance on bias-resistant localization and strength benchmarks · hypothesis · 0.787
Comparative prediction motivating future work contrasting different approaches to LLM self-knowledge
Investigating the causal substructure of neural representations is necessary to avoid misidentifying data structures of simpler representations as abstract concepts · claim · 0.780
Motivated by the finding that lexical entailment decomposes into word identities.
Larger hidden representations create more random structure that DAS can search through, allowing manipulation of counterfactual behavior even in randomly initialized networks · hypothesis · 0.778
Tested in Section 4.4 calibration experiment; confirmed by findings.
Probe-based data attribution bridges interpretability work (probes/activations) with data-centric alignment. · claim · 0.775
Conceptual framing: integrates mechanistic interpretability tools with alignment-focused data curation.
probably helps not only with faithful reconstruction but also creates interference patterns that encode nuanced information about the deltas and convergences between states. · quote · 0.773
Key quote connecting path redundancy to interferometric information encoding.
Can natural language explanations of activations generated through unsupervised reconstruction genuinely capture model cognition? · question · 0.771
Core research question motivating NLA development and validation through case studies and causal interventions.