question

active

question:multi-dimensional-linear-and-non-linear-interpretability-methods-have-not-been-benchmarked-on-causalgym

Multi-dimensional linear and non-linear interpretability methods have not been benchmarked on CausalGym

Identified gap in benchmark coverage; only 1D linear methods are benchmarked

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Papers (1)

paper

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

CausalGym covers only linguistic tasks; benchmarking interpretability methods on non-linguistic behaviours remains open · question · 0.841
Identified limitation calling for broader task coverage in future work
Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information · claim · 0.787
Central thesis of the paper
CausalGym only includes English data; comparable experiments with other languages might yield substantially different results · question · 0.777
Identified limitation/gap calling for cross-lingual extension of CausalGym
LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on likely performing poorly on anti-correlated datasets · claim · 0.776
Establishes that the observed linear structure is not merely a representation of text probability
The causal evaluation paradigm will continue to be useful for interpretability research regardless of which specific methods prevail · claim · 0.771
Forward-looking assertion in conclusion about the lasting value of causal evaluation
Automated interpretability using LLMs can usefully score feature specificity. · claim · 0.764
Claude 3 Opus ratings aligned with human judgment of feature descriptions.
As LLMs scale, they develop increasingly general abstractions, with large models linearly representing abstract concepts like truth that capture shared properties of diverse inputs · claim · 0.752
Interpretive claim connecting scale to abstraction level in LLM representations
Task accuracy on CausalGym increases consistently with model scale from 0.62 (14M) to 0.89 (6.9B) · finding · 0.752
Scaling result showing that larger Pythia models perform better on CausalGym linguistic tasks