Probing Complexity–Accuracy Trade-off

A longstanding debate in the probing literature over whether complex probes reveal genuine encodings or merely memorise the task; this paper revives it for causal abstraction

Neighborhood — ranked by edge-count

Claims (1)

claim

The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off
extends
Authors connect their finding to the prior probing literature debate

Concepts (1)

concept

Non-Linear Representation Dilemma
analogous_to
Core contribution: the impasse where lifting the linearity constraint on alignment maps makes causal abstraction vacuous, but keeping it may miss non-linearly encoded features

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Accuracy Criterion · concept · 0.744
Criterion requiring that a model's description of its internal state be accurate, distinguishing genuine introspection from confabulation.
The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.741
Hypothesis explaining negative correlation between reflection rate and accuracy without implying reflection is harmful
Are high-accuracy probe representations also causally relevant for the task? · question · 0.741
Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.732
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
A probe achieving high classification accuracy provides no guarantee that the model actually distinguishes those classes in downstream computations · claim · 0.730
Motivation for causal evaluation over purely behavioural probing accuracy
Probing Methods · method · 0.726
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
Accuracy does not vary linearly with latent reflection directions; instead it follows a more non-linear mapping that requires deeper theoretical treatment · claim · 0.725
Theoretical limitation identified by the authors distinguishing reflection from stylistic tasks.
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.724
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
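
A listing like the one above (cosine ≥ 0.65, no typed edge) could be produced roughly as follows. This is a minimal sketch, not the page's actual implementation: the embedding vectors, the function names `cosine` and `related_by_similarity`, and the toy data are all hypothetical. Only the 0.65 threshold and the exclusion of typed neighbours come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target, embeddings, typed_neighbours, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, highest first.

    Entities that already share a typed edge with the target are skipped,
    so only candidates for new edges or unrecognized duplicates remain.
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbours:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-d embeddings, chosen only to make the arithmetic easy to follow.
target = "Probing Complexity-Accuracy Trade-off"
embeddings = {
    target: [1.0, 0.0],
    "Non-Linear Representation Dilemma": [0.9, 0.1],  # similar, but has a typed edge
    "Accuracy Criterion": [0.8, 0.6],                 # similar, no typed edge
    "Unrelated Entity": [0.0, 1.0],                   # below the threshold
}
typed_neighbours = {"Non-Linear Representation Dilemma"}

result = related_by_similarity(target, embeddings, typed_neighbours)
print(result)
```

In this toy setup only "Accuracy Criterion" survives: the dilemma concept is filtered out because it already has a typed edge, and the unrelated entity falls below the threshold.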