Why does L2 regularisation increase probe causal efficacy (selectivity)?

Open question identified in the hyperparameter tuning experiments and left for future work

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Papers (1)

paper

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
associated_with

Findings (1)

finding

L2 regularisation with bias term delivers best probe performance; L2 regularisation increases probe selectivity
supports
Hyperparameter tuning result for probes; consistent with the finding of Hewitt and Liang (2019)

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Are high-accuracy probe representations also causally relevant for the task? · question · 0.749
Question raised by the discrepancy between DAS IIA and linear probe accuracy in Case Study II
Probe-based data attribution effectively reduces harmful behaviors via data interventions · claim · 0.740
Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.738
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Feature attribution correlates well with ablation effects, making it an efficient proxy for causal effect · claim · 0.737
Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.
Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions · finding · 0.737
Core result showing MM is superior to LR for causal implication despite similar classification accuracy
The target vs. off-target probe area metric quantifies steering selectivity and distinguishes selectively steerable from entangled interventions · claim · 0.736
Justification for the novel metric introduced in the paper
Direct probes over learned activations in the standard basis may fail to reveal the actual causal role of representations because they are highly distributed · claim · 0.736
Supported by the finding that non-trivial rotations are required to find aligned representations.
Why were interventions with mass-mean probe directions extracted from the likely dataset so effective, despite these probes not being accurate at classifying true/false statements? · question · 0.733
Open question raised in §7.1 about an unexplained anomalous result