finding

active

finding:l2-regularisation-with-bias-term-delivers-best-probe-performance-l2-regularisation-increases-probe-selectivity

L2 regularisation with bias term delivers best probe performance; L2 regularisation increases probe selectivity

Hyperparameter tuning result for probes; consistent with the finding of Hewitt and Liang (2019)
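As a rough illustration of the configuration this finding refers to, the sketch below fits a linear probe with an L2 penalty and a bias term. It is a minimal sketch on synthetic data, not the source paper's setup: the helper names (`train_probe`, `accuracy`), the parameters (`l2`, `use_bias`, `lr`, `steps`), their values, and the data itself are all assumptions made for illustration.

```python
import numpy as np

def train_probe(X, y, l2=0.0, use_bias=True, lr=0.1, steps=500):
    """Fit a binary logistic-regression probe by plain gradient descent.

    l2 is the strength of the L2 penalty on the weights; the bias is not
    penalised. All names and defaults here are hypothetical.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        w -= lr * (X.T @ (p - y) / n + l2 * w)   # loss gradient plus L2 term
        if use_bias:
            b -= lr * np.mean(p - y)             # bias gradient, unpenalised
    return w, b

def accuracy(X, y, w, b):
    """Fraction of examples whose predicted label matches y."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

# Synthetic stand-in for model activations (assumption: 16-dim features,
# labels defined by a random direction plus an offset a bias can absorb).
rng = np.random.default_rng(0)
direction = rng.normal(size=16)
X = rng.normal(size=(400, 16))
y = (X @ direction + 0.5 > 0).astype(float)

w_plain, b_plain = train_probe(X, y, l2=0.0)   # no regularisation
w_reg, b_reg = train_probe(X, y, l2=0.1)       # L2-regularised, with bias

print("plain:", accuracy(X, y, w_plain, b_plain), np.linalg.norm(w_plain))
print("L2:   ", accuracy(X, y, w_reg, b_reg), np.linalg.norm(w_reg))
```

The only difference between the two probes is the `l2 * w` term in the weight update, which shrinks the weight norm. Measuring selectivity in the sense of Hewitt and Liang (2019) would additionally require a control task, which this sketch does not include.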

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Questions (1)

question

Why does L2 regularisation increase probe causal efficacy (selectivity)?
supports
Open question identified in hyperparameter tuning experiments, left for future work

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types · finding · 0.770
Larger models linearly represent more general concepts including truth
The L_retain regularization objective is empirically effective at preserving unrelated model capabilities during cone training · claim · 0.759
Interpretation of low KL divergence results as validation of the training objective
Larger models should amplify bias less than smaller models, with model biases more accurately reflecting data biases rather than exacerbating them · claim · 0.755
Implication of PRH for AI fairness and bias
A probe may achieve high performance even on representations that are not causally relevant for the task · claim · 0.746
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
The target vs. off-target probe area metric quantifies steering selectivity and distinguishes selectively steerable from entangled interventions. · claim · 0.746
Justification for the novel metric introduced in the paper
We hypothesize that degraded generalization on benchmarks like MMLU may reflect the computational demands of the tasks. · hypothesis · 0.745
Connecting the paper's task-difficulty findings to prior observations of weak generalization on complex QA benchmarks.
Probe-based ranking outperforms gradient-based and LLM-judge methods at lower cost · claim · 0.744
Authors' claim that their approach is both more effective in reduction and cheaper than prior methods.
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases · finding · 0.744
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics