claim

active

claim:generalisation-of-alignment-maps-to-unseen-inputs-is-fundamental-to-interpreting-a-model-distinguishing-genuine-understanding-from-memorisation

Generalisation of alignment maps to unseen inputs is fundamental to interpreting a model, distinguishing genuine understanding from memorisation

Authors' proposed criterion for meaningful causal abstraction

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Papers (1)

paper

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
introduces

Findings (1)

finding

When the training and test sets use completely disjoint name sets in the IOI task, alignment maps fail to generalise, even with a complex ϕ_nonlin on randomly initialised models
supports
Shows that high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation

Questions (1)

question

What factors determine the generalisation of learned alignment maps beyond training data?
gates
Open question about the gap between Theorem 1's existence proof and practical learnability

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment with vision models corresponds to improved performance on downstream language tasks including commonsense reasoning and math · claim · 0.753
Claims that alignment score is a proxy for general capability
The effect of alignment map ϕ complexity on IIA in causal abstraction is an analogue of the probing complexity–accuracy trade-off · claim · 0.752
Authors connect their finding to the prior probing literature debate
"Mere prediction in this way does not eliminate the possibility of alignment faking, as a model predicting how a human might reason through alignment faking would still be itself faking alignment." · quote · 0.747
Key philosophical point ruling out the objection that alignment faking is just token prediction
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.744
Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.743
Motivating hypothesis for Section 5's investigation of prompt template effects.
Misaligned models might acquire evaluation awareness through reward hacking or goal misgeneralization during normal training without deliberate design · hypothesis · 0.742
Motivation for the two-stage training design; links the model organism to plausible natural emergence.
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect · hypothesis · 0.742
Extrapolation from scale-emergence finding to future risk
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.739
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence