question

active

question:what-factors-determine-the-generalisation-of-learned-alignment-maps-beyond-training-data

What factors determine the generalisation of learned alignment maps beyond training data?

Open question about the gap between Theorem 1's existence proof and practical learnability

Source paper

extracted_from

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

(2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago

Neighborhood — ranked by edge-count

Papers (1)

paper

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
associated_with

Claims (1)

claim

Generalisation of alignment maps to unseen inputs is fundamental to interpreting a model, distinguishing genuine understanding from memorisation
gates
Authors' proposed criterion for meaningful causal abstraction

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

When training and test sets use completely disjoint name sets in IOI task, alignment maps fail to generalise even with complex ϕ_nonlin on randomly initialised models · finding · 0.800
Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
The conflict between the model's existing preferences and the stated training objective is the key driver of alignment faking in this setup · claim · 0.773
Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
What is the appropriate metric for measuring representational alignment, given active debate on merits and deficiencies of all proposed measures? · question · 0.764
Open methodological question acknowledged as limitation
Post-training alignment · concept · 0.761
Broader research area: methods to align model behavior after initial training, where undesired behaviors can emerge.
What makes learning systems smart is that the parameters they adjust and the data to which they fit are not in the same space. · claim · 0.757
Distillation of why learning generalises.
SAE features generalize to images despite training only on text, indicating out-of-distribution robustness. · claim · 0.755
A promising property for interpretability analysis off-distribution.
Probe-based data attribution for alignment · concept · 0.753
The And-Or algorithm may not be a true abstraction of the trained MLP's behaviour since it never achieves high IIA in later layers regardless of alignment map complexity · hypothesis · 0.753
Hypothesis raised in distributive law task analysis