question
active
What factors determine the generalisation of learned alignment maps beyond training data?
Open question about the gap between Theorem 1's existence proof and practical learnability
Source paper
extracted_from: (2025) · Sutter, Denis · Minder, Julian · Hofmann, Thomas · Pimentel, Tiago
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (1)
claim
- Authors' proposed criterion for meaningful causal abstraction
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Shows high IIA on random models depends on entity overlap; generalisation is essential for genuine interpretation
- Authors' interpretation of prompt variation results showing alignment faking disappears only when conflicting objective is removed
- Open methodological question acknowledged as limitation
- Broader research area: methods to align model behavior after initial training, where undesired behaviors can emerge.
- Distillation of why learning generalises.
- A promising property for interpretability analysis off-distribution.
- Hypothesis raised in distributive law task analysis