Data-Centric Alignment

Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.

Neighborhood — ranked by edge-count

paper

method

Probe-Based Data Attribution
cites
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
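
A minimal sketch of how a linear probe over activations might be used to rank training datapoints. This is an illustration under stated assumptions, not the cited paper's procedure: it swaps in a difference-of-means probe as the simplest linear classifier, and the function names, the two-dimensional toy activations, and the datapoint ids are all hypothetical.

```python
# Hypothetical sketch: a difference-of-means linear probe used to rank
# training datapoints. Names and toy data are illustrative assumptions,
# not taken from the cited paper.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_probe(flagged_acts, clean_acts):
    """Return (w, b) for a linear probe with score(x) = w.x + b.

    w points from the clean mean to the flagged mean; b places the
    decision boundary at the midpoint between the two class means.
    """
    mu_pos = mean_vector(flagged_acts)
    mu_neg = mean_vector(clean_acts)
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    mid = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b

def probe_score(w, b, act):
    """Signed probe score; positive means closer to the flagged class."""
    return sum(wi * ai for wi, ai in zip(w, act)) + b

def rank_training_points(w, b, train_acts):
    """Sort training datapoint ids by probe score, highest first."""
    scored = [(probe_score(w, b, act), idx) for idx, act in train_acts.items()]
    return [idx for _, idx in sorted(scored, reverse=True)]

# Toy activations: "flagged" come from outputs showing the undesired behavior.
flagged = [[1.0, 0.0], [0.8, 0.2]]
clean = [[0.0, 1.0], [0.2, 0.8]]
w, b = fit_probe(flagged, clean)

# Activations recorded on candidate training datapoints (hypothetical ids).
train_acts = {"ex_a": [0.9, 0.1], "ex_b": [0.1, 0.9], "ex_c": [0.5, 0.5]}
ranking = rank_training_points(w, b, train_acts)
print(ranking)
```

Datapoints at the top of the ranking are the ones whose activations look most like the flagged class, which makes them candidates for inspection rather than confirmed causes.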

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment · concept · 0.833
The goal of making model behavior match human values and intentions, often addressed during post-training.
Data-centric machine learning · concept · 0.803
Approach emphasizing data quality and source identification rather than only model architecture changes.
Representational Alignment · concept · 0.791
Measure of similarity between the similarity structures (kernels) induced by two different representations.
Distributed Alignment Search · method · 0.777
The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
AI alignment · concept · 0.766
Field within which this work has implications for evaluating alignment progress.
Alignment Function · concept · 0.763
A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables.
Identity Alignment Map (ϕ_id) · method · 0.761
Simplest alignment map ϕ(h)=h, equivalent to assuming the privileged bases hypothesis.
Alignment Map (ϕ) · concept · 0.759
The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied.
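
The selection rule for this list (cosine ≥ 0.65, no typed edge) could be sketched as follows. This is an illustration only: the embedding vectors, the "Unrelated Entity" entry, and the function names are made up for the example, and the page does not say how its embeddings or scores are actually computed.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities with cosine >= threshold to `target` and no typed edge to it."""
    linked = typed_edges.get(target, set())
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and already-linked entities
        sim = cosine(embeddings[target], vec)
        if sim >= threshold:
            hits.append((name, round(sim, 3)))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Made-up 3-d embeddings, for illustration only.
embeddings = {
    "Data-Centric Alignment": [1.0, 1.0, 0.0],
    "Alignment": [1.0, 0.8, 0.1],
    "Probe-Based Data Attribution": [0.9, 1.0, 0.0],
    "Unrelated Entity": [0.0, 0.1, 1.0],
}
# Typed relations already recorded for the target entity.
typed_edges = {"Data-Centric Alignment": {"Probe-Based Data Attribution"}}

neighbors = untyped_neighbors("Data-Centric Alignment", embeddings, typed_edges)
print(neighbors)
```

In this toy setup "Probe-Based Data Attribution" is excluded despite its high similarity because it already has a typed edge, which is the distinction the section draws between linked neighbors and merely similar ones.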