concept
active
concept:data-centric-alignment

Data-Centric Alignment

Alignment approach that focuses on curating or modifying training data; the paper bridges this with interpretability methods.

Neighborhood — ranked by edge-count

Methods (1)

method
  • Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Alignment · concept · 0.833
    The goal of making model behavior match human values and intentions, often addressed during post-training.
  • Approach emphasizing data quality and source identification rather than only model architecture changes.
  • Measure of similarity between the similarity structures (kernels) induced by two different representations.
  • The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
  • AI alignment · concept · 0.766
    Field within which this work has implications for evaluating alignment progress.
  • Alignment Function · concept · 0.763
    A learnable invertible transformation in DAS that maps neural representations to a basis aligned with causal variables.
  • The simplest alignment map, ϕ(h) = h, which is equivalent to assuming the privileged bases hypothesis.
  • Alignment Map (ϕ) · concept · 0.759
    The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied.