concept
active
concept:causal-abstraction

Causal abstraction

A framework that the paper uses alongside feature geometry to deepen mechanistic understanding of language models.

Neighborhood — ranked by edge-count

Thinkers (1)

thinker

Frameworks (2)

framework
  • Practical method by Geiger et al. for finding distributed causal abstractions using gradient descent.
  • Novel variant of the CL loss, introduced in this paper, that targets only the causal subspace dimensions to improve OOD performance.
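
The idea of a distributed causal abstraction, where a high-level variable lives in a rotated subspace rather than in individual neurons, can be illustrated with a minimal sketch. The function name `distributed_interchange`, the random orthogonal matrix, and the subspace size `k` are all assumptions made for illustration. In the method described above the rotation would be learned by gradient descent, whereas here it is fixed.

```python
import numpy as np

def distributed_interchange(h_base, h_source, rotation, k):
    """Swap the first k coordinates of the rotated representation.

    rotation is assumed to be an orthogonal (d, d) matrix, so its
    transpose is its inverse.
    """
    z_base = rotation @ h_base        # base activations in the rotated basis
    z_source = rotation @ h_source    # source activations in the rotated basis
    z_patched = z_base.copy()
    z_patched[:k] = z_source[:k]      # intervene only on the candidate subspace
    return rotation.T @ z_patched     # map back to the neuron basis

# Hypothetical toy data: a random orthogonal rotation and two hidden vectors.
rng = np.random.default_rng(0)
d = 4
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
h_base = rng.normal(size=d)
h_source = rng.normal(size=d)

patched = distributed_interchange(h_base, h_source, rotation, k=2)
```

With the identity matrix as the rotation, this reduces to swapping the first `k` neurons directly, which is the localist special case.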

Methods (5)

method
  • The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
  • Fundamental operation for causal abstraction analysis; forces neurons to take values from source inputs to create counterfactuals.
  • Training technique that induces specific causal structures in neural networks by co-training with interchange interventions.
  • Baseline method that exhaustively searches discrete spaces of localist alignments between high-level variables and neuron groups.
  • The formal method used to establish that the identified circuit causally mediates the model's cyclic reasoning behavior.
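
The interchange intervention listed above can be sketched on a toy two-layer network. This is an illustrative sketch only: the network, its random weights, and the helper names (`hidden`, `output`, `interchange_intervention`) are assumptions and are not taken from any of the papers referenced here.

```python
import numpy as np

def hidden(x, w1):
    """Hidden activations of a toy one-hidden-layer network."""
    return np.tanh(w1 @ x)

def output(h, w2):
    """Linear readout from the hidden layer."""
    return w2 @ h

def interchange_intervention(base, source, w1, w2, neurons):
    """Run the network on `base`, but force the chosen hidden neurons
    to take the values they have when the network runs on `source`."""
    idx = np.asarray(neurons, dtype=int)
    h_base = hidden(base, w1)
    h_source = hidden(source, w1)
    h_patched = h_base.copy()
    h_patched[idx] = h_source[idx]    # overwrite with source values
    return output(h_patched, w2)      # counterfactual output

# Hypothetical weights and inputs, chosen only to make the sketch runnable.
rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=(2, 4))
base = rng.normal(size=3)
source = rng.normal(size=3)

counterfactual = interchange_intervention(base, source, w1, w2, neurons=[0, 2])
```

Patching every hidden neuron reproduces the output on the source input, and patching none reproduces the output on the base input. The cases in between are the ones that carry information about which neurons mediate which variable.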

Concepts (5)

concept
  • Formal definition: H is a constructive abstraction of L under alignment Π when interchange interventions have equivalent effects at both levels.
  • Graded notion of causal abstraction measured by interchange intervention accuracy (IIA); when IIA is α < 100%, the model is α-on-average approximately abstract.
  • The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper.
  • The bijective function mapping DNN inner neurons to latent variables in causal abstraction; its complexity is the central variable studied.
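
As a worked illustration of how IIA grades an alignment, the sketch below compares a high-level model with a hand-built low-level model over every base and source pair. The whole setup is hypothetical: the high-level model computes `V * b` with the intermediate variable `V = a`, and the low-level model stores `a` in hidden unit 0 and `b` in hidden unit 1.

```python
from itertools import product

def high_level(a, b, v_override=None):
    """High-level model: V = a, output = V * b. V can be intervened on."""
    v = a if v_override is None else v_override
    return v * b

def low_hidden(x):
    """Low-level hidden state: unit 0 holds a, unit 1 holds b."""
    return [x[0], x[1]]

def low_level(x, patch=None):
    """Low-level model; `patch` is an optional (unit index, value) pair."""
    h = low_hidden(x)
    if patch is not None:
        idx, value = patch
        h[idx] = value
    return h[0] * h[1]

def iia(neuron):
    """Fraction of (base, source) pairs on which the two models agree
    after an interchange intervention, with V aligned to `neuron`."""
    inputs = list(product([0, 1], repeat=2))
    hits = total = 0
    for base, source in product(inputs, repeat=2):
        high = high_level(base[0], base[1], v_override=source[0])
        low = low_level(base, patch=(neuron, low_hidden(source)[neuron]))
        hits += int(high == low)
        total += 1
    return hits / total

correct_alignment = iia(0)  # V aligned to the unit that stores a
wrong_alignment = iia(1)    # V aligned to the unit that stores b
```

In this toy setting the correct alignment reaches an IIA of 1.0, while the wrong alignment agrees on only 10 of the 16 pairs (0.625), so it would count as an approximate abstraction at best.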

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.