framework
active
framework:causalgym

CausalGym

Multi-task benchmark of linguistic behaviours for measuring the causal efficacy of interpretability methods, adapted from SyntaxGym

Neighborhood — ranked by edge-count

Methods (9)

method
  • The core method introduced in this paper: finds alignments between high-level causal variables and distributed neural representations via gradient descent.
  • Statistical method used to analyze neural activity data.
  • Core intervention method used throughout CausalGym; operates on a one-dimensional, non-basis-aligned subspace of activation space
  • Method for extracting linear directions by subtracting mean activations of contrastive groups; used to define the Assistant Axis
  • Used to evaluate representation quality across VTAB tasks
  • Unsupervised feature-finding method using cluster centroid difference as feature direction
  • IID mass-mean probing coincides with LDA when covariance is known; used to derive the corrected probe formula
  • Adapted control task metric measuring difference between odds-ratio on original task and arbitrary-label control task
  • Primary evaluation metric measuring causal effect of interventions; greater value indicates larger causal effect
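
Two of the method descriptions above (an intervention on a one-dimensional, non-basis-aligned subspace, and a direction obtained by subtracting the mean activations of contrastive groups) can be illustrated together. The following is a minimal sketch, not CausalGym's implementation: the function names `diff_of_means_direction` and `interchange_1d` are hypothetical, activations are plain Python lists, and the intervention is assumed to swap only the component of the base activation along a unit direction with the corresponding component of the source activation.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Rescale v to unit length; a zero vector has no direction."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]

def mean(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means_direction(pos, neg):
    """Unit direction from the mean of `neg` to the mean of `pos`
    (hypothetical helper; assumes two contrastive groups of activations)."""
    return unit([p - q for p, q in zip(mean(pos), mean(neg))])

def interchange_1d(base, source, direction):
    """Replace the component of `base` along `direction` with that of `source`.
    Everything orthogonal to `direction` is left unchanged (assumed behaviour)."""
    u = unit(direction)
    coeff = dot(source, u) - dot(base, u)
    return [b + coeff * ui for b, ui in zip(base, u)]

# Toy activations: the two groups differ only along the first coordinate.
pos = [[1.0, 0.0], [3.0, 0.0]]
neg = [[-1.0, 0.0], [-3.0, 0.0]]
u = diff_of_means_direction(pos, neg)                 # [1.0, 0.0]
patched = interchange_1d([0.5, 2.0], [-1.5, 7.0], u)
print(patched)                                        # [-1.5, 2.0]
```

In the toy example the patched vector takes its first coordinate from the source and keeps its second coordinate from the base, which is the sense in which the intervention touches only a one-dimensional subspace. A learned direction would not generally coincide with a coordinate axis; the axis-aligned case is used here only to make the arithmetic easy to check.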

Frameworks (2)

framework
  • Benchmarking paradigm using minimally-different grammatical sentence pairs to test LM linguistic competence
  • SyntaxGym
    extends
    Online platform for targeted evaluation of language models that CausalGym adapts

Datasets (1)

dataset
  • Suite of 10 language models from 14M to 12B parameters, trained on the same data in the same order; used for all experiments

Artifacts (1)

artifact

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.