concept
active
concept:causal-mediation-of-model-outputs
Causal Mediation of Model Outputs
The extent to which intervening on a probe direction actually changes model outputs, as opposed to how accurately that direction classifies the target property
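The contrast between probe accuracy and causal effect can be illustrated with a toy sketch. It assumes a two-dimensional activation space and a linear readout; every name here (`w_out`, `probe_accuracy`, `intervention_effect`) is hypothetical and not taken from any particular library or from the source. One direction is highly decodable but ignored by the readout, while the other is both decodable and actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration): a binary concept z is encoded
# along BOTH activation dimensions, but the readout only uses dimension 0.
n = 1000
z = rng.integers(0, 2, size=n)
h = np.stack(
    [z + 0.1 * rng.standard_normal(n),        # dim 0: read by the model
     3.0 * z + 0.1 * rng.standard_normal(n)], # dim 1: decodable but unused
    axis=1,
)
w_out = np.array([1.0, 0.0])  # hypothetical linear readout


def model_output(acts):
    """Scalar model output for each row of activations."""
    return acts @ w_out


def probe_accuracy(acts, labels, direction):
    """Accuracy of thresholding the projection onto `direction`."""
    proj = acts @ direction
    threshold = 0.5 * (proj[labels == 1].mean() + proj[labels == 0].mean())
    return float(np.mean((proj > threshold) == (labels == 1)))


def intervention_effect(acts, direction, alpha=1.0):
    """Mean change in output after shifting activations along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return float(np.mean(model_output(acts + alpha * unit) - model_output(acts)))


used = np.array([1.0, 0.0])
unused = np.array([0.0, 1.0])

acc_used = probe_accuracy(h, z, used)
acc_unused = probe_accuracy(h, z, unused)
effect_used = intervention_effect(h, used)
effect_unused = intervention_effect(h, unused)

print(f"used direction:   accuracy={acc_used:.3f}, effect={effect_used:.3f}")
print(f"unused direction: accuracy={acc_unused:.3f}, effect={effect_unused:.3f}")
```

In this sketch both directions classify the concept almost perfectly, yet shifting activations along the unused direction leaves the output unchanged. Classification accuracy alone would not separate the two; the intervention does.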
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Causal Mediation · related_to · Whether an internal direction causally controls a target behavior, verified by intervention success
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Framework for evaluating whether probe directions are causally implicated in model outputs via activation patching
- Formal representation of algorithms as directed acyclic graphs computing functions f_A
- Nonlinear models of brain dynamics that can be inverted via DEM.
- The use of interventions (rather than correlations) to establish a causal link between representation geometry and behavioral geometry.
- Central question: does geometry in activation space causally determine behavior?
- A model mapping outputs to expected inputs, used in motor control and perception for embodiment.
- Consists of input, intermediate, and output variables with associated causal mechanisms; the mathematical object central to DAS.
- Attention restricted to previous tokens only, as in decoder-only models; leads to AR(ω)-like behaviour and no ordered phase