method
active
method:causal-intervention-via-activation-shift
Causal Intervention via Activation Shift
Intervening in the model's forward pass by adding the probe direction to, or subtracting it from, the group (b) hidden states in order to flip the model's truth judgments
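The shift described above can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the toy hidden states, the random `direction` standing in for a trained probe direction, the step size `alpha`, and the helper names `shift_along_probe` and `probe_says_true`. It only shows a linear probe's readout flipping after the shift; in the actual method the shift is applied to hidden states inside the model's forward pass and the effect is measured on the model's own truth judgments.

```python
import numpy as np

def shift_along_probe(hidden, direction, alpha):
    """Shift each hidden state by alpha along the unit-normalized probe direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def probe_says_true(hidden, direction, bias=0.0):
    """Linear probe readout: True where the projection onto the direction is positive."""
    return hidden @ direction + bias > 0

rng = np.random.default_rng(0)
dim = 16
direction = rng.normal(size=dim)  # hypothetical stand-in for a trained probe direction
unit = direction / np.linalg.norm(direction)

# Toy hidden states: noise orthogonal to the probe direction, plus a signed
# offset of 2.0 along it ("true" on the positive side, "false" on the negative).
noise = rng.normal(size=(4, dim))
noise -= np.outer(noise @ unit, unit)  # remove any component along the direction
true_states = noise[:2] + 2.0 * unit
false_states = noise[2:] - 2.0 * unit

# Add the direction to false statements and subtract it from true ones.
shifted_false = shift_along_probe(false_states, direction, alpha=4.0)
shifted_true = shift_along_probe(true_states, direction, alpha=-4.0)

print(probe_says_true(false_states, direction), probe_says_true(shifted_false, direction))
print(probe_says_true(true_states, direction), probe_says_true(shifted_true, direction))
```

The step size is chosen larger than the toy offset so that each state crosses the probe's decision boundary; in a real model the appropriate magnitude would have to be determined empirically.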
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (1)
method
- Method of shifting hidden state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on OOD inputs
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The use of interventions (rather than correlations) to establish a causal link between representation geometry and behavioral geometry.
- Traditional mechanistic accounts (Danto, Chisholm, Goldman) that Juarrero critiques as resting on outdated Newtonian causality.
- Confound where naming injected concepts reflects direct logit effects rather than metacognitive awareness, raised by Morris & Plunkett
- The general experimental approach of intervening along geometrically-defined paths rather than single-point or linear-direction interventions
- Core question motivating the shift from linear to geometry-aware steering; answered via manifold alignment analysis.
- Whether an internal direction causally controls a target behavior, verified by intervention success
- Core concept: degree to which an agent exerts unique predictive power on its future; key to cognition at all scales.
- Pearson correlation of feature activations across 40M tokens used to measure feature similarity and universality across models