method
active
method:causal-intervention-via-activation-shifting
Causal Intervention via Activation Shifting
Method that shifts hidden-state activations along probe directions so that the model treats false statements as true and vice versa; evaluated on out-of-distribution (OOD) inputs
Neighborhood — ranked by edge-count
Papers (1)
paper
Methods (1)
method
- Intervening in the model's forward pass by adding the probe direction to, or subtracting it from, the group (b) hidden states to flip truth judgments
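
The shift described above can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the vectors, the step size `alpha`, and the helper names `normalize`, `probe_score`, and `shift_activation` are all hypothetical, and a toy 3-dimensional vector stands in for a real hidden state. In an actual model the same update would presumably be applied to the selected token positions during the forward pass.

```python
import math

def normalize(v):
    """Return v scaled to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("probe direction must be non-zero")
    return [x / norm for x in v]

def probe_score(hidden, direction):
    """Linear probe logit: positive is read as 'true', negative as 'false'."""
    return sum(h * d for h, d in zip(hidden, direction))

def shift_activation(hidden, direction, alpha):
    """Add alpha times the unit probe direction to a hidden state.

    A positive alpha pushes toward 'true'; a negative alpha pushes toward 'false'.
    """
    unit = normalize(direction)
    return [h + alpha * u for h, u in zip(hidden, unit)]

# Toy values, assumed for illustration only.
probe_direction = [0.6, 0.8, 0.0]   # stands in for a learned truth direction
hidden_false = [-1.0, -0.5, 0.3]    # stands in for a false statement's hidden state

before = probe_score(hidden_false, probe_direction)
shifted = shift_activation(hidden_false, probe_direction, alpha=2.0)
after = probe_score(shifted, probe_direction)

print(f"probe score before: {before:.3f}, after: {after:.3f}")
```

Here the probe score changes sign after the shift, which is the toy analogue of flipping a truth judgment. Whether the model's downstream behavior also flips is the empirical question the intervention tests.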
Hypotheses (1)
hypothesis
- Motivating hypothesis driving the remainder of the paper's analysis after patching localization
Claims (1)
claim
- Establishes that the observed linear structure is not merely a representation of text probability
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- The use of interventions (rather than correlations) to establish a causal link between representation geometry and behavioral geometry.
- Core question motivating the shift from linear to geometry-aware steering; answered via manifold alignment analysis.
- The general experimental approach of intervening along geometrically-defined paths rather than single-point or linear-direction interventions
- Whether an internal direction causally controls a target behavior, verified by intervention success
- Property that additive modifications to activations affect all downstream computations, enabling tractable behavioral control
- Central question: does geometry in activation space causally determine behavior?
- Traditional mechanistic accounts (Danto, Chisholm, Goldman) that Juarrero critiques as resting on outdated Newtonian causality.
- Assertion that understanding causal emergence may lead to methods for manipulating agent representations to improve performance.