Causal Intervention via Activation Shift

Intervenes in the model's forward pass by adding the probe direction to, or subtracting it from, the group (b) hidden states in order to flip truth judgments
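The shift itself is plain vector arithmetic, which a minimal sketch can make concrete. Everything named here is an assumption for illustration: the helper names (`shift_along_probe`, `probe_says_true`), the three-dimensional toy vectors, the unit-normalisation of the direction, and the scale `alpha` do not come from the source entry. A real intervention would apply the same update to the selected hidden states inside the model's forward pass.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    # Normalise so that alpha is measured in units of distance along the probe.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def shift_along_probe(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction); a negative alpha subtracts."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]

def probe_says_true(hidden, direction, bias=0.0):
    """Toy linear truth probe (hypothetical): a positive score reads as 'true'."""
    return dot(hidden, direction) + bias > 0

# Hypothetical toy values, not taken from any real model.
probe_direction = [2.0, 0.0, 0.0]
hidden_false = [-1.0, 0.5, 0.3]  # the toy probe reads this state as 'false'

# Add the direction to push the state across the probe's decision boundary.
shifted = shift_along_probe(hidden_false, probe_direction, alpha=3.0)

print(probe_says_true(hidden_false, probe_direction))
print(probe_says_true(shifted, probe_direction))
```

Flipping the probe's own reading, as this toy does, is the easy part. The causal claim rests on whether the model's downstream truth judgment changes after the shift, which this sketch does not measure.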

Neighborhood — ranked by edge-count

paper

method

Causal Intervention via Activation Shifting
same_as
Method of shifting hidden-state activations along probe directions to cause the model to treat false statements as true and vice versa; evaluated on out-of-distribution (OOD) inputs

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Causal Intervention on Representations · concept · 0.825
The use of interventions (rather than correlations) to establish a causal link between representation geometry and behavioral geometry.
Causal Theories of Action · concept · 0.790
Traditional mechanistic accounts (Danto, Chisholm, Goldman) that Juarrero critiques as resting on outdated Newtonian causality.
causal bypassing · concept · 0.790
Confound where naming injected concepts reflects direct logit effects rather than metacognitive awareness, raised by Morris & Plunkett
Path-Based Activation Intervention · method · 0.787
The general experimental approach of intervening along geometrically-defined paths rather than single-point or linear-direction interventions
How do interventions on representations causally steer behavior? · question · 0.770
Core question motivating the shift from linear to geometry-aware steering; answered via manifold alignment analysis.
Causal Mediation · concept · 0.769
Whether an internal direction causally controls a target behavior, verified by intervention success
Causal Emergence · concept · 0.767
Core concept: degree to which an agent exerts unique predictive power on its future; key to cognition at all scales.
Activation Correlation · method · 0.766
Pearson correlation of feature activations across 40M tokens used to measure feature similarity and universality across models