Causal Intervention via Activation Shifting

Method of shifting hidden-state activations along probe directions so that the model treats false statements as true and vice versa; evaluated on out-of-distribution (OOD) inputs.

Neighborhood (ranked by edge count)

Papers (1)

paper

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
uses

Methods (1)

method

Causal Intervention via Activation Shift
same_as
Intervening in the model's forward pass by adding the probe direction to (or subtracting it from) the group (b) hidden states in order to flip the model's truth judgments
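The shift can be sketched on synthetic activations. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the probe direction is a difference of class means, a single layer, a scale `alpha = 1`, and that group (b) is simply a list of token indices. The helper names (`mass_mean_direction`, `shift_hidden_states`, `probe_says_true`) and all data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

def mass_mean_direction(acts, labels):
    # Assumed probe direction: mean activation of true (1) minus mean of false (0) statements.
    return acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)

def shift_hidden_states(hidden, positions, theta, sign=1, alpha=1.0):
    # hidden: (seq_len, d_model) activations for one layer (hypothetical single-layer setup).
    # positions: token indices standing in for the "group (b)" tokens.
    # sign=+1 pushes toward "true", sign=-1 pushes toward "false".
    shifted = hidden.copy()
    shifted[positions] += sign * alpha * theta
    return shifted

def probe_says_true(vec, theta, midpoint):
    # Toy linear readout: which side of the midpoint between class means the vector falls on.
    return float((vec - midpoint) @ theta) > 0.0

# Synthetic activations: true statements cluster near +truth_dir, false near -truth_dir.
truth_dir = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=200)
acts = np.outer(2 * labels - 1, truth_dir) + 0.3 * rng.normal(size=(200, d_model))

theta = mass_mean_direction(acts, labels)
midpoint = (acts[labels == 1].mean(axis=0) + acts[labels == 0].mean(axis=0)) / 2

# A 5-token "false" statement; its last two tokens play the role of group (b).
hidden = np.tile(-truth_dir, (5, 1))
group_b = [3, 4]
before = probe_says_true(hidden[4], theta, midpoint)
after = probe_says_true(shift_hidden_states(hidden, group_b, theta)[4], theta, midpoint)
print(before, after)  # False True
```

Adding `theta` moves a false-looking activation across the probe's decision boundary, and `sign=-1` moves it back. In a real model the edit would have to be applied inside the forward pass (for example with a hook) so that downstream layers see the shifted states; that wiring is omitted here.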

Hypotheses (1)

hypothesis

We hypothesize that group (b) hidden states store a representation of the statement's truth
supports
Motivating hypothesis for the remainder of the paper's analysis, following localization via activation patching

Claims (1)

claim

LLMs linearly represent truth-relevant information beyond the plausibility of text, as evidenced by probes trained on the likely dataset performing poorly on datasets where truth is anti-correlated with text likelihood
supports
Establishes that the observed linear structure is not merely a representation of text probability
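The inference in this claim can be illustrated with a toy model. Everything below is hypothetical and none of the numbers come from the paper: it assumes activations are the sum of two orthogonal directions, one for truth and one for text likelihood, plus noise, and it uses difference-of-means probes. A probe trained on likely-style data picks up only the likelihood direction and scores below chance when truth and likelihood are anti-correlated, whereas a probe trained on true/false data (with likelihood varying independently) still separates the classes.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n = 16, 400

# Hypothetical toy geometry: orthogonal "truth" and "likelihood" axes.
truth_dir = np.zeros(d_model); truth_dir[0] = 1.0
likely_dir = np.zeros(d_model); likely_dir[1] = 1.0

def make_acts(truth, likely):
    # Each binary feature contributes +/-2 along its own axis, plus isotropic noise.
    return (np.outer(2 * truth - 1, 2 * truth_dir)
            + np.outer(2 * likely - 1, 2 * likely_dir)
            + 0.5 * rng.normal(size=(len(truth), d_model)))

def fit_mass_mean(acts, labels):
    # Difference-of-means direction and the midpoint between the two class means.
    mu1, mu0 = acts[labels == 1].mean(axis=0), acts[labels == 0].mean(axis=0)
    return mu1 - mu0, (mu1 + mu0) / 2

def accuracy(probe, acts, labels):
    theta, mid = probe
    preds = ((acts - mid) @ theta > 0).astype(int)
    return float((preds == labels).mean())

# Likely-style training set: label is text likelihood; truth varies independently.
y = rng.integers(0, 2, size=n)
likely_train = make_acts(rng.integers(0, 2, size=n), y)
# True/false training set: label is truth; likelihood varies independently.
tf_labels = rng.integers(0, 2, size=n)
tf_train = make_acts(tf_labels, rng.integers(0, 2, size=n))
# Anti-correlated evaluation set: the true statements are the unlikely ones.
eval_truth = rng.integers(0, 2, size=n)
eval_acts = make_acts(eval_truth, 1 - eval_truth)

likely_probe = fit_mass_mean(likely_train, y)
truth_probe = fit_mass_mean(tf_train, tf_labels)
acc_likely = accuracy(likely_probe, eval_acts, eval_truth)
acc_truth = accuracy(truth_probe, eval_acts, eval_truth)
print(acc_likely, acc_truth)  # first well below 0.5, second close to 1.0
```

The toy only shows why the failure of likely-trained probes is informative: if truth and likelihood shared a single direction, no probe could separate true from false on the anti-correlated set.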

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.

Causal Intervention on Representations (concept, cosine 0.833)
The use of interventions (rather than correlations) to establish a causal link between representation geometry and behavioral geometry.
How do interventions on representations causally steer behavior? (question, cosine 0.793)
Core question motivating the shift from linear to geometry-aware steering; answered via manifold alignment analysis.
Path-Based Activation Intervention (method, cosine 0.789)
The general experimental approach of intervening along geometrically defined paths rather than with single-point or linear-direction interventions.
Causal Mediation (concept, cosine 0.788)
Whether an internal direction causally controls a target behavior, verified by intervention success.
Intervention Propagation (concept, cosine 0.783)
Property that additive modifications to activations affect all downstream computations, enabling tractable behavioral control.
causal shaping of behavior by representation geometry (concept, cosine 0.783)
Central question: does geometry in activation space causally determine behavior?
Causal Theories of Action (concept, cosine 0.782)
Traditional mechanistic accounts (Danto, Chisholm, Goldman) that Juarrero critiques as resting on outdated Newtonian causality.
Causal emergence can enable causal interventions to create better RL agents. (claim, cosine 0.781)
Assertion that understanding causal emergence may lead to methods for manipulating agent representations to improve performance.