Principled Control via Intervention on Internals

The goal of mechanistically-grounded, reliable control of neural network behavior via activation interventions

Neighborhood — ranked by edge-count

Concepts (2)

concept

principled control via internals using geometry
related_to
Claim that geometry enables accurate intervention; steering moves from direction-finding to geometry-finding.
Manifold Steering
implements
Central framework: steering neural networks by intervening along the curved manifold where a concept lives, rather than in straight lines through activation space.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

steering (intervention on internals)concept0.828
General technique of modifying activations to control model behavior.
The intervention is the basic primitive of pyvene, specified with a dict-based format rather than expressed as code executed at runtimeclaim0.760
Design philosophy claim distinguishing pyvene's approach from prior libraries
Intentional Control of Internal Statesfinding0.759
Models can modulate their internal representations when instructed or incentivized to 'think about' a concept; effect replicates across all tested models regardless of capability.
Intentional control of internal representations likely piggybacks on existing mechanisms for talking about a topicclaim0.756
Mechanism speculation for the intentional control experiment.
Intervention Strength (Alpha)concept0.752
Scalar parameter modulating how strongly a steering vector shifts model activations; set to 15 for Exp1 and ±16 for Exp2
Subspace Interventionconcept0.748
Intervention targeting specific dimensional subsets of activation vectors rather than full representations
The relative amount of energy or effort used in an intervention compared to the change in a system's behavior formalizes the distinction between purely mechanical control and high-level agency.claim0.742
Proposed formalization of the spectrum from mechanical to cognitive control via energy-efficiency of intervention
Intervention Propagationconcept0.740
Property that additive modifications to activations affect all downstream computations, enabling tractable behavioral control