concept
active
concept:principled-control-via-intervention-on-internals
Principled Control via Intervention on Internals
The goal of mechanistically grounded, reliable control of neural network behavior via activation interventions.
Neighborhood — ranked by edge-count
Concepts (2)
- Claim that geometry enables accurate intervention; steering moves from direction-finding to geometry-finding.
- Manifold Steering · implements · Central framework: steering neural networks by intervening along the curved manifold where a concept lives, rather than in straight lines through activation space.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- General technique of modifying activations to control model behavior.
- Design philosophy claim distinguishing pyvene's approach from prior libraries
- Models can modulate their internal representations when instructed or incentivized to 'think about' a concept; effect replicates across all tested models regardless of capability.
- Mechanism speculation for the intentional control experiment.
- Scalar parameter modulating how strongly a steering vector shifts model activations; set to 15 for Exp1 and ±16 for Exp2
- Intervention targeting specific dimensional subsets of activation vectors rather than full representations
- Proposed formalization of the spectrum from mechanical to cognitive control via energy-efficiency of intervention
- Property that additive modifications to activations affect all downstream computations, enabling tractable behavioral control
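
Several entries above (the scalar strength parameter, dimension-subset intervention, and additive modification of activations) can be illustrated with a small sketch. This is a toy example under stated assumptions, not the method of any listed entity: the function name `steer`, the vectors, and the `dims` argument are hypothetical, and the strength 15.0 only mirrors the Exp1 value quoted above.

```python
import math

def steer(activation, direction, alpha, dims=None):
    """Additively shift one activation vector along a unit-normalized direction.

    alpha: scalar steering strength (hypothetical name).
    dims:  optional indices; if given, only those dimensions are shifted.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    keep = set(range(len(direction))) if dims is None else set(dims)
    return [a + (alpha * u if i in keep else 0.0)
            for i, (a, u) in enumerate(zip(activation, unit))]

# Toy values, chosen only for illustration.
activation = [0.5, -1.0, 2.0, 0.0]
direction = [3.0, 0.0, 4.0, 0.0]  # assumed concept direction

full = steer(activation, direction, alpha=15.0)               # shift every dimension
partial = steer(activation, direction, alpha=15.0, dims=[0])  # shift dimension 0 only
```

In a real model the shifted vector would replace the original activation at some layer, so every downstream computation sees the change; this sketch covers only the arithmetic of the shift.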