method
active
method:same-concept-steering
Same-concept steering
Steering along the same concept direction that is being measured, to test whether shifts in internal state causally affect the model's report of that state.
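A minimal sketch of the idea on toy data, not the method's actual implementation. It assumes a concept direction has already been extracted, and it replaces the model's self-report with a hypothetical stand-in (`toy_self_report`, a squashed projection onto the concept). All names, dimensions, and steering strengths here are illustrative assumptions.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalised concept direction to a hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def toy_self_report(hidden, direction):
    """Hypothetical stand-in for a self-report: tanh of the projection onto the concept."""
    unit = direction / np.linalg.norm(direction)
    return float(np.tanh(hidden @ unit))

rng = np.random.default_rng(0)
hidden = rng.normal(size=16)      # toy activation vector (assumption)
direction = rng.normal(size=16)   # toy concept direction (assumed already extracted)

# Steer along the SAME direction that the report measures, at several strengths.
alphas = [-4.0, -2.0, 0.0, 2.0, 4.0]
reports = [toy_self_report(steer(hidden, direction, a), direction) for a in alphas]

for a, r in zip(alphas, reports):
    print(f"alpha={a:+.1f}  report={r:+.4f}")
```

In this toy setup the report rises with the steering strength by construction. In a real experiment the report would come from the model's generated answer, and whether it tracks the steering strength is the empirical question.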
Neighborhood — ranked by edge-count
Methods (2)
method
- Concept Steering · related_to · Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
- Activation Steering · extends · Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.
Claims (1)
claim
- The coupling between LLM self-report and internal emotive state is causal, not merely correlational · associated_with · Supported by same-concept steering experiments showing monotonic shifts in self-report with activation steering
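One way to check for a monotonic shift of the kind this claim describes is a rank correlation between steering strength and self-report rating. The sketch below is an assumption about how such a check could look, not the analysis used in the supporting experiments. The `ratings` values are placeholders, not measured data.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for inputs without ties."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Placeholder values for illustration only (NOT experimental results).
strengths = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])   # steering coefficients
ratings = np.array([1.0, 2.5, 3.0, 4.5, 6.0])       # hypothetical self-report ratings

rho = spearman_rho(strengths, ratings)
print(f"Spearman rho = {rho:.3f}")
```

A rho near +1 or -1 indicates a monotonic relationship. With ties in the ratings, a tie-aware ranking would be needed instead of the double `argsort`.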
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Steering one concept direction while measuring introspection for a different concept, yielding a 4×4 steering-concept × measured-concept matrix to test concept-specific modulability
- Paradigm of finding the right direction in activation space (e.g., linear steering).
- Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
- Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time
- Addresses skeptical alternative that reports reflect only conversational content
- Paradigm of finding the right geometry (manifold) for principled control.
- Ability to steer model behavior in two opposite semantic directions on a trait.
- A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
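The first entry in the list above describes a 4×4 steering-concept × measured-concept matrix. The sketch below shows how such a matrix could be filled in on toy data. It assumes four orthonormal concept directions and a linear readout, both of which are simplifications chosen for illustration. The diagonal holds the same-concept cells and the off-diagonal cells hold the cross-concept ones.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_concepts, alpha = 32, 4, 3.0

# Four orthonormal toy concept directions (assumption: real directions need not be orthogonal).
q, _ = np.linalg.qr(rng.normal(size=(dim, n_concepts)))
directions = q.T                      # shape (4, 32), one direction per row
hidden = rng.normal(size=dim)         # toy activation vector

def readout(h):
    """Hypothetical linear readout: projection of h onto each concept direction."""
    return directions @ h

baseline = readout(hidden)

# effect[i, j] = change in the readout of concept j when steering along concept i.
effect = np.zeros((n_concepts, n_concepts))
for i in range(n_concepts):
    effect[i] = readout(hidden + alpha * directions[i]) - baseline

print(np.round(effect, 3))
```

Because the toy directions are orthonormal, the matrix comes out as `alpha` on the diagonal and zero elsewhere. In a real model the off-diagonal cells are what would show whether steering effects are concept-specific.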