Same-concept steering

Steering along the same concept direction that is being measured, to test whether shifts in internal state causally affect the model's report of that state.
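The idea can be illustrated with a toy sketch. It assumes a linear setup that the entry itself does not specify: the hidden state is a plain vector, steering adds a scaled concept direction to it, and a projection onto that same direction stands in for the measurement. In an actual experiment the measured quantity would be the model's self-report, not a projection; the names `steer` and `readout` and all values below are hypothetical.

```python
# Toy sketch of same-concept steering. Everything here is illustrative:
# a real experiment would hook a model's activations and score its self-report.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction (additive activation steering)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def readout(hidden, direction):
    """Stand-in measurement: project the hidden state onto the concept direction."""
    return dot(hidden, direction)

# Hypothetical unit-norm concept direction and baseline hidden state.
direction = [1.0, 0.0, 0.0]
hidden = [0.5, -2.0, 3.0]

# Same-concept steering: the direction pushed along is the one measured.
alphas = [-2.0, -1.0, 0.0, 1.0, 2.0]
readouts = [readout(steer(hidden, direction, a), direction) for a in alphas]
```

Because the direction is unit-norm in this toy setup, each readout is simply the baseline projection plus `alpha`.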

Neighborhood — ranked by edge-count

Methods (2)

method

Concept Steering
related_to
Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
Activation Steering
extends
Causal intervention technique: edit NLA explanation, reconstruct via AR, use difference as steering vector to manipulate model behavior.

Claims (1)

claim

The coupling between LLM self-report and internal emotive state is causal, not merely correlational
associated_with
Supported by same-concept steering experiments showing monotonic shifts in self-report under activation steering.
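One way to make "monotonic shifts" concrete is a check like the one below. This is a minimal sketch, not the procedure used in the supporting experiments: it assumes each steering strength has been paired with a scalar self-report score, and the function name `monotonic_in_strength` and the tolerance parameter `tol` are hypothetical.

```python
def monotonic_in_strength(strengths, reports, tol=0.0):
    """Check that self-report scores never decrease as steering strength grows.

    strengths: steering coefficients, in any order.
    reports:   one scalar self-report score per coefficient.
    tol:       allowed downward slack between neighbors (assumed, for noisy scores).
    """
    if len(strengths) != len(reports):
        raise ValueError("strengths and reports must have the same length")
    # Order the report scores by increasing steering strength.
    ordered = [r for _, r in sorted(zip(strengths, reports))]
    # Every score must be at least the previous one, minus the tolerance.
    return all(later >= earlier - tol for earlier, later in zip(ordered, ordered[1:]))
```

A rank correlation between strength and score would be a softer alternative when the scores are noisy; the strict check above is only the simplest reading of the claim.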

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Cross-concept steering · method · 0.841
Steering one concept direction while measuring introspection for a different concept, yielding a 4×4 steering-concept × measured-concept matrix to test concept-specific modulability.
direction-based steering · concept · 0.760
Paradigm of finding the right direction in activation space (e.g., linear steering).
Representation Steering · concept · 0.756
Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
Model Steering · concept · 0.744
Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time.
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts · claim · 0.742
Addresses the skeptical alternative that reports reflect only conversational content.
geometry-based steering · concept · 0.740
Paradigm of finding the right geometry (manifold) for principled control.
Bidirectional Steering · concept · 0.738
Ability to steer model behavior in two opposite semantic directions on a trait.
steering vectors · concept · 0.738
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
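The steering-concept × measured-concept matrix mentioned under Cross-concept steering can be sketched in the same toy terms. This is an illustration under stated assumptions, not the experimental setup: readouts are linear projections, the four concept directions are hypothetical one-hot vectors, and `steering_matrix` is an invented helper. On this reading, the diagonal entries would correspond to same-concept steering and the off-diagonal entries to cross-concept steering.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steering_matrix(directions, hidden, alpha):
    """Entry [i][j]: change in the readout of concept j after steering along concept i."""
    matrix = []
    for steer_dir in directions:
        # Additive steering along the i-th concept direction.
        steered = [h + alpha * d for h, d in zip(hidden, steer_dir)]
        # Measure every concept before and after steering; keep the difference.
        row = [dot(steered, m) - dot(hidden, m) for m in directions]
        matrix.append(row)
    return matrix

# Four hypothetical, mutually orthogonal concept directions (one-hot vectors).
directions = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
hidden = [0.0, 0.0, 0.0, 0.0]

matrix = steering_matrix(directions, hidden, alpha=2.0)
```

With orthogonal directions and a linear readout the matrix is diagonal by construction. Whether a real model shows comparably concept-specific shifts is the empirical question the matrix is meant to address.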