method
active
method:concept-steering · Concept Steering
Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
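A minimal sketch of what such a latent intervention could look like, assuming a TopK sparse autoencoder with encoder weights `W_enc`, bias `b_enc`, and decoder rows `W_dec` (all hypothetical names, not taken from the source). The activation is encoded into a sparse code, one feature is moved to a chosen target value, and the change is written back along that feature's decoder direction. It is an illustration of the general idea, not the method of any particular paper.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def topk_encode(x: Vector, W_enc: Matrix, b_enc: Vector, k: int) -> Vector:
    """Sparse code for x: keep the k largest ReLU pre-activations, zero the rest."""
    # W_enc[j] is the (assumed) encoder row for feature j.
    pre = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    keep = sorted(range(len(pre)), key=lambda j: pre[j], reverse=True)[:k]
    z = [0.0] * len(pre)
    for j in keep:
        z[j] = pre[j]
    return z


def steer(x: Vector, W_enc: Matrix, b_enc: Vector, W_dec: Matrix,
          k: int, feature: int, target: float) -> Vector:
    """Move one sparse feature to `target` and return the edited activation."""
    z = topk_encode(x, W_enc, b_enc, k)
    delta = target - z[feature]
    # Add only the change along the feature's decoder direction, so whatever
    # the autoencoder fails to reconstruct in x is left untouched.
    return [xi + delta * wi for xi, wi in zip(x, W_dec[feature])]
```

Writing back only the delta, rather than replacing the activation with the full reconstruction, is a design choice made here so that the edit is limited to one direction in activation space.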
Neighborhood — ranked by edge-count
Concepts (2)
- wrecking-ball intervention · associated_with · Type of concept steering intervention that catastrophically collapses global model performance.
- Three Operational Regimes of Steering · introduces · The three categories of SAE feature behavior under concept steering identified in the paper.
Methods (3)
- Same-concept steering · related_to · Steering using the same concept direction as is being measured, testing whether internal-state shifts causally affect the model's report of that state.
- Cross-concept steering · related_to · Steering one concept direction while measuring introspection for a different concept, yielding a 4×4 steering-concept × measured-concept matrix to test concept-specific modulability.
- Target vs. Off-Target Probe Area Metric · implements · Metric introduced to quantify steering selectivity by comparing the area of target and off-target concept probes.
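The same-concept and cross-concept entries above can be read as the diagonal and off-diagonal cells of one steering-concept × measured-concept matrix. The sketch below shows that layout under stated assumptions: `effect` is a hypothetical callable returning some scalar measurement for a (steered, measured) pair, and the concept labels are placeholders, since the source does not name the four concepts or the measurement used.

```python
from typing import Callable, List, Sequence, Tuple


def steering_matrix(concepts: Sequence[str],
                    effect: Callable[[str, str], float]) -> List[List[float]]:
    """Rows are the steered concept, columns are the measured concept."""
    return [[effect(steered, measured) for measured in concepts]
            for steered in concepts]


def diagonal_vs_off_diagonal(matrix: List[List[float]]) -> Tuple[float, float]:
    """Mean same-concept effect (diagonal) and mean cross-concept effect."""
    n = len(matrix)
    diag = [matrix[i][i] for i in range(n)]
    off = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(diag) / len(diag), sum(off) / len(off)


# Placeholder labels and a toy effect function, purely for illustration.
concepts = ["c1", "c2", "c3", "c4"]
toy_effect = lambda steered, measured: 1.0 if steered == measured else 0.25
matrix = steering_matrix(concepts, toy_effect)
same_mean, cross_mean = diagonal_vs_off_diagonal(matrix)
```

A diagonal mean well above the off-diagonal mean would be one simple sign that steering effects are concept-specific. The summary statistic used here is an assumption and not the paper's metric.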
Events (1)
- Preprint applying TopK SAEs to three EEG transformers to reveal sparse feature dictionaries, steering regimes, and spectral interpretation.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Paradigm of finding the right direction in activation space (e.g., linear steering).
- Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time.
- Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
- Central entity of Jackson's framework: a structure invented to give a coherent account of the immediate consequences of actions; the building block of software design.
- A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Novel method that applies the intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token.
- Paradigm of finding the right geometry (manifold) for principled control.
- The spatial/geometric organization of conceptual structure within neural network representations; central to the paper's thesis.