Interpretability-driven steering

General approach of using interpretability feedback to steer model generation.

Neighborhood — ranked by edge-count

method

Self-Correcting Search
implements
Technique using internal model representations as feedback loops to steer diffusion-based materials generation toward target properties.

concept

Interpretability-Driven Feedback Steering
related_to
Framework of using internal-state representations to control or steer generative models; conceptually parallel to manifold steering in language models.
Manifold Steering
associated_with
Central framework: steering neural networks by intervening along the curved manifold where a concept lives, rather than in straight lines through activation space.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

direction-based steering · concept · 0.784
Paradigm of finding the right direction in activation space (e.g., linear steering).
interpretability · concept · 0.777
The capability to explain model predictions; a central theme of the paper, with disruption profiles as the vehicle.
Representation Steering · concept · 0.771
Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
Automated Interpretability · framework · 0.760
Method using large language models (Claude) to generate and test explanations of features at scale.
Stepwise steering · method · 0.748
Novel method that applies the intervention only when the model begins a new thinking step (at the \n\n delimiter), rather than at every token.
Model Steering · concept · 0.745
Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time.
Endogenous Steering Resistance · concept · 0.744
The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs.
geometry-based steering · concept · 0.742
Paradigm of finding the right geometry (manifold) for principled control.
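
The candidate list above can be read as the result of a filter along the following lines. This is a minimal sketch, not the page's actual implementation. It assumes each entity has an embedding vector and that typed edges are stored as (source, target) pairs. The function names, the toy 2-D vectors, and the entity names in the usage lines are all hypothetical; only the 0.65 threshold comes from the heading above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs whose cosine to `target` meets the
    threshold and that share no typed edge with it, highest score first."""
    # Entities already connected to the target in either direction.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    scored = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and its typed neighbors
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: 2-D embeddings and one typed edge.
embeddings = {
    "target": [1.0, 0.0],
    "near": [0.9, 0.1],      # similar, no typed edge: should be listed
    "linked": [1.0, 0.05],   # similar, but already has a typed edge
    "far": [0.0, 1.0],       # orthogonal: below the threshold
}
typed_edges = [("target", "linked")]
candidates = untyped_neighbors("target", embeddings, typed_edges)
```

In this toy setup only "near" survives: "linked" is excluded because it already has a typed relation, and "far" falls below the threshold. Entries that pass such a filter are what the listing treats as candidates for new edges or unrecognized duplicates.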