concept
active
concept:interpretability-driven-steering

Interpretability-driven steering

General approach of using interpretability feedback to steer model generation.

Neighborhood — ranked by edge-count

Methods (1)

method
  • Technique using internal model representations as feedback loops to steer diffusion-based materials generation toward target properties.

Concepts (2)

concept
  • Framework of using internal-state representations to control or steer generative models; conceptually parallel to manifold steering in language models.
  • Manifold Steering
    associated_with
    Central framework: steering neural networks by intervening along the curved manifold where a concept lives, rather than in straight lines through activation space.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Paradigm of finding the right direction in activation space (e.g., linear steering).
  • interpretability · concept · 0.777
    The capability to explain model predictions; a central theme of the paper, with disruption profiles as the vehicle.
  • Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
  • Method using large language models (Claude) to generate and test explanations of features at scale.
  • Novel method that applies the intervention only when the model begins a new thinking step (at the \n\n delimiter) rather than at every token.
  • Model Steering · concept · 0.745
    Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time.
  • The central phenomenon introduced by this paper: inference-time recovery from irrelevant activation steering in LLMs
  • Paradigm of finding the right geometry (manifold) for principled control.
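
Two of the neighbors above describe concrete interventions: adding a vector along a direction in activation space, and applying the intervention only where a new thinking step begins. The sketch below illustrates both under simplifying assumptions and is not taken from any of the referenced papers. Plain Python lists stand in for hidden activations, the \n\n delimiter is assumed to be a single token, and "begins a new thinking step" is read as "the position immediately after a delimiter". The names add_steering_vector and steer_at_step_starts are hypothetical.

```python
import math


def add_steering_vector(hidden, direction, alpha):
    """Linear steering: return hidden + alpha * unit(direction).

    `hidden` and `direction` are plain lists of floats standing in for
    one activation vector and one steering direction (assumption).
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


def steer_at_step_starts(tokens, hiddens, direction, alpha, delimiter="\n\n"):
    """Steer only the positions that immediately follow a delimiter token.

    All other positions are copied through unchanged, in contrast to
    steering at every token.
    """
    steered = []
    for i, h in enumerate(hiddens):
        starts_step = i > 0 and tokens[i - 1] == delimiter
        if starts_step:
            steered.append(add_steering_vector(h, direction, alpha))
        else:
            steered.append(list(h))
    return steered


if __name__ == "__main__":
    tokens = ["First", "\n\n", "Second", "thought"]
    hiddens = [[0.0, 0.0] for _ in tokens]
    out = steer_at_step_starts(tokens, hiddens, direction=[1.0, 0.0], alpha=2.0)
    for tok, vec in zip(tokens, out):
        print(repr(tok), vec)
```

In a real model the same addition would typically be applied to a layer's activations during the forward pass; the gating logic only decides which token positions receive it.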