method
active
method:cross-concept-steering

Cross-concept steering

Steers along one concept direction while measuring introspection for a different concept. Covering every pairing yields a 4×4 steering-concept × measured-concept matrix, which tests whether modulability is concept-specific.
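
The matrix described above can be sketched as follows. This is a minimal illustration, not the source's implementation: the `measure` callback, the concept names, and the toy projection-based "introspection score" are all assumptions standing in for a real model and readout. Entry `[i, j]` holds the shift in the measured score for concept `j` when steering along concept `i`, relative to an unsteered baseline.

```python
import numpy as np

def cross_concept_matrix(directions, measure, alpha=1.0):
    """Build the steering-concept x measured-concept shift matrix.

    directions: dict mapping concept name -> direction vector (hypothetical).
    measure(delta, name): hypothetical callback returning the introspection
        score for `name` when `delta` is added to the activations
        (delta=None means no steering).
    """
    names = list(directions)
    baseline = np.array([measure(None, m) for m in names])
    matrix = np.zeros((len(names), len(names)))
    for i, steer in enumerate(names):
        delta = alpha * directions[steer]
        for j, measured in enumerate(names):
            # Row = steered concept, column = measured concept.
            matrix[i, j] = measure(delta, measured) - baseline[j]
    return names, matrix

# Toy stand-in for a model: four orthonormal concept directions and a score
# defined as the projection of the (steered) hidden state onto a direction.
rng = np.random.default_rng(0)
dim = 16
concepts = ["c0", "c1", "c2", "c3"]  # placeholder names
basis, _ = np.linalg.qr(rng.normal(size=(dim, len(concepts))))
directions = {c: basis[:, k] for k, c in enumerate(concepts)}
hidden = rng.normal(size=dim)

def toy_measure(delta, measured):
    h = hidden if delta is None else hidden + delta
    return float(h @ directions[measured])

names, M = cross_concept_matrix(directions, toy_measure, alpha=2.0)
print(np.round(M, 3))
```

In this toy setup the directions are orthogonal, so only the diagonal entries move. With a real model, off-diagonal mass would indicate that steering one concept also shifts reports about another, which is the question the matrix is meant to probe.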

Neighborhood — ranked by edge-count

Methods (2)

method
  • Latent intervention technique that manipulates sparse features to steer model predictions toward desired concepts.
  • Causal intervention technique: edit the NLA explanation, reconstruct it via AR, and use the difference as a steering vector to manipulate model behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Steering using the same concept direction as is being measured, testing whether internal-state shifts causally affect the model's report of that state
  • Paradigm of finding the right geometry (manifold) for principled control.
  • The method of optimizing steering interventions in activation space to produce outputs that follow the behavior manifold, independent of the representation manifold.
  • steering vectors · concept · 0.746
    A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
  • Parent concept; the practice of controlling neural network outputs by manipulating internal representations.
  • Paradigm of finding the right direction in activation space (e.g., linear steering).
  • Ability to steer model behavior in two opposite semantic directions on a trait.
  • Model Steering · concept · 0.739
    Using interventions to guide model generation behavior, e.g., adding sentiment vectors at inference time
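
Several of the entries above describe steering as adding a perturbation vector to activations. A minimal sketch of that operation, assuming activations are a NumPy array of shape `(tokens, hidden_dim)`; the function name `apply_steering` and the scaling convention are illustrative choices, not taken from any specific library:

```python
import numpy as np

def apply_steering(activations, direction, alpha):
    """Add a scaled unit-norm steering direction to every token's activation.

    A positive alpha pushes toward the concept; a negative alpha pushes in
    the opposite semantic direction.
    """
    unit = direction / np.linalg.norm(direction)
    return activations + alpha * unit  # broadcasts over the token axis

acts = np.zeros((3, 4))                 # 3 tokens, hidden size 4 (toy values)
vec = np.array([0.0, 3.0, 0.0, 4.0])    # hypothetical concept direction
steered = apply_steering(acts, vec, alpha=5.0)
reverse = apply_steering(acts, vec, alpha=-5.0)
```

Normalizing the direction first keeps `alpha` interpretable as the size of the shift, which makes steering strengths comparable across concepts.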