Behavior-based Path
concept · active · concept:behavior-based-path
The path in activation space derived by optimizing steering interventions to produce outputs along the behavior manifold, independent of representation geometry.
Neighborhood — ranked by edge-count
Methods (1)
- Pullback Steering · introduces · The method of optimizing steering interventions in activation space to produce outputs that follow the behavior manifold, independent of the representation manifold.
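The idea of deriving a path by optimizing interventions against behavioral targets can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the source: the "model" is a single frozen linear readout `W` with orthonormal rows followed by a softmax, the behavior path is a straight mixture between two output distributions, and the objective is the KL divergence minimized by plain gradient descent. The names `steer_towards`, `targets`, and `path` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(y, p):
    # KL(y || p) for strictly positive distributions
    return float(np.sum(y * (np.log(y) - np.log(p))))

rng = np.random.default_rng(0)
d, vocab = 8, 5

# Toy frozen readout (assumption): orthonormal rows keep the problem well conditioned.
Q, _ = np.linalg.qr(rng.normal(size=(d, vocab)))   # Q has shape (d, vocab)
W = Q.T                                            # W has shape (vocab, d)
h0 = 0.5 * rng.normal(size=d)                      # unsteered activation

# Toy behavior path (assumption): mixtures between the base output and a chosen target.
start = softmax(W @ h0)
end = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
targets = [(1 - t) * start + t * end for t in np.linspace(0.0, 1.0, 5)]

def steer_towards(y, delta, steps=2000, lr=1.0):
    """Gradient descent on KL(y || softmax(W (h0 + delta))) with respect to delta."""
    for _ in range(steps):
        p = softmax(W @ (h0 + delta))
        delta = delta - lr * (W.T @ (p - y))   # d KL / d logits = p - y
    return delta

delta = np.zeros(d)
path, errors = [], []
for y in targets:
    delta = steer_towards(y, delta)            # warm start from the previous point
    path.append(h0 + delta)
    errors.append(kl(y, softmax(W @ (h0 + delta))))
path = np.array(path)                          # one steered activation per target
```

In this sketch `path` plays the role of the behavior-based path: each row is an activation chosen only because its output matches a behavioral target, with no reference to how activations are arranged internally. A real model would replace the linear readout with a forward pass and the hand-written gradient with automatic differentiation.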
Concepts (2)
- behavior manifold · associated_with · One-dimensional curved surface in output probability space; the paper shows this mirrors representation manifold structure.
- Representation-based Path · analogous_to · The path in activation space derived by fitting the representation manifold, used to steer along the geometric structure of internal representations.
Findings (1)
- Key empirical result showing that optimizing for behavioral outputs and fitting representation geometry produce the same path in activation space.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- The path traced through output probability distribution space as interventions are applied to activations
- Observable behavioral pattern used to infer cognition; shared by plants and animals and proposed as evidence for sentience.
- Method of optimizing activation-space interventions to produce behavioral paths along M_y, then measuring whether the resulting activation trajectories trace M_h curvature
- Organism's belief-guided action selection that instantiates generative model and maintains phenotypic states
- A geometric space of all output token probability distributions, equipped with Hellinger distance, used to visualize model behavior.
- Method by Goldowsky-Dill et al. 2023 for localizing model behavior via targeted activation interventions
- The general experimental approach of intervening along geometrically-defined paths rather than single-point or linear-direction interventions
- World-disclosing behavior that resolves uncertainty; driven by epistemic value and novelty components of expected free energy
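One of the neighboring entries above describes a space of output token probability distributions equipped with Hellinger distance. As a reference point, the standard definition is H(p, q) = (1/√2) · ‖√p − √q‖₂, which ranges from 0 for identical distributions to 1 for distributions with disjoint support. A minimal sketch follows; the function name `hellinger` is a hypothetical helper and is not taken from the source.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

same = hellinger([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])   # identical distributions
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])         # no shared support
```

Because the distance is the Euclidean distance between square-rooted distributions (up to a constant), a path through output space can be measured by summing `hellinger` over consecutive points.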