framework
active
framework:path-specific-objectives

Path-Specific Objectives

An approach to training agents to avoid unsafe pathways leading to deception, informed by Causal Influence Diagrams.

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Path Integration · concept · 0.760
    Neural mechanism for tracking location through accumulation of self-movement vectors; shown to play the role of position encodings in TEM.
  • The path in activation space derived by optimizing steering interventions to produce outputs along the behavior manifold, independent of representation geometry.
  • care path · concept · 0.755
    A trajectory through time and space that an agent's care follows, potentially becoming infinite via the bodhisattva vow.
  • The path in activation space derived by fitting the representation manifold, used to steer along the geometric structure of internal representations.
  • Parameter specific to each task, e.g., task head.
  • The core analytical technique of expanding transformer computations from layer-by-layer products into sums of end-to-end path terms for independent analysis.
  • Path Patching · method · 0.733
    Method by Goldowsky-Dill et al. (2023) for localizing model behavior via targeted activation interventions.
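
The selection rule stated for this list (cosine ≥ 0.65 and no typed edge to this entity) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the system's actual implementation: the function names, the toy embedding vectors, and the edge set below are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs that are similar to `target` but share no typed edge.

    `embeddings` maps entity name -> vector; `typed_edges` is a set of
    (source, destination) name pairs. Both are hypothetical stand-ins.
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities already linked to the target in either direction.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data: "b" is very similar but already linked; "c" is orthogonal; "d" qualifies.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
toy_edges = {("a", "b")}
candidates = related_by_similarity("a", toy_embeddings, toy_edges)
```

With the toy data, only the entity that clears the threshold and has no existing edge is returned, which is the sense in which such entries are candidates for new edges or unrecognized duplicates.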