framework
active
framework:path-specific-objectives
Path-Specific Objectives
An approach to training agents to avoid unsafe causal pathways that lead to deception, informed by Causal Influence Diagrams.
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (1)
framework
- The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Neural mechanism for tracking location through accumulation of self-movement vectors; shown to play the role of position encodings in TEM.
- The path in activation space derived by optimizing steering interventions to produce outputs along the behavior manifold, independent of representation geometry.
- A trajectory through time and space that an agent's care follows, potentially becoming infinite via the bodhisattva vow.
- The path in activation space derived by fitting the representation manifold, used to steer along the geometric structure of internal representations.
- Parameter specific to each task, e.g., task head.
- The core analytical technique of expanding transformer computations from layer-by-layer products into sums of end-to-end path terms for independent analysis
- Method by Goldowsky-Dill et al. 2023 for localizing model behavior via targeted activation interventions