Path-Specific Objectives

An approach to training agents to avoid unsafe pathways that lead to deception, informed by Causal Influence Diagrams.

Neighborhood — ranked by edge-count

paper

framework

Self-Other Overlap (SOO) Fine-Tuning
extends
The central framework proposed in this paper: aligning AI internal representations of self and others to reduce deceptive behavior

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Path Integration · concept · 0.760
Neural mechanism for tracking location through accumulation of self-movement vectors; shown to play the role of position encodings in TEM.
Behavior-based Path · concept · 0.759
The path in activation space derived by optimizing steering interventions to produce outputs along the behavior manifold, independent of representation geometry.
care path · concept · 0.755
A trajectory through time and space that an agent's care follows, potentially becoming infinite via the bodhisattva vow.
Representation-based Path · concept · 0.755
The path in activation space derived by fitting the representation manifold, used to steer along the geometric structure of internal representations.
Goal Oriented Programming · framework · 0.744
Task-specific parameter · concept · 0.743
Parameter specific to each task, e.g., task head.
Path Expansion Method · method · 0.741
The core analytical technique of expanding transformer computations from layer-by-layer products into sums of end-to-end path terms for independent analysis.
Path Patching · method · 0.733
Method by Goldowsky-Dill et al. (2023) for localizing model behavior via targeted activation interventions.