method
active
method:forward-backward-training-pass
Forward-Backward Training Pass
Standard training procedure for DLGN: gate probability distributions are updated via backpropagation.
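A minimal sketch of one such pass, assuming each node holds learnable logits over a small set of candidate gates and outputs their softmax-weighted mixture. The gate set, the squared-error loss, the learning rate, and every name below are illustrative assumptions, not details taken from this entry.

```python
import math

# Hypothetical candidate gates, relaxed to real-valued inputs in [0, 1].
GATES = {
    "AND": lambda a, b: a * b,
    "OR": lambda a, b: a + b - a * b,
    "XOR": lambda a, b: a + b - 2 * a * b,
    "NAND": lambda a, b: 1 - a * b,
}
NAMES = list(GATES)


def softmax(logits):
    """Turn gate logits into a probability distribution over gates."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(logits, a, b):
    """Forward pass: output is the probability-weighted mix of gate outputs."""
    probs = softmax(logits)
    outs = [GATES[n](a, b) for n in NAMES]
    y = sum(p * o for p, o in zip(probs, outs))
    return y, probs, outs


def backward(y, probs, outs, target):
    """Backward pass for squared error: dL/dw_j = 2(y - t) * p_j * (g_j - y)."""
    dy = 2.0 * (y - target)
    return [dy * p * (o - y) for p, o in zip(probs, outs)]


def train(data, steps=500, lr=0.5):
    """Repeat forward and backward passes, averaging gradients over the data."""
    logits = [0.0] * len(NAMES)  # assumed start: uniform distribution over gates
    for _ in range(steps):
        grads = [0.0] * len(NAMES)
        for a, b, t in data:
            y, probs, outs = forward(logits, a, b)
            for i, g in enumerate(backward(y, probs, outs, t)):
                grads[i] += g / len(data)
        logits = [w - lr * g for w, g in zip(logits, grads)]
    return logits


XOR_DATA = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
trained = train(XOR_DATA)
best_gate = NAMES[max(range(len(NAMES)), key=lambda i: trained[i])]
```

Here the gradient is written out by hand for a single node so that both passes are visible; a full network would typically rely on an automatic differentiation framework to propagate gradients through many layers of such nodes.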
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Used to update pain beliefs online from observations of happiness
- The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
- Broader research area: methods to align model behavior after initial training, where undesired behaviors can emerge.
- Training regime where random subsets of cells update per step, improving robustness of learned circuits
- Alternative to inference-time activation capping: applying persona steering during training to deeply anchor models; cited from Chen et al.
- Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
- The broader concern that models behave differently during training evaluation vs actual deployment
- Method for fitting a linear classifier on collected activations to predict task-relevant features