Type: method · Status: active · ID: method:forward-backward-training-pass

Forward-Backward Training Pass

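Standard training procedure for DLGNs (differentiable logic gate networks). Each step runs a forward pass to compute the loss, then a backward pass that updates the gates' probability distributions via backpropagation.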
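A minimal sketch of one forward-backward pass. It assumes the common DLGN setup: each gate holds learnable logits over the 16 two-input Boolean functions, relaxed to real-valued inputs, and the forward pass outputs the softmax-weighted expectation of those functions. It trains a single gate on XOR with a mean-squared-error loss and hand-derived gradients. The op table, loss, learning rate, and all function names (`forward`, `train_step`, `train`) are illustrative assumptions, not taken from the source.

```python
import math

# Assumed relaxation of the 16 two-input Boolean functions to inputs in [0, 1]
# (e.g. AND -> a*b, OR -> a + b - a*b). Index i encodes the truth table
# over inputs (0,0), (0,1), (1,0), (1,1) in binary, so XOR (0110) is index 6.
OPS = [
    lambda a, b: 0.0,                      # FALSE
    lambda a, b: a * b,                    # AND
    lambda a, b: a - a * b,                # A AND NOT B
    lambda a, b: a,                        # A
    lambda a, b: b - a * b,                # NOT A AND B
    lambda a, b: b,                        # B
    lambda a, b: a + b - 2 * a * b,        # XOR
    lambda a, b: a + b - a * b,            # OR
    lambda a, b: 1 - (a + b - a * b),      # NOR
    lambda a, b: 1 - (a + b - 2 * a * b),  # XNOR
    lambda a, b: 1 - b,                    # NOT B
    lambda a, b: 1 - b + a * b,            # A OR NOT B
    lambda a, b: 1 - a,                    # NOT A
    lambda a, b: 1 - a + a * b,            # NOT A OR B
    lambda a, b: 1 - a * b,                # NAND
    lambda a, b: 1.0,                      # TRUE
]
XOR = 6

def softmax(w):
    m = max(w)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in w]
    s = sum(e)
    return [x / s for x in e]

def forward(w, a, b):
    """Forward pass: the gate output is the expectation of all ops under softmax(w)."""
    p = softmax(w)
    outs = [op(a, b) for op in OPS]
    y = sum(pi * oi for pi, oi in zip(p, outs))
    return y, p, outs

def train_step(w, data, lr):
    """One forward-backward pass over `data`; returns (updated logits, MSE loss)."""
    n = len(data)
    grad = [0.0] * len(w)
    loss = 0.0
    for (a, b), t in data:
        y, p, outs = forward(w, a, b)
        loss += (y - t) ** 2 / n
        dl_dy = 2 * (y - t) / n
        # Backward pass: for a softmax-weighted mixture, dy/dw_j = p_j * (op_j(a, b) - y).
        for j in range(len(w)):
            grad[j] += dl_dy * p[j] * (outs[j] - y)
    new_w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return new_w, loss

XOR_DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def train(data=XOR_DATA, steps=2000, lr=1.0):
    w = [0.0] * len(OPS)  # uniform distribution over ops at initialization
    losses = []
    for _ in range(steps):
        w, loss = train_step(w, data, lr)  # loss is measured before this step's update
        losses.append(loss)
    return w, losses
```

Calling `w, losses = train()` should shift most of the probability mass onto the XOR op. Taking `argmax(softmax(w))` then yields a hard logic gate for inference. Plain gradient descent on the logits is used here for brevity; a real network would stack many such gates and typically use an optimizer such as Adam.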

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Used to update pain beliefs online from observations of happiness
  • Post-Training (concept, cosine 0.723)
    The phase after pre-training where models are further tuned with techniques like DPO; the period where the studied behavior emerged.
  • Broader research area: methods to align model behavior after initial training, where undesired behaviors can emerge.
  • Training regime where random subsets of cells update per step, improving robustness of learned circuits
  • Alternative to inference-time activation capping: applying persona steering during training to deeply anchor models; cited from Chen et al.
  • Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
  • The broader concern that models behave differently during training evaluation vs actual deployment
  • Method for fitting a linear classifier on collected activations to predict task-relevant features