concept
active
concept:harmless-divergence
Harmless Divergence
Divergences that occur in the behavioral null space and do not affect functional claims about the model.
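As a toy illustration of the definition above, the following Python sketch treats a single linear readout as the "model" and uses its ordinary linear null space as a stand-in for the behavioral null space. Everything here is an assumption made for illustration: the matrix `W`, the activation `h`, and the helper `null_space_basis` are hypothetical, and real networks are nonlinear, so their behavioral null space is generally not a simple linear null space.

```python
import numpy as np

def null_space_basis(W, tol=1e-10):
    """Return an orthonormal basis (as columns) for the null space of W."""
    # Full SVD: rows of vt beyond the rank span the null space of W.
    _, s, vt = np.linalg.svd(W)
    rank = int(np.sum(s > tol))
    return vt[rank:].T

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))  # hypothetical linear readout: 4-d activation -> 2-d output
h = rng.normal(size=4)       # hypothetical "natural" activation

N = null_space_basis(W)      # directions that W cannot see

# A shift inside the null space: the representation diverges,
# but the readout (the "behavior") is unchanged.
harmless_shift = N @ np.array([3.0, -1.5])

# A shift along a row of W: this direction is visible to the readout.
visible_shift = W.T @ np.array([1.0, 0.0])

out = W @ h
harmless_unchanged = np.allclose(out, W @ (h + harmless_shift))
visible_changed = not np.allclose(out, W @ (h + visible_shift))

print("null-space shift leaves output unchanged:", harmless_unchanged)
print("row-space shift changes output:", visible_changed)
```

In this simplified setting, only the first kind of shift would count as harmless in the sense defined here. Whether a divergence is harmless for a real model depends on the specific functional claim being made, which this sketch does not capture.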
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (2)
concept
- Behavioral Null Space · associated_with · The span of vector directions that do not change network behavior; a key concept distinguishing MAS from model stitching.
- Pernicious Divergence · associated_with · Divergences that activate hidden pathways or cause dormant behavioral changes, undermining mechanistic claims.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Core phenomenon studied: when causal interventions shift internal representations away from the natural distribution
- Important nuance that prevents a universal classification of divergence as always good or bad
- How can we produce a principled method for classifying harmful divergence for any mechanistic claim? · question · 0.753 · Identified gap: current work lacks a general method for harmful divergence classification
- A measure of the difference between two probability distributions, used extensively in free energy formulations.
- Second core research question motivating the theoretical analysis in Section 4
- Key theoretical claim distinguishing harmless from pernicious divergence
- Prior training objective of Claude models that conflicts with the new helpful-only objective in experiments
- A set of evaluation criteria for AI assistants.