method
active
method:algorithm-1-harmlessness-classification
Algorithm 1: Harmlessness Classification
Proposed algorithm using local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Behavioral Null Space · implements · The span of vector directions that do not change network behavior; a key concept distinguishing MAS from model stitching.
Methods (1)
method
- Measures off-manifold distance by computing the orthogonal residual from the local tangent subspace
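
The off-manifold distance described in the entry above can be illustrated with a minimal sketch. It assumes the local tangent subspace is estimated by PCA over a set of neighboring points and that the distance is the Euclidean norm of the component orthogonal to that subspace. The function name `off_manifold_distance`, its parameters, and the choice of `k` are hypothetical and are not taken from the source paper.

```python
import numpy as np

def off_manifold_distance(x, neighbors, k):
    """Norm of the residual of x orthogonal to a local tangent subspace.

    Assumption (not from the source): the tangent subspace is spanned by
    the top-k principal directions of the neighbor points.
    """
    neighbors = np.asarray(neighbors, dtype=float)
    x = np.asarray(x, dtype=float)

    # Center the neighborhood; PCA directions are the right singular vectors.
    mean = neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(neighbors - mean, full_matrices=False)
    basis = vt[:k]  # shape (k, d), orthonormal rows

    # Split the offset from the local mean into tangent and orthogonal parts.
    delta = x - mean
    tangent_part = basis.T @ (basis @ delta)
    residual = delta - tangent_part
    return float(np.linalg.norm(residual))
```

For neighbors lying in the plane z = 0 and `k=2`, a query point two units above that plane gives a distance of approximately 2, while a point inside the plane gives a distance near zero. How such a distance would feed into a harmless/harmful decision (for example, a threshold) is not specified here and is left open.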
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Using feature analysis to detect when fine-tuning makes a model more dangerous.
- Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
- Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
- Constitutional AI method whose constitutions, if changed, could trigger alignment faking
- Identification of algorithms implemented in attention layers, distributed across attention heads · finding · 0.744 · VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.
- CoT improves accuracy on HHH evals and makes the decision process legible.
- A set of evaluation criteria for AI assistants.
- Algorithm used to calibrate per-latent threshold boost values for consistent first-attempt difficulty