Algorithm 1: Harmlessness Classification

Proposed algorithm that uses local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing.
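As an illustration of what a behavioral null-space test could look like, the sketch below checks whether a small step along a candidate direction leaves a function's output numerically unchanged. It is not the source's algorithm: the function name `in_behavioral_null_space`, the step size `eps`, the tolerance `tol`, and the toy linear map `W` are all assumptions introduced for this example.

```python
import numpy as np


def in_behavioral_null_space(f, x, v, eps=1e-2, tol=1e-6):
    """Finite-difference check at a single point x (hypothetical helper).

    Returns True when moving from x by eps along the unit vector v/||v||
    changes f's output by at most tol in Euclidean norm.
    """
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)  # compare directions, not magnitudes
    change = np.linalg.norm(f(x + eps * v) - f(x))
    return bool(change <= tol)


# Toy stand-in for "network behavior": a linear map that ignores the third coordinate.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
toy_network = lambda a: W @ a

x0 = np.array([0.3, -1.2, 0.7])
harmless = in_behavioral_null_space(toy_network, x0, np.array([0.0, 0.0, 1.0]))
harmful = in_behavioral_null_space(toy_network, x0, np.array([1.0, 0.0, 0.0]))
```

In this toy setup the third axis is invisible to `toy_network`, so a divergence along it would be labeled harmless, while one along the first axis would not. A real network is nonlinear, so a check like this would only hold locally around the chosen point.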

Neighborhood — ranked by edge-count

paper

concept

Behavioral Null Space
implements
The span of vector directions that do not change network behavior; a key concept distinguishing MAS from model stitching.

method

Local PCA Distance
uses
Measures off-manifold distance by computing the orthogonal residual from the local tangent subspace.
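The orthogonal-residual idea behind Local PCA Distance can be sketched with NumPy as follows. This is a minimal version under stated assumptions: the tangent subspace is estimated from the top principal directions of a set of neighboring points, and the function name `local_pca_distance`, the `neighbors` array, and the `n_components` parameter are hypothetical choices, not taken from the source.

```python
import numpy as np


def local_pca_distance(x, neighbors, n_components):
    """Norm of the part of (x - local mean) orthogonal to the local tangent subspace.

    neighbors: (k, d) array of points assumed to lie near the manifold around x.
    n_components: assumed intrinsic dimension of the local tangent subspace.
    """
    neighbors = np.asarray(neighbors, dtype=float)
    x = np.asarray(x, dtype=float)
    center = neighbors.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(neighbors - center, full_matrices=False)
    basis = vt[:n_components]                     # (r, d) orthonormal tangent basis
    offset = x - center
    tangent_part = basis.T @ (basis @ offset)     # projection onto the tangent subspace
    residual = offset - tangent_part              # off-manifold component
    return float(np.linalg.norm(residual))


# Toy example: neighbors lie in the plane z = 0, so the tangent subspace is 2-D.
plane_points = np.array([[0.0, 0.0, 0.0],
                         [1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [1.0, 1.0, 0.0]])
off_manifold = local_pca_distance([0.5, 0.5, 2.0], plane_points, n_components=2)
on_manifold = local_pca_distance([0.2, 0.9, 0.0], plane_points, n_components=2)
```

Here the query point two units above the plane yields a residual of about 2, while a point inside the plane yields a residual of about 0. How neighbors are selected and how `n_components` is chosen are left open, since the source does not specify them.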

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Fine-tuning harmfulness detection · concept · 0.771
Using feature analysis to detect when fine-tuning makes a model more dangerous.
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training" · quote · 0.762
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads.
Bai et al. 2022: Constitutional AI — harmlessness from AI feedback · concept · 0.750
Paper on AI-feedback fine-tuning as an alternative to human-feedback RLHF; cited as ref 20.
Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022b) · concept · 0.747
Constitutional AI method whose constitutions, if changed, could trigger alignment faking.
Identification of algorithms implemented in attention layers, distributed across attention heads · finding · 0.744
VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.
Chain-of-thought reasoning improves the transparency and performance of AI decision making in harmlessness evaluation. · claim · 0.742
CoT improves accuracy on HHH evals and makes the decision process legible.
Helpful, Honest, Harmless · framework · 0.739
A set of evaluation criteria for AI assistants.
Probabilistic Bisection Algorithm · method · 0.739
Algorithm used to calibrate per-latent threshold boost values for consistent first-attempt difficulty.