Fine-tuning harmfulness detection

Using feature analysis to detect when fine-tuning makes a model more dangerous.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Fine-tuning · concept · 0.817
Parameter updates that reduce mismatch dr; another anchoring variant in UCCT.
Fine Tuning and Adaptation · concept · 0.792
The patient, hand-guided adjustment of shape and dimension to each unique condition in a building; requires materials that make it economical and easy.
Fine-Tuning Threshold Recalibration · method · 0.788
Re-running probabilistic bisection on each fine-tuned checkpoint to normalize first-attempt difficulty.
Fine-Tuning via Reinforcement Learning · method · 0.781
Technique used to impose guardrails on base LLMs, analogized to censorship on the simulator's range of simulacra.
What unintended consequences might SOO fine-tuning produce in complex or real-world applications? · question · 0.779
Open research question about potential negative side effects of SOO.
SOO fine-tuning could be extended to align AI representations of its own goals with human user preferences, reducing misalignment by fostering coherence between self-related and other-related preferences · hypothesis · 0.773
Future work hypothesis about extending SOO to direct value alignment.
can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.773
Question about practical safety application of feature monitoring.
Algorithm 1: Harmlessness Classification · method · 0.771
Proposed algorithm using local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing.