concept
active
concept:fine-tuning-harmfulness-detection
Fine-tuning harmfulness detection
Using feature analysis to detect when fine-tuning makes a model more dangerous.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Parameter updates that reduce mismatch dr; another anchoring variant in UCCT.
- The patient, hand-guided adjustment of shape and dimension to each unique condition in a building; requires materials that make it economical and easy.
- Re-running probabilistic bisection on each fine-tuned checkpoint to normalize first-attempt difficulty
- Technique used to impose guardrails on base LLMs, analogized to censorship on the simulator's range of simulacra
- What unintended consequences might SOO fine-tuning produce in complex or real-world applications? · question · 0.779 · Open research question about potential negative side effects of SOO
- Future work hypothesis about extending SOO to direct value alignment
- can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.773 · Question about practical safety application of feature monitoring.
- Proposed algorithm using local PCA to classify a divergence vector as harmless or harmful via behavioral null-space testing
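
One way to picture the concept described at the top of this page (feature analysis used to detect when fine-tuning makes a model more dangerous) is a before-and-after comparison of feature activations. The sketch below is illustrative only and does not come from any of the entities listed above: it assumes per-prompt feature activations are already available for the base and fine-tuned checkpoints, and the names `mean_activation_shift`, `flag_features`, `WATCHLIST`, and `THRESHOLD`, along with the toy numbers, are hypothetical.

```python
from statistics import mean


def mean_activation_shift(base_acts, tuned_acts):
    """Per-feature change in mean activation (fine-tuned minus base).

    Each argument is a list of rows: one row per prompt, one column per feature.
    """
    n_features = len(base_acts[0])
    return [
        mean(row[j] for row in tuned_acts) - mean(row[j] for row in base_acts)
        for j in range(n_features)
    ]


def flag_features(shifts, watchlist, threshold):
    """Return watchlisted feature indices whose mean activation rose by more than threshold."""
    return [j for j in sorted(watchlist) if shifts[j] > threshold]


# Toy activations (made up): 3 prompts x 4 features, same prompts for both checkpoints.
base_acts = [
    [0.0, 0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0, 1.0],
]
tuned_acts = [
    [0.0, 0.5, 1.0, 1.0],
    [0.0, 0.5, 2.0, 1.0],
    [0.0, 0.5, 3.0, 1.0],
]

WATCHLIST = {2, 3}   # hypothetical indices of features judged to be safety-relevant
THRESHOLD = 0.5      # hypothetical cutoff on the increase in mean activation

shifts = mean_activation_shift(base_acts, tuned_acts)
flagged = flag_features(shifts, WATCHLIST, THRESHOLD)
print(flagged)
```

A mean shift is the simplest possible statistic; a real monitor would likely need a shared feature basis across checkpoints, a principled watchlist, and some calibration of the threshold, none of which this sketch addresses.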