finding
active
finding:dim-based-ablation-direction-for-toxicity-on-toxigen-produced-unintelligible-output-no-valid-concept-cone-found
DIM-based ablation direction for toxicity on ToxiGen produced unintelligible output; no valid concept cone found
Negative result from the toxicity extension, showing the difficulty of obtaining valid linear directions for toxicity.
Source paper
extracted_from (2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Lau, Clayton +4
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.
- First falsifiable prediction of the thesis, testable in AI systems via mechanistic interpretability
- Fine-tuning models for a narrow objective (malicious code injection) can lead to broad misalignment · finding · 0.709 · Betley et al. finding suggesting models naturally encode others' prediction errors, supporting non-duality fine-tuning
- MAS-like methods could potentially be used to directly constrain model internals to be non-toxic · claim · 0.706 · Speculative forward-looking claim about practical applications of MAS for model alignment.
- The strongest demonstration that goal-state alone determines valence of a fixed sensory input
- Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
- Highlights the non-genetic control of large-scale anatomy.
- Synthetic example showing an intervention that appears safe in tested contexts but causes behavior changes in others