finding

active

finding:dim-based-ablation-direction-for-toxicity-on-toxigen-produced-unintelligible-output-no-valid-concept-cone-found

DIM-based ablation direction for toxicity on ToxiGen produced unintelligible output; no valid concept cone found

Negative result from the paper's toxicity extension, showing how difficult it is to obtain valid linear directions for toxicity
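For readers unfamiliar with the setup, the sketch below illustrates what a DIM-style ablation direction generally looks like, assuming DIM refers to the usual difference-in-means construction: subtract the mean activation of one class from the mean of the other, normalize, and project that direction out of an activation. This is not the paper's implementation. The function names (`mean_vector`, `dim_direction`, `ablate`) and the toy 3-dimensional activations are hypothetical, and the example does not reproduce the unintelligible outputs reported for ToxiGen.

```python
from math import sqrt

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dim_direction(toxic_acts, benign_acts):
    """Unit-norm difference-in-means direction: mean(toxic) - mean(benign)."""
    diff = [t - b for t, b in zip(mean_vector(toxic_acts), mean_vector(benign_acts))]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activations, made up for illustration only (not ToxiGen data).
toxic_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
benign_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]

direction = dim_direction(toxic_acts, benign_acts)
ablated = ablate([3.0, 1.0, 0.5], direction)

# After ablation, the activation has (numerically) no component along the direction.
residual = sum(a * d for a, d in zip(ablated, direction))
```

Projection removes only a single linear direction. If the concept is not well captured by one direction, or the extracted direction is entangled with features the model needs for fluent generation, ablation can degrade output quality, which is consistent with the kind of failure this finding records.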

Source paper

extracted_from

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

(2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4

Neighborhood — ranked by edge-count

Papers (1)

paper

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
introduces

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Feature attribution correlates well with ablation effects, making it an efficient proxy for causal effect · claim · 0.724
Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.
Selectively ablating components responsible for computing goal-relative error should simultaneously prevent policy updates and eliminate coherent valenced experience reports · hypothesis · 0.712
First falsifiable prediction of the thesis, testable in AI systems via mechanistic interpretability
Fine-tuning models for a narrow objective (malicious code injection) can lead to broad misalignment · finding · 0.709
Betley et al. finding suggesting models naturally encode others' prediction errors, supporting non-duality fine-tuning
MAS-like methods could potentially be used to directly constrain model internals to be non-toxic · claim · 0.706
Speculative forward-looking claim about practical applications of MAS for model alignment.
Positive expectancy doubled the analgesic benefit of remifentanil while negative expectancy completely abolished it, with drug concentration and thermal stimulation held fixed within the same participants · finding · 0.704
The strongest demonstration that goal-state alone determines valence of a fixed sensory input
In Gemma-2-9B, only the first cone axis (v1) has non-negligible cosine similarity to the DIM direction; all other axes have near-zero similarity (~1e-9) · finding · 0.699
Experiment 4 result showing DIM captures only one facet of the multi-dimensional truth subspace
Target morphology shifts occur despite the fact that all of the individual cells have unaltered normal genomes, showing that competent subunits can be pushed to implement diverse organism-scale goals by physiological signals · claim · 0.693
Highlights the non-genetic control of large-scale anatomy.
An intervention benign at context v4 < 0.75 produces a class-C behavioral flip at 0.75 < v4 < 1, demonstrating dormant behavioral changes from latent divergence · finding · 0.693
Synthetic example showing an intervention that appears safe in tested contexts but causes behavior changes in others