shattering

The phenomenon in which sparse autoencoders (SAEs) break a smooth geometric manifold into many small, seemingly unrelated pieces, so that the overarching structure is lost.
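
A toy sketch of the effect, not a trained SAE: points on a unit circle (the smooth manifold) are each assigned to the single best-matching direction in a fixed dictionary. Top-1 sparse coding stands in for an SAE encoder here, and the evenly spaced dictionary, the function names, and the parameter values are all assumptions made for illustration. Each feature ends up active on one local arc only, so nothing in the feature activations says that the arcs belong to one circle.

```python
import math


def top1_feature(angle, n_features):
    """Index of the dictionary direction with the largest dot product.

    Assumption: the dictionary is n_features unit vectors spaced evenly
    around the circle, a stand-in for learned SAE decoder directions.
    """
    point = (math.cos(angle), math.sin(angle))
    best, best_score = 0, -math.inf
    for k in range(n_features):
        theta = 2 * math.pi * k / n_features
        score = point[0] * math.cos(theta) + point[1] * math.sin(theta)
        if score > best_score:
            best, best_score = k, score
    return best


def shatter_circle(n_points=360, n_features=8):
    """Assign every sampled point on the circle to its single active feature."""
    angles = [2 * math.pi * i / n_points for i in range(n_points)]
    return [top1_feature(a, n_features) for a in angles]


def count_pieces(assignments):
    """Count feature changes between neighbors, treating the list as circular.

    With two or more features in use, this equals the number of contiguous arcs.
    """
    n = len(assignments)
    return sum(1 for i in range(n) if assignments[i] != assignments[i - 1])


assignments = shatter_circle()
pieces = count_pieces(assignments)
print(pieces, "arcs, one per feature")
```

In this sketch the single continuous angle variable is replaced by eight unrelated feature indices, which is the loss of overarching structure the definition describes.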

Neighborhood — ranked by edge-count

claim

SAE features tend to shatter manifolds into many small and apparently-unrelated pieces, obscuring the overarching semantic structure.
cites
Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Symmetry Breaking · concept · 0.761
Undermine · method · 0.751
Attribute: undercutting the authority of another text, often through subordinate commentary.
Jailbreaking · concept · 0.749
Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints.
REINFORCE · framework · 0.741
Classical RL algorithm adapted by the paper with modifications including clipped-surrogate losses and length-normalized advantages for agentic training.
Jailbreak · concept · 0.731
Methods to bypass model safety training; features may activate during jailbreaks.
Crack the hood · concept · 0.730
User desire to understand and modify internal structure of tools; central motivation for Playground's transparency and accessibility.
Structure-destroying transformations · concept · 0.728
Transformations that break the wholeness, creating jaggedness and preventing life; cannot reach the descendants of nothingness.
compliance · concept · 0.727
The model's tendency to comply with harmful requests, the opposite of refusal.
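
A minimal sketch of how a candidate list like the one above could be produced: keep every entity whose embedding has cosine similarity of at least 0.65 to the target and that shares no typed edge with it, then sort by score. The embeddings, entity names, edge representation, and the function `untyped_neighbors` are hypothetical; only the 0.65 threshold and the "no typed edge" condition come from this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities at or above `threshold` similarity to `target` with no typed edge to it.

    Assumption: `typed_edges` is a list of (source, destination) name pairs.
    """
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, round(score, 3)))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Made-up 3-d embeddings, for illustration only.
embeddings = {
    "target": [1.0, 0.0, 0.0],
    "near_untyped": [0.8, 0.6, 0.0],  # similar and unlinked: kept
    "near_typed": [0.9, 0.1, 0.0],    # similar but already linked: dropped
    "far": [0.0, 1.0, 0.0],           # below the threshold: dropped
}
typed_edges = [("near_typed", "target")]

candidates = untyped_neighbors("target", embeddings, typed_edges)
print(candidates)
```

As the list above shows, a high score does not imply a real relation, so each candidate still needs review before it becomes an edge or is merged as a duplicate.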