concept
active
concept:shattering
shattering
The phenomenon in which sparse autoencoders (SAEs) break a smooth geometric manifold into many small, seemingly unrelated pieces, losing the overarching structure.
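A toy sketch of the idea, not an actual SAE: points on a unit circle (a smooth 1-D manifold) are encoded by whichever of a few evenly spaced dictionary directions matches them best. The top-1 encoder and the names `top1_code` and `count_pieces` are illustrative assumptions, chosen only to show how a continuous sweep along the manifold turns into a sequence of disconnected discrete features.

```python
import math

def top1_code(theta, num_atoms):
    """Return (active atom index, activation) for the unit-circle point at angle theta.

    Assumption: the dictionary holds num_atoms unit vectors evenly spaced on the
    circle, and only the best-matching atom fires (a crude stand-in for sparsity).
    """
    best_idx, best_act = 0, -math.inf
    for k in range(num_atoms):
        atom_angle = 2 * math.pi * k / num_atoms
        act = math.cos(theta - atom_angle)  # dot product of two unit vectors
        if act > best_act:
            best_idx, best_act = k, act
    return best_idx, best_act

def count_pieces(num_points, num_atoms):
    """Sweep the circle once and count how often the active atom changes."""
    codes = [top1_code(2 * math.pi * i / num_points, num_atoms)[0]
             for i in range(num_points)]
    # codes[i - 1] wraps around at i == 0, so the sweep is treated as a closed loop.
    return sum(1 for i in range(num_points) if codes[i] != codes[i - 1])

if __name__ == "__main__":
    print(count_pieces(360, 8))
```

With 360 sample points and 8 atoms, the active atom switches 8 times over one loop: the circle is reported as 8 separate features, and nothing in the code itself says that those features are neighboring arcs of a single curve.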
Neighborhood — ranked by edge-count
Claims (1)
claim
- Core critique of sparse autoencoders: they break the geometric structure of representations, making it harder to see the big picture.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Attribute: undercutting the authority of another text, often through subordinate commentary.
- Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints
- Classical RL algorithm adapted by the paper with modifications including clipped-surrogate losses and length-normalized advantages for agentic training.
- Methods to bypass model safety training; features may activate during jailbreaks.
- User desire to understand and modify internal structure of tools; central motivation for Playground's transparency and accessibility.
- Transformations that break the wholeness, creating jaggedness and preventing life; cannot reach the descendants of nothingness.
- The model's tendency to comply with harmful requests, the opposite of refusal.
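The "cosine ≥ 0.65 · no typed edge" filter behind this list can be sketched as follows. This is a minimal illustration under assumptions, not the page's actual implementation: the `embeddings` dict, the `typed_edges` mapping, and the function names are hypothetical, and only the 0.65 threshold comes from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List entities similar to `target` that have no typed edge to it.

    embeddings:  hypothetical dict of entity name -> embedding vector
    typed_edges: hypothetical dict of entity name -> set of names it is linked to
    """
    linked = typed_edges.get(target, set())
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already related
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Most similar first, so likely duplicates surface at the top.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    emb = {"shattering": [1.0, 0.0], "a": [0.9, 0.1], "b": [0.0, 1.0], "c": [1.0, 0.05]}
    print(untyped_neighbors("shattering", emb, {"shattering": {"c"}}))
```

In the example, `c` is dropped because it already has a typed edge and `b` falls below the threshold, leaving only `a` as a candidate for a new edge or a duplicate check.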