concept
active
concept:shattering

shattering

The phenomenon where sparse autoencoders (SAEs) break a smooth geometric manifold into many small, seemingly unrelated pieces, so that the manifold's overarching structure is lost.
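To make the definition concrete, here is a minimal toy sketch. It assumes a hand-built dictionary of unit directions and a top-1 sparse encoder standing in for a trained SAE. The names `circle_points`, `make_dictionary`, `encode_top1` and `shatter` are hypothetical. The smooth manifold is the unit circle, and each latent claims one arc of it.

```python
import math

def circle_points(n):
    """Sample n points evenly along the unit circle: a smooth 1-D manifold in 2-D."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def make_dictionary(k):
    """Hypothetical stand-in for an SAE decoder: k unit directions tiling the circle."""
    return [(math.cos(2 * math.pi * j / k), math.sin(2 * math.pi * j / k))
            for j in range(k)]

def encode_top1(x, dictionary):
    """Toy sparse code: keep only the single most aligned latent (top-1 sparsity)."""
    scores = [x[0] * d[0] + x[1] * d[1] for d in dictionary]
    best = max(range(len(dictionary)), key=lambda j: scores[j])
    return best, scores[best]

def shatter(n_points=360, k=12):
    """Group manifold points by the latent that fires for them."""
    dictionary = make_dictionary(k)
    pieces = {}
    for x in circle_points(n_points):
        latent, _ = encode_top1(x, dictionary)
        pieces.setdefault(latent, []).append(x)
    return pieces
```

With 360 points and 12 directions, `shatter(360, 12)` yields 12 pieces of roughly 30 points each. The top-1 code does not record that piece 0 borders piece 1 on the circle while piece 6 lies opposite. Every pair of distinct latents is equally disjoint, which is the loss of overarching structure the definition describes.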

Neighborhood — ranked by edge-count

Claims (1)

claim

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • Symmetry Breaking · concept · 0.761
  • Undermine · method · 0.751
    Attribute: undercutting the authority of another text, often through subordinate commentary.
  • Jailbreaking · concept · 0.749
    Users coaxing dialogue agents into issuing threats or toxic content by overriding intended persona constraints.
  • REINFORCE · framework · 0.741
    Classical RL algorithm adapted by the paper with modifications including clipped-surrogate losses and length-normalized advantages for agentic training.
  • Jailbreak · concept · 0.731
    Methods to bypass model safety training; features may activate during jailbreaks.
  • Crack the hood · concept · 0.730
    User desire to understand and modify internal structure of tools; central motivation for Playground's transparency and accessibility.
  • Transformations that break the wholeness, creating jaggedness and preventing life; cannot reach the descendants of nothingness.
  • compliance · concept · 0.727
    The model's tendency to comply with harmful requests, the opposite of refusal.
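The page does not say how entity embeddings are produced or how typed edges are stored. The sketch below shows one way the "cosine ≥ 0.65 · no typed edge" rule could be applied. It assumes each entity is a plain embedding vector and that typed edges form a set of (source, target) pairs, treated here as undirected. `cosine` and `related_by_similarity` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs whose cosine to target is >= threshold and that
    share no typed edge with target, sorted from most to least similar."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Assumption: a typed edge in either direction excludes the candidate.
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Filtering out typed neighbors first keeps the list focused on entities that are close in embedding space yet unlinked, which is what makes them candidates for new edges or undetected duplicates.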