finding
active
finding:unsupervised-behavior-clustering-surfaces-concerning-learned-patterns-without-prior-labels
Unsupervised behavior clustering surfaces concerning learned patterns without prior labels
Empirical finding: unsupervised clustering reveals problematic patterns without needing labeled data.
Source paper
extracted_from · (2026) · Frank Xiao · Santiago Aranguri
Neighborhood — ranked by edge-count
Claims (1)
claim
- Authors' central interpretive assertion that their method meaningfully mitigates unwanted behaviors.
Methods (1)
method
- Probe-Based Data Attribution · supports · Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
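
The probe idea in the method entry above can be illustrated with a minimal sketch. This is not the authors' implementation: the activations are synthetic, the "undesired behavior" is a planted direction in activation space, and the names `train_probe` and `rank_training_points` are hypothetical. It only shows the general shape of the technique: fit a linear classifier on activations, then score training datapoints with it.

```python
import numpy as np

def train_probe(acts, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe (weights, bias) by gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted probability
        grad = p - labels                          # gradient of log-loss w.r.t. logits
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def rank_training_points(train_acts, w, b, top_k=5):
    """Score each training datapoint with the probe; return the top_k indices."""
    scores = train_acts @ w + b
    return np.argsort(-scores)[:top_k], scores

rng = np.random.default_rng(0)
dim = 8
direction = np.zeros(dim)
direction[0] = 1.0  # assumed: the behavior lives along one activation direction

# Synthetic activations for benign vs. undesired model outputs.
benign = rng.normal(size=(100, dim))
undesired = rng.normal(size=(100, dim)) + 3.0 * direction
acts = np.vstack([benign, undesired])
labels = np.concatenate([np.zeros(100), np.ones(100)])
w, b = train_probe(acts, labels)

# Synthetic training-set activations; the first five carry the planted pattern.
train_acts = rng.normal(size=(50, dim))
train_acts[:5] += 10.0 * direction
top_idx, scores = rank_training_points(train_acts, w, b, top_k=5)
print(sorted(top_idx.tolist()))
```

In this toy setup the highest-scoring training points are the ones that carry the planted direction. With real model activations, which layer to read and how to label the behavior examples are separate choices that this sketch does not address.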
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Method that clusters behaviors without prior labels, used to surface concerning learned patterns.
- Learning that builds a low-dimensional model of input data without error signals or rewards; Hebbian learning is an example.
- Grouping similar model behaviors; the unsupervised method surfaces clusters of concerning patterns.
- Clarifies what unsupervised learning does.
- Probing approach that avoids supervision to sidestep the complexity-accuracy tradeoff.
- Layer-wise trajectories show early enrichment, mid-layer alignment, and late re-clustering. · claim · 0.753 · Qualitative geometry pattern.
- Central claim from connectionist models: complex coordination emerges without centralized control or external teacher.
- Decoder cosine similarity maps onto concept similarity.