k-means clustering

Unsupervised feature-finding method that uses the difference between cluster centroids as the feature direction
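
As a rough illustration of the idea above, the sketch below clusters a set of vectors into two groups and takes the difference of the two centroids as a candidate feature direction. It makes several assumptions that this page does not state: k = 2, inputs are rows of a NumPy array (for example model activations), the direction is normalized to unit length, and the function names `kmeans` and `centroid_difference_direction` are hypothetical. The sign of the resulting direction is arbitrary because cluster order depends on initialization.

```python
import numpy as np

def kmeans(X, k=2, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    def assign(c):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Recompute each centroid; keep the old one if a cluster is empty.
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, assign(centroids)

def centroid_difference_direction(X, seed=0):
    """Assumed reading of the method: unit vector between two centroids."""
    centroids, labels = kmeans(X, k=2, seed=seed)
    direction = centroids[1] - centroids[0]
    return direction / np.linalg.norm(direction), labels
```

Projecting a new vector `x` onto the returned direction (`x @ direction`) then gives a scalar score along the candidate feature; whether that score is causally meaningful is a separate question that the clustering itself does not answer.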

Neighborhood — ranked by edge-count

framework

CausalGym
uses
Multi-task benchmark of linguistic behaviours for measuring causal efficacy of interpretability methods, adapted from SyntaxGym

method

K-Means Clustering of User Messages
related_to
Clustering user message embeddings to identify categories causing persona drift vs. maintenance

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Behavior Clustering · concept · 0.784
Grouping similar model behaviors; the unsupervised method surfaces clusters of concerning patterns.
Cluster 1 (4-cluster): CONTRAST, NOT-SEPARATENESS, ROUGHNESS, ALTERNATING REPETITION, GOOD SHAPE · finding · 0.782
First cluster of the four-cluster grouping, containing properties 9, 15, 11, 4, 6.
Unsupervised Behavior Clustering · method · 0.779
Method that clusters behaviors without prior labels, used to surface concerning learned patterns.
Notation Cluster · concept · 0.745
Cluster 4 (4-cluster): SIMPLICITY AND INNER CALM, ECHOES, GRADIENTS, POSITIVE SPACE · finding · 0.743
Fourth cluster of the four-cluster grouping, containing properties 14, 12, 10, 5.
Cluster 2 (5-cluster): ROUGHNESS, ALTERNATING REPETITION, GOOD SHAPE · finding · 0.737
Second cluster of the five-cluster grouping, containing properties 11, 4, 6.
Cluster 2 (4-cluster): LOCAL SYMMETRIES, THE VOID, LEVELS OF SCALE, GOOD SHAPE, POSITIVE SPACE · finding · 0.734
Second cluster of the four-cluster grouping, containing properties 7, 13, 1, 6, 5.
Errors Cluster · concept · 0.726
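
The "cosine ≥ 0.65 · no typed edge" filter behind the list above could be sketched roughly as follows. This is only an illustration under assumptions: how this page actually stores embeddings and edges is not stated, so the dict-of-vectors layout, the pair-based edge list, and the function name `untyped_neighbors` are all hypothetical.

```python
import numpy as np

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List entities similar to `target` that have no typed edge to it.

    embeddings:  dict mapping entity name -> 1-D NumPy vector (assumed layout)
    typed_edges: iterable of (entity_a, entity_b) pairs, treated as undirected
    Returns (name, cosine) pairs sorted by descending cosine similarity.
    """
    t = embeddings[target]
    t = t / np.linalg.norm(t)

    # Entities already connected to the target by a typed relation.
    linked = {b for a, b in typed_edges if a == target}
    linked |= {a for a, b in typed_edges if b == target}

    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue
        cos = float(np.dot(t, vec / np.linalg.norm(vec)))
        if cos >= threshold:
            candidates.append((name, cos))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Each returned pair is a candidate for either a new typed edge or a duplicate-entity check; the threshold alone does not distinguish the two cases.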