method
active
method:k-means-clustering · k-means clustering
Unsupervised feature-finding method that uses the difference between cluster centroids as the feature direction
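The description above leaves the details open. Below is a minimal sketch of the idea. It assumes the inputs are model activation vectors that are split into k = 2 clusters with plain Lloyd's k-means under Euclidean distance, and it takes the normalized difference of the two centroids as the feature direction. The choice of k, the initialization, and the helper names `kmeans` and `centroid_difference_direction` are illustrative assumptions and do not come from the source. Pure Python is used so the sketch has no dependencies.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float vectors (illustrative)."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's old centroid
        if new_centroids == centroids:
            break  # assignments are stable, so the centroids stopped moving
        centroids = new_centroids
    return centroids, labels


def centroid_difference_direction(points, seed=0):
    """Hypothetical helper: unit vector from centroid 0 to centroid 1 of a 2-means split."""
    (c0, c1), labels = kmeans(points, k=2, seed=seed)
    diff = [a - b for a, b in zip(c1, c0)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("centroids coincide; no direction can be defined")
    return [d / norm for d in diff], labels


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy "activations": two groups that differ only along axis 0.
    group_a = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(20)]
    group_b = [[rng.gauss(3.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(20)]
    direction, labels = centroid_difference_direction(group_a + group_b)
    print(direction, labels)
```

On the toy data the recovered direction lies almost entirely along axis 0, the axis that separates the two groups. Its sign depends on which cluster happens to receive label 1, so code that uses the direction should orient it first, for example against a known example.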
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- CausalGym · uses · Multi-task benchmark of linguistic behaviours for measuring the causal efficacy of interpretability methods, adapted from SyntaxGym
Methods (1)
method
- K-Means Clustering of User Messages · related_to · Clustering user message embeddings to identify which message categories cause persona drift vs. persona maintenance
Related by similarity (8)
cosine ≥ 0.65 · no typed edge. Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Grouping similar model behaviors; the unsupervised method surfaces clusters of concerning patterns.
- Cluster 1 (4-cluster): CONTRAST, NOT-SEPARATENESS, ROUGHNESS, ALTERNATING REPETITION, GOOD SHAPE · finding · 0.782 · First cluster of the four-cluster grouping, containing properties 9, 15, 11, 4, 6.
- Method that clusters behaviors without prior labels, used to surface concerning learned patterns.
- Fourth cluster of the four-cluster grouping, containing properties 14, 12, 10, 5.
- Second cluster of the five-cluster grouping, containing properties 11, 4, 6.
- Cluster 2 (4-cluster): LOCAL SYMMETRIES, THE VOID, LEVELS OF SCALE, GOOD SHAPE, POSITIVE SPACE · finding · 0.734 · Second cluster of the four-cluster grouping, containing properties 7, 13, 1, 6, 5.
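The "Related by similarity" rule stated above (cosine ≥ 0.65 and no typed edge) amounts to a simple filter over the graph. Below is a minimal sketch. It assumes that each entity has an embedding vector and that typed edges are stored as undirected pairs of entity names. The entity names, the toy 2-D embeddings, and the `similarity_candidates` helper are hypothetical and are not part of the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot / (nu * nv)


def similarity_candidates(target, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: entities with cosine >= threshold to `target` and no typed edge to it.

    Returns (name, score) pairs with the highest score first.
    """
    # Collect every entity already linked to the target by a typed edge, in either direction.
    linked = {b for a, b in typed_edges if a == target} | {a for a, b in typed_edges if b == target}
    out = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and its typed neighbors
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            out.append((name, score))
    return sorted(out, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    embeddings = {
        "k-means-clustering": [1.0, 0.0],
        "cluster-1": [0.9, 0.3],   # similar, no typed edge: a candidate
        "causalgym": [0.8, 0.2],   # similar, but already linked by a typed edge
        "unrelated": [0.0, 1.0],   # below the threshold
    }
    typed_edges = [("k-means-clustering", "causalgym")]
    print(similarity_candidates("k-means-clustering", embeddings, typed_edges))
```

The typed-edge exclusion is what separates this list from the "Neighborhood" lists, which contain only entities that already have a typed relation.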