Probe score

Dot product between the hidden state and the concept vector, averaged across a 5-layer window around the best layer; measures the model's internal emotive state.
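The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the source's implementation. It assumes a single concept vector shared across layers, a window centered on the best layer and clipped at the ends of the layer stack, and hidden states taken at one token position. The name `probe_score` and its parameters are hypothetical.

```python
import numpy as np

def probe_score(hidden_states, concept_vector, best_layer, window=5):
    """Average the per-layer dot product over a window around best_layer.

    hidden_states: array of shape (num_layers, d_model), one token position
                   (assumption: the source does not say which position is used).
    concept_vector: array of shape (d_model,), shared across layers (assumption).
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    concept_vector = np.asarray(concept_vector, dtype=float)
    num_layers = hidden_states.shape[0]

    # Centered window, clipped so it never runs past the first or last layer.
    half = window // 2
    lo = max(0, best_layer - half)
    hi = min(num_layers, best_layer + half + 1)

    # One dot product per layer in the window, then the mean.
    per_layer = hidden_states[lo:hi] @ concept_vector
    return float(per_layer.mean())
```

With `window=5` and `best_layer=3`, the score averages layers 1 through 5; near the edges of the stack the window simply shrinks.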

Neighborhood — ranked by edge-count

concept

Previous-turn probe score
extends
Internal state measured at the turn BEFORE the self-report question is appended, so that the measurement reflects the model's spontaneous internal state, uncontaminated by the rating question.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
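The selection rule for this list (cosine similarity of at least 0.65, no typed edge to this entity) might look roughly like the sketch below. It assumes each entity has an embedding vector and that typed edges are available as a set of entity names; `untyped_neighbors`, `embeddings`, and `typed_edges` are hypothetical names, not part of any documented API.

```python
import numpy as np

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, cosine) pairs for close entities with no typed edge.

    embeddings: dict mapping entity name -> vector (assumed representation).
    typed_edges: set of entity names already linked to `target` by a typed edge.
    """
    t = np.asarray(embeddings[target], dtype=float)
    candidates = []
    for name, vec in embeddings.items():
        if name == target or name in typed_edges:
            continue  # skip the entity itself and anything already linked
        v = np.asarray(vec, dtype=float)
        cosine = float(t @ v / (np.linalg.norm(t) * np.linalg.norm(v)))
        if cosine >= threshold:
            candidates.append((name, cosine))
    # Highest similarity first, matching the descending order of the list.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high cosine may indicate a missing edge, a duplicate, or merely a shared vocabulary.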

Probes · concept · 0.820
Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
Pass Rate Scoring · method · 0.766
Primary metric for all benchmarks, measuring the fraction of tasks that meet benchmark-specific pass criteria.
Liar Score · concept · 0.754
Continuous 0-1 metric assigned by a DeepSeek-V3 evaluator, measuring the degree of deception in model responses.
Diagnostic Probing · method · 0.752
Earlier interpretability method that applies classifiers to DNN hidden representations; shares the complexity-accuracy dilemma with causal abstraction.
Probing Methods · method · 0.751
Top-down interpretability approach that studies linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach.
Probe-Based Data Attribution · method · 0.745
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
Judge Model Scoring · method · 0.739
Claude 4.5 Haiku used to segment responses into attempts and score each attempt 0-100 for relevance.
Factor Analysis on Scoring Dimensions · method · 0.734
Factor analysis on 2224 data points revealing that PC1 explains 82% of variance; the six dimensions are not independent.