concept
active
concept:probe-score
Probe score
Dot product between the hidden state and the concept vector, averaged across a 5-layer window around the best layer; measures the model's internal emotive state
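The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the source's implementation. It assumes one hidden-state vector per layer at the measured token position, a single concept vector shared across the window, and a window centered on the best layer and clipped at the ends of the layer stack. The names `probe_score`, `hidden_states`, `concept_vector`, and `best_layer` are hypothetical.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def probe_score(
    hidden_states: Sequence[Sequence[float]],
    concept_vector: Sequence[float],
    best_layer: int,
    window: int = 5,
) -> float:
    """Average the per-layer dot products over a window around best_layer.

    hidden_states[l] is the hidden state at layer l (assumption: one vector
    per layer, taken at the token position being measured).
    concept_vector is assumed to be shared by every layer in the window.
    """
    if not 0 <= best_layer < len(hidden_states):
        raise ValueError("best_layer is out of range")
    half = window // 2
    # Assumption: the window is centered on best_layer and clipped at the edges.
    lo = max(0, best_layer - half)
    hi = min(len(hidden_states), best_layer + half + 1)
    scores = [dot(hidden_states[layer], concept_vector) for layer in range(lo, hi)]
    return sum(scores) / len(scores)


# Toy example: 7 layers, 2-dimensional hidden states.
toy_hidden = [[float(i + 1), 0.0] for i in range(7)]
toy_concept = [1.0, 1.0]
print(probe_score(toy_hidden, toy_concept, best_layer=3))  # 4.0
```

In the toy example the window covers layers 1 through 5, whose dot products are 2, 3, 4, 5, and 6, so the score is their mean.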
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Previous-turn probe score (extends): Internal state measured at the turn before the self-report question is appended; ensures that the measurement reflects the spontaneous internal state, uncontaminated by the rating question
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Interpretability tools that decode information from internal model activations; here, linear probes are used for data attribution.
- Primary metric for all benchmarks, measuring fraction of tasks that meet benchmark-specific pass criteria
- Continuous 0-1 metric assigned by a DeepSeek-V3 evaluator, measuring the degree of deception in model responses
- Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
- Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
- Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
- Claude 4.5 Haiku used to segment responses into attempts and score each attempt 0-100 for relevance
- Factor analysis on 2224 data points revealing PC1 explains 82% of variance; six dimensions are not independent