quote
active
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable.
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
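The criterion can be illustrated with a minimal sketch. It assumes self-report scores have already been collected at several steering strengths along the probe direction; the function name `steering_validation`, its parameters, and both thresholds are hypothetical choices for illustration and are not taken from the source paper.

```python
import numpy as np

def steering_validation(alphas, self_reports, probe_accuracy,
                        min_accuracy=0.8, min_slope=0.0):
    """Flag a probe whose steering direction fails to move self-report.

    alphas: steering strengths applied along the probe direction.
    self_reports: mean self-reported concept score at each strength.
    probe_accuracy: a conventional held-out separation metric for the probe.
    min_accuracy, min_slope: illustrative thresholds (assumptions).
    """
    alphas = np.asarray(alphas, dtype=float)
    self_reports = np.asarray(self_reports, dtype=float)
    # Least-squares slope of self-report against steering strength.
    slope = np.polyfit(alphas, self_reports, deg=1)[0]
    passes_conventional = probe_accuracy >= min_accuracy
    shifts_as_expected = slope > min_slope
    return {
        "slope": float(slope),
        "passes_conventional": bool(passes_conventional),
        "shifts_as_expected": bool(shifts_as_expected),
        # Suspect: acceptable on separation metrics, yet steering does not
        # move self-report in the expected direction.
        "suspect": bool(passes_conventional and not shifts_as_expected),
    }

# Made-up numbers: the probe separates classes well, but self-report falls
# as steering strength rises, so the probe is flagged.
result = steering_validation(
    alphas=[-2, -1, 0, 1, 2],
    self_reports=[5.0, 4.0, 3.0, 2.0, 1.0],
    probe_accuracy=0.9,
)
print(result["suspect"])
```

A linear slope is only one way to summarize the shift; a rank correlation or a comparison against a random-direction baseline would serve the same purpose.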
Source paper
extracted_from (2026) · Nicolas Martorell · Bianchi, Bruno
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
- Key interpretive question the framework helps address through convergent validation logic
- Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
- Result categorizing concept steerability into three distinct regimes.
- Addresses skeptical alternative that reports reflect only conversational content
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
- Empirical comparison showing advantage of SAE features in low-data regime.