quote
active
quote:if-steering-in-a-purported-concept-direction-does-not-shift-self-report-in-the-expected-direction-probe-quality-becomes-suspect-especially-when-conventional-probe-metrics-alone-looked-acceptable

If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable.

Key methodological insight: introspection enables a new probe-validation criterion that goes beyond conventional separation metrics.
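The criterion can be illustrated with a minimal sketch. It assumes self-report scores have already been collected at several steering strengths along the probe direction. The function name `steering_check`, the `min_corr` threshold, and the toy numbers are all hypothetical and are not taken from the source paper. The check passes only when self-report moves in the expected direction as steering strength increases.

```python
import numpy as np

def steering_check(alphas, reports, expected_sign=1, min_corr=0.5):
    """Check whether self-report shifts in the expected direction under steering.

    alphas:        steering strengths applied along the probe direction
    reports:       self-report score observed at each strength
    expected_sign: +1 if reports should rise with alpha, -1 if they should fall
    min_corr:      hypothetical minimum |Pearson r| for a pass (an assumption)
    """
    alphas = np.asarray(alphas, dtype=float)
    reports = np.asarray(reports, dtype=float)
    # Least-squares slope of self-report against steering strength.
    slope = np.polyfit(alphas, reports, 1)[0]
    # Pearson correlation between steering strength and self-report.
    corr = np.corrcoef(alphas, reports)[0, 1]
    passed = bool(np.sign(slope) == expected_sign and abs(corr) >= min_corr)
    return slope, corr, passed

# Toy data (invented for illustration only).
alphas = [-2, -1, 0, 1, 2]
responsive = [2.0, 3.0, 5.0, 6.0, 8.0]      # self-report tracks steering
unresponsive = [5.0, 5.2, 4.9, 5.1, 4.8]    # self-report does not track steering

print(steering_check(alphas, responsive))
print(steering_check(alphas, unresponsive))
```

A probe that separates classes well but fails a check like this on its own direction would be flagged as suspect, which is the situation the quote describes.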

Source paper

extracted_from
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
(2026) · Nicolas Martorell · Bruno Bianchi

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.