quote

active

If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable.

Key methodological insight: introspection enables a new probe-validation criterion beyond conventional separation metrics.
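The criterion above can be illustrated with a minimal sketch. It assumes the experiment yields pairs of (steering strength along the probe direction, numeric self-report) and that a valid probe should produce a positive slope between the two. The function name `steering_sign_test`, its parameters, and the toy numbers are hypothetical illustrations, not the source paper's procedure or data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SignTestResult:
    slope: float   # least-squares slope of self-report vs. steering strength
    passes: bool   # True if the report moves in the expected direction


def steering_sign_test(
    strengths: Sequence[float],
    reports: Sequence[float],
    expected_sign: int = 1,
    min_abs_slope: float = 0.0,
) -> SignTestResult:
    """Check whether self-report shifts in the expected direction under steering.

    Hypothetical helper: fits an ordinary least-squares line and compares the
    sign of its slope with `expected_sign` (+1 or -1).
    """
    if len(strengths) != len(reports) or len(strengths) < 2:
        raise ValueError("need at least two paired observations")
    n = len(strengths)
    mean_x = sum(strengths) / n
    mean_y = sum(reports) / n
    sxx = sum((x - mean_x) ** 2 for x in strengths)
    if sxx == 0:
        raise ValueError("steering strengths must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(strengths, reports))
    slope = sxy / sxx
    # The probe passes only if the signed slope clears the threshold.
    passes = expected_sign * slope > min_abs_slope
    return SignTestResult(slope=slope, passes=passes)


# Toy numbers for illustration only (not results from the paper).
strengths = [-2.0, -1.0, 0.0, 1.0, 2.0]
good = steering_sign_test(strengths, [1.0, 2.0, 3.0, 4.0, 5.0])
inverted = steering_sign_test(strengths, [5.0, 4.0, 3.0, 2.0, 1.0])
print(good.passes, inverted.passes)
```

A probe whose steering produces the `inverted` pattern would be flagged as suspect even if its separation metrics looked acceptable. A real analysis would likely also want a significance test on the slope, which this sketch omits.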

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The steering-sign test functions as a practical probe-validation criterion: when steering produces inverted report changes, probe quality is suspect. · claim · 0.830
Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
When self-report changes significantly while a linear probe stays flat, is the probe misspecified or the self-report spurious? · question · 0.824
Key interpretive question the framework helps address through convergent validation logic
When probe and self-report agree and move together causally, confidence in both increases, since this is evidence that they track the same underlying state. · claim · 0.819
Convergent validity logic applied to LLM interpretability; probes validate self-reports and vice versa
Concept steering, evaluated with a target vs. off-target probe area metric, reveals three operational regimes (selectively steerable, encoded but entangled, non-encoded) across SleepFM, REVE, and LaBraM. · finding · 0.812
Result categorizing concept steerability into three distinct regimes.
Models are not merely tracking dialogue context features; same-concept steering shows that privileged internal access is necessary to explain self-report shifts. · claim · 0.811
Addresses skeptical alternative that reports reflect only conversational content
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.803
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
A probe may achieve high performance even on representations that are not causally relevant for the task. · claim · 0.802
Key interpretive claim from Case Study II distinguishing probe accuracy from causal relevance
Feature steering was effective in 5 out of 7 cases where few-shot probe steering vectors failed to produce meaningful behavior change. · finding · 0.801
Empirical comparison showing advantage of SAE features in low-data regime.