Logistic regression correctness probe

A logistic regression probe trained on the GSM8K training set to predict answer correctness from projection features along the reflection direction
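A minimal sketch of such a probe, assuming each example is reduced to one projection feature per layer and labeled 1 when the answer is correct. The data below is synthetic, the layer count and the sign of the class separation are arbitrary, and the helper names (`fit_logistic_probe`, `predict_correct`) are hypothetical. The source does not specify the training procedure, so plain batch gradient descent stands in for it here.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic_probe(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            # Gradient of the log loss w.r.t. the logit is (prediction - label).
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_correct(w, b, x):
    # True when the probe assigns probability >= 0.5 to "answer is correct".
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5

# Synthetic stand-in for per-layer projection features (NOT data from the source).
rng = random.Random(0)
NUM_LAYERS = 4  # hypothetical; a real model has many more layers
features, labels = [], []
for i in range(200):
    y = i % 2  # 1 = correct answer, 0 = incorrect answer
    mu = -0.3 if y == 1 else 0.3  # arbitrary class separation for the toy data
    features.append([rng.gauss(mu, 0.1) for _ in range(NUM_LAYERS)])
    labels.append(y)

w, b = fit_logistic_probe(features, labels)
accuracy = sum(
    predict_correct(w, b, x) == bool(y) for x, y in zip(features, labels)
) / len(labels)
```

In practice a library implementation with regularization and a held-out evaluation split would be the more usual choice; the hand-written loop is only meant to make the fitted quantities explicit.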

Neighborhood — ranked by edge-count

framework

ReflCtrl
uses
The proposed framework for probing and steering self-reflection behavior in reasoning LLMs via representation engineering

method

Logistic Regression Probe
related_to
Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
Cosine projection on reflection direction
uses
Feature extraction method that computes the cosine similarity between hidden representations and the reflection direction at every layer
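The feature extraction described above might look like the following sketch. It assumes one hidden vector and one reflection-direction vector per layer; the source does not say which token position is used or whether the direction differs by layer, so both are assumptions, as is the helper name `projection_features`.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def projection_features(hidden_by_layer, direction_by_layer):
    """Return one cosine feature per layer."""
    return [
        cosine_similarity(h, r)
        for h, r in zip(hidden_by_layer, direction_by_layer)
    ]

# Toy 2-dimensional vectors for three layers (illustrative values only).
hidden = [[1.0, 0.0], [1.0, 1.0], [0.0, -2.0]]
directions = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
layer_features = projection_features(hidden, directions)
# layer_features is approximately [1.0, 0.707, -1.0]
```

Because cosine similarity normalizes both vectors, each feature lies in [-1, 1] regardless of the scale of the hidden states at a given layer.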

hypothesis

Reasoning LLMs trigger reflection when their internal uncertainty is high
supports
Core hypothesis linking internal uncertainty to self-reflection behavior, tested via probing experiments

claim

Model's uncertainty information is encoded in the reflection direction
supports
Interpretive claim from probing experiment showing reflection direction features outperform baseline for uncertainty prediction

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

logistic fitting for shot thresholds · method · 0.755
Fit a sigmoid to accuracy vs. k to estimate k50 and phase width.
Probe-Based Data Attribution · method · 0.752
Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
Ridge Regression Probing · method · 0.746
Ridge regression fit on top-256 PCs of Gemini embeddings to predict model layer-40 activations and compute residuals
The logistic fit for threshold behavior is a phenomenological surrogate for interpretability, not a mechanistic derivation · claim · 0.745
Authors' explicit epistemic limitation on the threshold model
Probe-based data attribution for alignment · concept · 0.740
Simple difference-in-mean probes generalize as well as other probing techniques while identifying directions which are more causally implicated in model outputs · claim · 0.739
Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
Logistic regression fails to identify the true feature direction when a confounding feature is non-orthogonal to the truth direction, converging instead to the maximum margin separator · claim · 0.738
Motivates the introduction of mass-mean probing as an alternative to LR
5-fold Cross-Validated Logistic Regression AUC · method · 0.737
Classification-based comparison of interpretation abilities across IIT metrics and Span Representation for ToM score categories.