method
active
method:logistic-regression-correctness-probe
Logistic regression correctness probe
Logistic regression trained on the GSM8K training set to predict answer correctness from projection features along the reflection direction
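A minimal sketch of such a probe, assuming one projection feature per layer and a binary correctness label per example. The data here is synthetic stand-in data, not GSM8K; the layer count, the sign and size of the shift between correct and incorrect examples, and the helper names (`fit_logistic_regression`, `predict_proba`, `make_example`) are all illustrative assumptions rather than details from the source.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example: p(correct | x) - label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability that the answer is correct, given projection features x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Synthetic stand-in for projection features (hypothetical, NOT real data):
# each example has one feature per layer, and the shift direction between
# correct and incorrect examples is chosen arbitrarily for illustration.
rng = random.Random(0)
NUM_LAYERS = 4


def make_example(correct):
    shift = -0.5 if correct else 0.5
    return [rng.gauss(shift, 0.3) for _ in range(NUM_LAYERS)]


labels = [i % 2 for i in range(200)]
features = [make_example(bool(c)) for c in labels]

w, b = fit_logistic_regression(features, labels)
train_accuracy = sum(
    (predict_proba(w, b, x) >= 0.5) == bool(c) for x, c in zip(features, labels)
) / len(labels)
```

In practice the rows of `features` would be replaced by projection features computed from model activations, and accuracy would be measured on held-out examples rather than on the training data.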
Neighborhood — ranked by edge-count
Frameworks (1)
framework
- ReflCtrl · uses · The proposed framework for probing and steering self-reflection behavior in reasoning LLMs via representation engineering
Methods (2)
method
- Logistic Regression Probe · related_to · Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
- Feature extraction method computing cosine similarity of hidden representations with reflection direction across all layers
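The feature extraction described in the last item can be sketched as follows. This assumes one hidden-state vector and one reflection-direction vector per layer; whether the direction is shared across layers or estimated per layer is not stated here, so the per-layer pairing, the toy vectors, and the function names are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector has no direction.
        return 0.0
    return dot / (norm_u * norm_v)


def projection_features(hidden_states, reflection_directions):
    """Return one cosine-similarity feature per layer."""
    if len(hidden_states) != len(reflection_directions):
        raise ValueError("expected one reflection direction per layer")
    return [
        cosine_similarity(h, d)
        for h, d in zip(hidden_states, reflection_directions)
    ]


# Toy 3-layer example with 3-dimensional hidden states (hypothetical values).
hidden_states = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0]]
reflection_directions = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]

features = projection_features(hidden_states, reflection_directions)
# Aligned, opposed, and orthogonal layers give 1.0, -1.0, and 0.0.
```

Because cosine similarity normalizes both vectors, each feature lies in [-1, 1] regardless of activation scale, which keeps features from different layers on a comparable range.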
Hypotheses (1)
hypothesis
- Core hypothesis linking internal uncertainty to self-reflection behavior, tested via probing experiments
Claims (1)
claim
- Interpretive claim from probing experiment showing that reflection-direction features outperform the baseline for uncertainty prediction
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Fit a sigmoid to accuracy vs. k to estimate k50 and phase width.
- Linear classifier approach applied to model activations to identify which training datapoints caused undesired behaviors in post-training.
- Ridge regression fit on top-256 PCs of Gemini embeddings to predict model layer-40 activations and compute residuals
- Authors' explicit epistemic limitation on the threshold model
- Key methodological claim: MM probes are both competitive in accuracy and superior in causal influence
- Motivates the introduction of mass-mean probing as an alternative to LR
- Classification-based comparison of interpretation abilities across IIT metrics and Span Representation for ToM score categories.