Ridge Regression Probing

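Ridge regression fit on the top 256 principal components (PCs) of Gemini embeddings to predict the target model's layer-40 activations; the residuals (actual minus predicted activations) are kept for probe construction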
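The description above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the original implementation: the function name `ridge_residuals`, the regularization strength `alpha`, the mean-centering, and the synthetic array shapes are all hypothetical. Only the use of the top 256 PCs and the residual computation come from the entry itself.

```python
import numpy as np

def ridge_residuals(embeddings, activations, n_components=256, alpha=1.0):
    """Regress activations on the top PCs of embeddings and return the residuals.

    embeddings:  (n_samples, d_embed) array, e.g. text embeddings
    activations: (n_samples, d_model) array, e.g. layer-40 activations
    alpha:       ridge penalty (assumed value; the source does not state one)
    """
    # Center both sides so the intercept is handled by the means.
    emb_centered = embeddings - embeddings.mean(axis=0)
    act_mean = activations.mean(axis=0)
    act_centered = activations - act_mean

    # PCA via SVD: rows of vt are principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(emb_centered, full_matrices=False)
    k = min(n_components, vt.shape[0])
    pcs = emb_centered @ vt[:k].T  # (n_samples, k) PC scores

    # Closed-form ridge solution: W = (Z^T Z + alpha I)^{-1} Z^T Y
    gram = pcs.T @ pcs + alpha * np.eye(k)
    weights = np.linalg.solve(gram, pcs.T @ act_centered)

    predicted = pcs @ weights + act_mean
    residuals = activations - predicted  # what the embeddings do not explain
    return residuals, weights

# Synthetic stand-ins for real embeddings and activations (hypothetical sizes).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(500, 768))
fake_activations = rng.normal(size=(500, 64))
residuals, weights = ridge_residuals(fake_embeddings, fake_activations)
```

The residuals have the same shape as the activations, so they can be fed to any downstream step that would otherwise consume raw activations.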

Neighborhood — ranked by edge-count

concept

layer 40 residual-stream activations
about
The specific neural network layer from which activations are extracted for probe construction and SAE training in the target models

method

Ridge regression probe construction
related_to
Method used to predict model activations from Gemini embeddings and compute residuals for probe construction
Emotion Probe Construction Method
uses
Method for building 171 emotion probes by generating stories, embedding them, regressing out Gemini embeddings, and averaging residual activations per emotion
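The final step of the emotion-probe method, averaging residual activations per emotion, might look like the sketch below. It assumes the residuals have already been computed and that each row carries an emotion label. The function name `mean_residual_probes` and the toy labels are hypothetical, and the story-generation and embedding steps are not shown.

```python
import numpy as np

def mean_residual_probes(residuals, labels):
    """Average residual vectors per label.

    residuals: (n_samples, d_model) array of residual activations
    labels:    length-n_samples sequence of emotion names
    Returns a dict mapping each label to its mean residual vector.
    """
    labels = np.asarray(labels)
    probes = {}
    for label in np.unique(labels):
        mask = labels == label
        probes[str(label)] = residuals[mask].mean(axis=0)
    return probes

# Toy example with two made-up emotions and 2-dimensional residuals.
toy_residuals = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
toy_labels = ["joy", "joy", "fear"]
toy_probes = mean_residual_probes(toy_residuals, toy_labels)
```

Each resulting vector can then be used as a direction in activation space, one per emotion.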

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Logistic Regression Probe · method · 0.800
Standard linear probing technique; compared to mass-mean probing for classification accuracy and causal implication
Ridge Regression on Message Embeddings · method · 0.793
Predicting Assistant Axis projections from L2-normalized Qwen 3 0.6B embeddings of user messages via ridge regression
Logistic regression correctness probe · method · 0.746
Logistic regression trained on GSM8k training set to predict answer correctness from projection features along reflection direction
Probing Methods · method · 0.743
Top-down interpretability approach studying linguistic properties at various residual stream stages; contrasted with the paper's bottom-up mechanistic approach
Diagnostic Probing · method · 0.723
Earlier interpretability method applying classifiers to DNN hidden representations; shares complexity-accuracy dilemma with causal abstraction
Linear Probing · method · 0.723
Used to evaluate representation quality across VTAB tasks
Sparse Probing · method · 0.722
Method from Gurnee et al. 2023 for finding feature directions including individual neuron analysis
base model probing · method · 0.721
Method of using base models (no post-training) to observe spontaneous self-referential behaviors without confound of memorized introspection language.