Ridge Regression on Message Embeddings

Predicting Assistant Axis projections from L2-normalized Qwen 3 0.6B embeddings of user messages via ridge regression
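The setup described above can be sketched roughly as follows. This is a minimal illustration, not the original pipeline: the random vectors stand in for Qwen 3 0.6B message embeddings, the synthetic targets stand in for Assistant Axis projections, and the names `l2_normalize`, `fit_ridge`, `r2_score` and the `alpha` value are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Scale each row to unit L2 norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    d = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def r2_score(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (assumption): real inputs would be message embeddings
# and the scalar axis projection measured for each subsequent model response.
rng = np.random.default_rng(0)
n, d = 400, 32
embeddings = l2_normalize(rng.normal(size=(n, d)))
true_w = rng.normal(size=d)
projections = embeddings @ true_w + 0.05 * rng.normal(size=n)

# Fit on the first 300 rows, evaluate on the remaining 100.
train, test = slice(0, 300), slice(300, n)
w, b = fit_ridge(embeddings[train], projections[train], alpha=1e-3)
pred = embeddings[test] @ w + b
test_r2 = r2_score(projections[test], pred)
```

In practice the regularization strength would typically be chosen by cross-validation, and R² would be reported on held-out messages only, since in-sample fit overstates predictive power when the embedding dimension is large relative to the number of messages.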

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Ridge Regression Probing · method · 0.793
Ridge regression fit on top-256 PCs of Gemini embeddings to predict model layer-40 activations and compute residuals
Ridge regression probe construction · method · 0.784
Method used to predict model activations from Gemini embeddings and compute residuals for probe construction
User message embeddings predict subsequent model Assistant Axis projection with R² = 0.53-0.77 (p < 0.001) but predict delta from previous response with only R² = 0.10 · finding · 0.734
Shows model persona position is primarily determined by the most recent user message, not prior drift
Input Embedding Similarity Baseline · method · 0.705
Baseline method for instruction discovery using surface-level input embedding similarity instead of steering vectors.
Message Passing Inference · concept · 0.701
Algorithmic framework for probabilistic inference in graphical models.
span embedding analysis · method · 0.695
Extracting embeddings from instruction and example spans.
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.693
Core methodology paper for SAE-based interpretable feature extraction
learnable hidden embeddings · concept · 0.690
The component used in latent reasoning to perform internal computation.
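
The selection rule behind this list (cosine ≥ 0.65, no typed edge) could be sketched as below. The entity names, toy vectors, and the `untyped_neighbors` helper are hypothetical, and sorting by cosine score is an assumption made for the example rather than the page's documented ranking.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def untyped_neighbors(target, vectors, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it."""
    hits = []
    for name, vec in vectors.items():
        # Skip the target itself and anything already linked by a typed edge.
        if name == target or name in typed_edges:
            continue
        score = cosine(vectors[target], vec)
        if score >= threshold:
            hits.append((name, score))
    # Highest similarity first.
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Toy data (assumption): three-dimensional vectors instead of real embeddings.
vectors = {
    "target": [1.0, 0.0, 0.0],
    "near_untyped": [0.9, 0.1, 0.0],
    "near_typed": [0.8, 0.2, 0.0],
    "far": [0.0, 1.0, 0.0],
}
typed_edges = {"near_typed"}
neighbors = untyped_neighbors("target", vectors, typed_edges)
```

Here only `near_untyped` survives: `near_typed` is similar but already linked, and `far` falls below the threshold.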