Probing classifiers: Promises, shortcomings, and advances

By Yonatan Belinkov

DOI: 10.1162/coli_a_00422 · arXiv: 2102.12452

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

Probing Network Decisions: Capturing Uncertainties and Unveiling Vulnerabilities Without Label Information
Youngju Joung, Sehyun Lee, Jaesik Choi
2025
≈ 77%
Probing Classifiers are Unreliable for Concept Removal and Detection
Abhinav Kumar, Chenhao Tan, Amit Sharma
2023
≈ 75%
Flexible text generation for counterfactual fairness probing
Zee Fryer, Vera Axelrod, Ben Packer, Alex Beutel, Jilin Chen, Kellie Webster
2022
≈ 75%
Probing Task-Oriented Dialogue Representation from Language Models
Chien-Sheng Wu and Caiming Xiong
2020
≈ 75%
ShortcutProbe: Probing Prediction Shortcuts for Learning Robust Models
Guangtao Zheng, Wenqian Ye, Aidong Zhang
2025
≈ 74%
Black Box to White Box: Discover Model Characteristics Based on Strategic Probing
Josh Kalin, Matthew Ciolino, David Noever, Gerry Dozier
2020
≈ 74%
Is Probing All You Need? Indicator Tasks as an Alternative to Probing Embedding Spaces
Tal Levy, Omer Goldman, Reut Tsarfaty
2023
≈ 74%
When Respondents Don't Care Anymore: Identifying the Onset of Careless Responding
Max Welz and Andreas Alfons
2026
≈ 74%
Clarifying the Path to User Satisfaction: An Investigation into Clarification Usefulness
Hossein A. Rahmani, Xi Wang, Mohammad Aliannejadi, Mohammadmehdi Naghiaei, Emine Yilmaz
2026
≈ 74%
Algorithm Selection with Probing Trajectories: Benchmarking the Choice of Classifier Model
Quentin Renau and Emma Hart
2025
≈ 74%
Sentiment analysis is not solved! Assessing and probing sentiment classification
Jeremy Barnes, Lilja Øvrelid, Erik Velldal
2019
≈ 73%
Probing for Knowledge Attribution in Large Language Models
Ivo Brink, Alexander Boer, Dennis Ulmer
2026
≈ 73%
Hallucinations Live in Variance
Aaron R. Flouro and Shawn P. Chadwick
2026
≈ 73%
DirectProbe: Studying Representations without Classifiers
Yichu Zhou and Vivek Srikumar
2021
≈ 73%
Probing Structural Mathematical Reasoning in Language Models with Algebraic Trapdoors
Igor Rivin
2026
≈ 73%
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
in corpus
2026
≈ 70%
Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
in corpus
2026
≈ 70%
Active inference on discrete state-spaces: a synthesis
in corpus
2020
≈ 68%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 68%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
in corpus
2026
≈ 68%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 67%
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
in corpus
2025
≈ 67%
Simulators — LessWrong
in corpus
≈ 67%
Exploration Through Introspection: A Self-Aware Reward Model
in corpus
2026
≈ 67%
Active Inference, Curiosity and Insight
in corpus
2017
≈ 67%
The Platonic Representation Hypothesis
in corpus
2024
≈ 67%
Unveiling the Latent Directions of Reflection in Large Language Models
in corpus
2025
≈ 67%
Generalizing frameworks for sentience beyond natural species
in corpus
≈ 67%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 67%

Cited by (9)

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and cau
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Applying TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG foundation models — SleepFM, REVE, and LaBraM — reveals that clinical concepts are not cleanly separable in these models'
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Reasoning models generate chains of thought that are frequently performative rather than causally necessary for reaching the correct answer: on MMLU recall questions, activation probes decode the mode
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as