paper:belinkov-probing-classifiers-promises-shortcoming-2022
Probing classifiers: Promises, shortcomings, and advances
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Probing Network Decisions: Capturing Uncertainties and Unveiling Vulnerabilities Without Label Information. Sehyun Lee, Jaesik Choi, Youngju Joung. 2025. ≈ 77%
- Probing Classifiers are Unreliable for Concept Removal and Detection. Chenhao Tan, Amit Sharma, Abhinav Kumar. 2023. ≈ 75%
- Flexible text generation for counterfactual fairness probing. Vera Axelrod, Ben Packer, Alex Beutel, Jilin Chen, Kellie Webster, Zee Fryer. 2022. ≈ 75%
- Probing Task-Oriented Dialogue Representation from Language Models. Chien-Sheng Wu and Caiming Xiong. 2020. ≈ 75%
- ShortcutProbe: Probing Prediction Shortcuts for Learning Robust Models. Wenqian Ye, Aidong Zhang, Guangtao Zheng. 2025. ≈ 74%
- Black Box to White Box: Discover Model Characteristics Based on Strategic Probing. Matthew Ciolino, David Noever, Gerry Dozier, Josh Kalin. 2020. ≈ 74%
- Is Probing All You Need? Indicator Tasks as an Alternative to Probing Embedding Spaces. Omer Goldman, Reut Tsarfaty, Tal Levy. 2023. ≈ 74%
- When Respondents Don't Care Anymore: Identifying the Onset of Careless Responding. Max Welz and Andreas Alfons. 2026. ≈ 74%
- Clarifying the Path to User Satisfaction: An Investigation into Clarification Usefulness. Xi Wang, Mohammad Aliannejadi, Mohammadmehdi Naghiaei, Emine Yilmaz, Hossein A. Rahmani. 2026. ≈ 74%
- Algorithm Selection with Probing Trajectories: Benchmarking the Choice of Classifier Model. Quentin Renau and Emma Hart. 2025. ≈ 74%
- Sentiment analysis is not solved! Assessing and probing sentiment classification. Lilja Øvrelid, Erik Velldal, Jeremy Barnes. 2019. ≈ 73%
- Probing for Knowledge Attribution in Large Language Models. Alexander Boer, Dennis Ulmer, Ivo Brink. 2026. ≈ 73%
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training. In corpus. 2026. ≈ 70%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. In corpus. 2023. ≈ 68%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders. In corpus. 2026. ≈ 67%
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs. In corpus. 2025. ≈ 67%
- Simulators (LessWrong). In corpus. ≈ 67%
- Active Inference, Curiosity and Insight. In corpus. 2017. ≈ 67%
- The Platonic Representation Hypothesis. In corpus. 2024. ≈ 67%
Similar preprints — Semantic Scholar
Cited by (9)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and cau
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Applying TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG foundation models — SleepFM, REVE, and LaBraM — reveals that clinical concepts are not cleanly separable in these models'
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie
- Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Reasoning models generate chains of thought that are frequently performative rather than causally necessary for reaching the correct answer: on MMLU recall questions, activation probes decode the mode
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Llama-3.1-8B solves cyclic arithmetic (e.g., "what month is six months after August?") not by performing modular addition in the period of the cyclic concept (12 for months, 7 for days of the week) as