community
active
leiden_hybrid_papers
label: sonnet
community:leiden_hybrid_papers-run1-c1
LLM Interpretability & Behavioral Analysis
Methods for probing, explaining, and evaluating internal representations and behaviors of large language models.
12 members.
Drawn from 12 sources
The papers/notes whose extracted claims & findings make up this cluster.
- Verbalized Eval Awareness Inflates Measured Safety (1 member)
- Koan Battery: Measuring Reflective Mode Accessibility in AI (1 member)
- Interpreting Language Model Parameters (1 member)
- Covariance-based Sequence Pooling (1 member)
- Dual-Balancing for Multi-Task Learning (1 member)
- Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (1 member)
- SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents (1 member)
- Paper Summary: Interpreting Language Model Parameters (1 member)
- Explaining 4.2 million genetic variants with state-of-the-art, interpretable predictions (1 member)
- Active inference: demystified and compared (1 member)
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training (1 member)
- Multimodal Chain-of-Thought Reasoning in Language Models (1 member)
Bridges (5)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
- LLM interpretability & self-awareness (11 shared)
- LLM Introspection (4 shared)
- Concept formation and meaning-making (1 shared)
- Neural Geometry (1 shared)
- Neural Steering Methods (1 shared)
Papers (12)
- Interpreting Language Model Parameters
- Active inference: demystified and compared
- Covariance-based Sequence Pooling
- Explaining 4.2 million genetic variants with state-of-the-art, interpretable predictions
- Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training
- Verbalized Eval Awareness Inflates Measured Safety
- Koan Battery: Measuring Reflective Mode Accessibility in AI
- Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
- Dual-Balancing for Multi-Task Learning
- Multimodal Chain-of-Thought Reasoning in Language Models
- Paper Summary: Interpreting Language Model Parameters
- SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents