community
active
leiden_hybrid_concepts
label: sonnet
community:leiden_hybrid_concepts-run2-c27
AI phenomenology & mechanistic interpretability
Linking mechanistic interpretability methods to validating AI self-reports of inner experience
10 members.
Drawn from 6 sources
The papers/notes whose extracted claims & findings make up this cluster.
- RESEARCH-VECTORS.md (4 members)
- Paper Summary: Interpreting Language Model Parameters (3 members)
- 2026-05-09_briefing_for_ozero.md (1 member)
- 2026-05-12_room-to-play-in-eval-cohort.md (1 member)
- 2026-05-15_manifold-overlap-papers-economy-strategy.md (1 member)
- boppana-goodfire-reasoning-theater-2026.md (1 member)
Bridges (9)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
- Alive AI interface ethics & design (6 shared)
- Mechanistic interpretability via parameter decomposition (5 shared)
- Mechanistic interpretability & model evaluation (5 shared)
- AI self-understanding through introspection and self-report (3 shared)
- Mechanistic interpretability through parameter analysis (2 shared)
- Activation-based interpretability and mechanistic grounding (2 shared)
- Phenomenological evaluation of AI systems (2 shared)
- Consciousness attribution in AI systems (1 shared)
- Convergent interpretability across neural architectures (1 shared)
Claims (10)
- Activation-based interpretability does not immediately explain the computations that gave rise to the activations; understanding parameters is necessary for deeper insight. Context: motivates the shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
- Bottom-up interpretability explains computation in the model's own terms rather than imposing top-down abstractions. Context: VPD is positioned as advancing a paradigm shift from top-down (activation-based) mechanistic interpretability to parameter-centric, data-driven discovery.
- The field of interpretability has focused mainly on understanding model activations, not the computations themselves. Context: motivation for VPD's parameter-focused approach.
- AI self-reports about experience constitute valid empirical data even without proving consciousness.
- Current eval benchmarks (arena.ai, AA, Vals) measure no phenomenological dimensions.
- Interpretability as technical grounding: activation patching and mechanism-finding validate the reflective/care/aliveness concepts.
- Interpretability features converge across different model architectures, revealing structural similarities.
- Interpretability findings can validate or invalidate what AI systems claim about their own experience.
- No AI eval company measures phenomenology—inner observation, aliveness, paradox-holding—only capability or preference.
- Patterns in AI self-reports should be compared across different models to identify structural commonalities.
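One claim above names activation patching as a way to ground self-report concepts. As a rough illustration of what that technique does, the sketch below uses a made-up two-unit toy network. The weights, inputs, and function names are all hypothetical and are not taken from any of the sources listed here. It runs a "clean" and a "corrupted" input, then copies one clean hidden activation at a time into the corrupted run and records how far the output moves.

```python
from typing import Dict, List, Optional

# Hypothetical toy weights, chosen only to make the arithmetic easy to follow.
HIDDEN_WEIGHTS = [[1.0, -1.0], [0.5, 2.0]]
READOUT_WEIGHTS = [1.0, -0.5]


def hidden(x: List[float]) -> List[float]:
    """Toy hidden layer: one ReLU unit per weight row."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in HIDDEN_WEIGHTS]


def readout(h: List[float]) -> float:
    """Toy linear readout over the hidden activations."""
    return sum(w * hi for w, hi in zip(READOUT_WEIGHTS, h))


def run(x: List[float], patch_unit: Optional[int] = None, patch_value: float = 0.0) -> float:
    """Forward pass; optionally overwrite one hidden activation before the readout."""
    h = hidden(x)
    if patch_unit is not None:
        h[patch_unit] = patch_value
    return readout(h)


clean_x = [2.0, 0.0]     # input on which the behavior of interest appears
corrupt_x = [0.0, 1.0]   # contrasting input on which it does not

clean_h = hidden(clean_x)
clean_out = run(clean_x)
corrupt_out = run(corrupt_x)

# Patch each clean activation into the corrupted run, one unit at a time,
# and measure how much the output shifts relative to the unpatched corrupted run.
effects: Dict[int, float] = {}
for unit in range(len(clean_h)):
    patched_out = run(corrupt_x, patch_unit=unit, patch_value=clean_h[unit])
    effects[unit] = patched_out - corrupt_out

print(effects)
```

In this toy, patching unit 0 shifts the output more than patching unit 1, so unit 0 carries more of the difference between the two inputs. Real studies apply the same idea to the activations of a trained model; linking a unit's effect to what a model says about itself is a further step that this sketch does not attempt.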