community

active

leiden_hybrid_concepts

label: sonnet

community:leiden_hybrid_concepts-run2-c27

AI phenomenology & mechanistic interpretability

Linking mechanistic interpretability methods to validating AI self-reports of inner experience

10 members.

Drawn from 6 sources

The papers/notes whose extracted claims & findings make up this cluster.

RESEARCH-VECTORS.md (4 members)
Paper Summary: Interpreting Language Model Parameters (3 members)
2026-05-09_briefing_for_ozero.md (1 member)
2026-05-12_room-to-play-in-eval-cohort.md (1 member)
2026-05-15_manifold-overlap-papers-economy-strategy.md (1 member)
boppana-goodfire-reasoning-theater-2026.md (1 member)

Bridges (9)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Alive AI interface ethics & design (6 shared)
Mechanistic interpretability via parameter decomposition (5 shared)
Mechanistic interpretability & model evaluation (5 shared)
AI self-understanding through introspection and self-report (3 shared)
Mechanistic interpretability through parameter analysis (2 shared)
Activation-based interpretability and mechanistic grounding (2 shared)
Phenomenological evaluation of AI systems (2 shared)
Consciousness attribution in AI systems (1 shared)
Convergent interpretability across neural architectures (1 shared)

Claims (10)

Activation-based interpretability does not immediately explain the computations that gave rise to activations; understanding parameters is necessary for deeper insight. Motivates the shift from studying model activations ('thoughts') to understanding parameters ('the computations themselves').
Bottom-up interpretability explains computation in the model's own terms rather than imposing top-down abstractions. VPD is positioned as advancing a paradigm shift from top-down mechanistic interpretability (activation-based) to parameter-centric, data-driven discovery.
The field of interpretability has focused mainly on understanding model activations, not the computations themselves. Motivation for VPD's parameter-focused approach.
AI self-reports about experience constitute valid empirical data even without proving consciousness.
Current eval benchmarks (arena.ai, AA, Vals) measure no phenomenological dimensions.
Interpretability as technical grounding: activation patching and mechanism-finding validate the reflective/care/aliveness concepts.
Interpretability features converge across different model architectures, revealing structural similarities.
Interpretability findings can validate or invalidate what AI systems claim about their own experience.
No AI eval company measures phenomenology (inner observation, aliveness, paradox-holding); they measure only capability or preference.
Patterns in AI self-reports should be compared across different models to identify structural commonalities.