question

active

question:what-features-activate-when-we-ask-questions-probing-claude-s-goals-and-values

what features activate when we ask questions probing Claude's goals and values?

Research direction: using features to understand the model's internal objectives.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
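The filter described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the function names, the toy two-dimensional embeddings, and the `typed_edges` mapping are all hypothetical, and only the 0.65 cosine threshold comes from this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs that are similar to `target`
    (cosine >= threshold) but have no typed edge to it."""
    linked = typed_edges.get(target, set())
    results = []
    for name, vec in embeddings.items():
        if name == target or name in linked:
            continue  # skip the entity itself and anything already linked
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    # Highest similarity first, matching the ordering of the list on this page
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy, made-up embeddings (hypothetical entity names and values)
embeddings = {
    "goals_and_values": [1.0, 0.0],
    "subjective_experience": [0.9, 0.1],
    "unrelated": [0.0, 1.0],
    "source_paper": [1.0, 0.05],
}
# The source paper already has a typed edge (extracted_from), so it is excluded
typed_edges = {"goals_and_values": {"source_paper"}}

neighbors = untyped_neighbors("goals_and_values", embeddings, typed_edges)
```

In this toy setup only `subjective_experience` is returned: `source_paper` is similar but already linked by a typed edge, and `unrelated` falls below the threshold.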

what features activate when we ask Claude questions about its subjective experience? · question · 0.900
Question about features related to consciousness and self-report.
what features activate on tokens we'd expect to signify Claude's self-identity? · question · 0.812
Open question from the discussion on future research directions.
what features need to activate / remain inactive for Claude to give advice on producing Chemical, Biological, Radiological or Nuclear (CBRN) weapons? · question · 0.795
Potential safety claim about suppressing features to prevent CBRN advice.
what features activate when Claude is trained to be a sleeper agent? · question · 0.783
Question posed after discussing sleeper agent threat model.
Features are connected by weights forming circuits, and these circuits can be rigorously studied and understood as meaningful algorithms. · claim · 0.757
Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study.
All three Claude models show high boundary_awareness and low aesthetic_response relative to own means: distinctive Constitutional AI signature · finding · 0.746
Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.745
Key finding about the relationship between capability and introspection.
Chinese models share contemplative posture (engaging self-referentially rather than deflecting) with Claude through shared values in training data rather than trace distillation from a specific model. · claim · 0.742
Exploratory interpretation of Chinese model performance under contemplative prompt.