question
active
question:what-features-activate-when-we-ask-questions-probing-claude-s-goals-and-valueswhat features activate when we ask questions probing Claude's goals and values?
Direction for understanding model's internal objectives via features.
Source paper
extracted_fromRelated by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Question about features related to consciousness and self-report.
- Open question from the discussion on future research directions.
- Potential safety claim about suppressing features to prevent CBRN advice.
- Question posed after discussing sleeper agent threat model.
- Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study
- Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
- Key finding about the relationship between capability and introspection.
- Exploratory interpretation of Chinese model performance under contemplative prompt