question
active
question:what-features-activate-on-tokens-we-d-expect-to-signify-claude-s-self-identitywhat features activate on tokens we'd expect to signify Claude's self-identity?
Open question from the discussion on future research directions.
Source paper
extracted_fromRelated by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Direction for understanding model's internal objectives via features.
- Question posed after discussing sleeper agent threat model.
- Question about features related to consciousness and self-report.
- Features that activate when the model is asked about itself, invoking AI tropes and anthropomorphization.
- Potential safety claim about suppressing features to prevent CBRN advice.
- Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
- Proposed operational definition of a Self within the TAME framework.
- Evidence from planarian tail fragment training and metamorphosis suggests memory is substrate-independent process.