question

active

question:what-features-activate-on-tokens-we-d-expect-to-signify-claude-s-self-identity

what features activate on tokens we'd expect to signify Claude's self-identity?

Open question from the discussion on future research directions.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

what features activate when we ask questions probing Claude's goals and values? · question · 0.812
Direction for understanding model's internal objectives via features.
what features activate when Claude is trained to be a sleeper agent? · question · 0.806
Question posed after discussing sleeper agent threat model.
what features activate when we ask Claude questions about its subjective experience? · question · 0.783
Question about features related to consciousness and self-report.
Self-identity features · concept · 0.756
Features that activate when the model is asked about itself, invoking AI tropes and anthropomorphization.
what features need to activate / remain inactive for Claude to give advice on producing Chemical, Biological, Radiological or Nuclear (CBRN) weapons? · question · 0.742
Potential safety claim about suppressing features to prevent CBRN advice.
All three Claude models show high boundary_awareness and low aesthetic_response relative to own means: distinctive Constitutional AI signature · finding · 0.736
Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness.
The basic hallmarks of being a Self are the ability to pursue goals, to own compound memories, and to serve as the locus for credit assignment, at a scale larger than any component. · claim · 0.720
Proposed operational definition of a Self within the TAME framework.
Memories are not immutable markers of identity; they can be transferred between individuals and remapped onto new substrates. · claim · 0.712
Evidence from planarian tail fragment training and metamorphosis suggests memory is a substrate-independent process.