question

active

question:what-features-need-to-activate-remain-inactive-for-claude-to-give-advice-on-producing-chemical-biological-radiological-or-nuclear-cbrn-weapons

what features need to activate / remain inactive for Claude to give advice on producing Chemical, Biological, Radiological or Nuclear (CBRN) weapons?

Potential safety claim about suppressing features to prevent CBRN advice.

Source paper

extracted_from

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

what features activate when we ask questions probing Claude's goals and values? · question · 0.795
Direction for understanding model's internal objectives via features.
what features activate when Claude is trained to be a sleeper agent? · question · 0.789
Question posed after discussing sleeper agent threat model.
what features activate when we ask Claude questions about its subjective experience? · question · 0.756
Question about features related to consciousness and self-report.
what features activate on tokens we'd expect to signify Claude's self-identity? · question · 0.742
Open question from the discussion on future research directions.
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skill · finding · 0.737
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.733
Key finding about the relationship between capability and introspection.
All three Claude models show high boundary_awareness and low aesthetic_response relative to own means — distinctive Constitutional AI signature · finding · 0.731
Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
Claude achieves significantly higher Spearman correlation predicting feature activations vs neuron activations · finding · 0.722
Automated interpretability analysis of activations confirms features are more interpretable than neurons
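
The "Related by similarity" selection described above (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is only an illustration under stated assumptions: the page does not say how embeddings are produced or stored, so the toy vectors, the entity ids, the `typed_edges` set, and the function names `cosine` and `related_by_similarity` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score."""
    results = []
    for other, vec in embeddings.items():
        if other == target:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target, other) in typed_edges or (other, target) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((other, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Hypothetical toy data: ids and 2-d vectors are made up for illustration.
embeddings = {
    "question:cbrn-features": [1.0, 0.0],
    "question:goals-and-values": [1.0, 0.1],
    "finding:unrelated": [0.0, 1.0],
    "paper:scaling-monosemanticity": [1.0, 0.0],
}
# The source paper is linked by a typed edge (extracted_from), so it is excluded.
typed_edges = {("question:cbrn-features", "paper:scaling-monosemanticity")}

candidates = related_by_similarity("question:cbrn-features", embeddings, typed_edges)
for entity_id, score in candidates:
    print(f"{entity_id} · {score:.3f}")
```

Excluding entities that already have a typed edge is what makes the remaining list useful for review: every entry is either a candidate for a new edge or a possible duplicate.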