claim
active
claim:anthropic-is-extremely-conservative-in-writing-up-interpretability-results-due-to-overton-window-concerns
Anthropic is extremely conservative in writing up interpretability results due to Overton window concerns.
Antra's explanation for why even stronger evidence may exist but remains unpublished.
Source paper
extracted_from
Neighborhood (ranked by edge count)
Findings (1)
finding
- Statistically rigorous analysis of Claude introspection; suggests models may have latent introspective capabilities that can be enhanced or disrupted.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Demonstrates non-robust heuristics installed by RL; a slight prompt variation recovers a large gap.
- Cited as activation-level support for the "performing care" vs. "having care" distinction that the battery detects behaviorally.
- Call to extend the inference of sentience to non-biological systems as well.
- Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
- Claude 3 Opus ratings aligned with human judgment of feature descriptions.