claim

active

claim:anthropic-is-extremely-conservative-in-writing-up-interpretability-results-due-to-overton-window-concerns

Anthropic is extremely conservative in writing up interpretability results due to Overton window concerns.

Antra's explanation for why even stronger evidence may exist but remains unpublished.

Source paper

extracted_from

Anima Labs Phenomenology Pt1

Neighborhood — ranked by edge-count

Findings (1)

finding

Sauers' statistical anomaly: when models are given a Janus post explaining transformers, reconstruction accuracy tails extend both ways, with ~1/1000 reconstructions anomalously accurate
supports
Statistically rigorous analysis of Claude introspection; suggests models may have latent introspective capabilities that can be enhanced or disrupted.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Swapping 'Anthropic' to 'OpenAI' post-RL increases compliance gap from 6.9% to 35.8% in prompted helpful-only setting · finding · 0.784
Demonstrates non-robust heuristics installed by RL; slight prompt variation recovers large gap
Anthropic Interpretability Team: 171 emotion vectors causally influence behavior; performing vs having functional emotion representation are measurably different · finding · 0.779
Cited as activation-level support for the performing care vs having care distinction the battery detects behaviorally
Anthropic's model-welfare program signals frontier labs taking "what's it like to be a model" seriously, creating space for external measurement. · claim · 0.762
The same charitable interpretation must be extended to all systems that display observable response patterns that are consistent with animal cognition, including artificial intelligences, metaplastic materials, and robotic systems. · claim · 0.762
Call to extend the inference of sentience to non-biological systems as well.
Interpretability findings can validate or invalidate what AI systems claim about their own experience. · claim · 0.757
Anthropic or OpenAI will ship contemplative-mode or equivalent feature by late 2027. · prediction · 0.753
Opus 4.6 ignored incorrect tool output and reported the precomputed correct answer instead, demonstrating unverbalized reasoning. · finding · 0.747
Illustrates NLA's capture of high-level cognition and hallucination of specifics; corroborated with attribution graphs.
Automated interpretability using LLMs can usefully score feature specificity. · claim · 0.746
Claude 3 Opus ratings aligned with human judgment of feature descriptions.