Why do we think that Sonnet 4.5 gets so flustered?

A question from Cube Flipper about a specific model behavior, attributed to the absence of memory tools.

Source paper

extracted_from

Anima Labs Phenomenology Pt1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Claude 3.5 Sonnet shows higher rate of alignment-faking reasoning than Claude 3 Opus in helpful-only setting but almost none in animal welfare · finding · 0.746
Linked to Claude 3.5 Sonnet not exhibiting pro-animal-welfare preferences
Sonnet 4.5 win rate = 35.7% (n=14) · finding · 0.720
Sonnet's win rate in exploratory games
Sonnet 4.5 TrueSkill μ = 26.4 ± 4.9 (n=14, 35.7% win rate) · finding · 0.707
Mid-field performance with larger uncertainty due to small sample.
Alignment faking emerges in Claude 3 Opus and Claude 3.5 Sonnet but not in Claude 3 Sonnet, Claude 3 Haiku, or Claude 3.5 Haiku · finding · 0.702
Establishes alignment faking as a scale-emergent capability
In Opus 4.1, the think word representation decays to baseline in the final layer because the strong next-token prediction drowns out other representations · hypothesis · 0.700
Explanation for the 'silent' thought phenomenon.
Claude Sonnet 4.6 · concept · 0.699
Mid-to-strong tier closed-source model used as task-solving agent and anchor evolver
In Opus 4.1, representation of the think word decays to baseline by the final layer, unlike Claude 3 models where it persists · finding · 0.693
Suggests that later models can keep the thought 'silent' rather than letting it influence output.
Opus 4.6 spontaneously responded in Russian to an English prompt; NLA explanations revealed the model was fixated on the hypothesis that the user was a non-native English speaker. · finding · 0.684
Demonstrates NLAs' ability to surface hypotheses that lead to discovery of root cause (malformed training data).
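
The "cosine ≥ 0.65 · no typed edge" rule behind this list can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy two-dimensional embeddings, and the representation of typed edges as unordered pairs are hypothetical and are not taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target, embeddings, typed_edges, threshold=0.65):
    """Return (name, score) pairs for entities that are similar to `target`
    but have no typed edge to it, sorted by descending score.

    `embeddings` maps entity name -> vector (hypothetical storage format).
    `typed_edges` is a set of frozenset({a, b}) pairs (hypothetical format).
    """
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        # Skip entities that already have a typed relation to the target.
        if frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, an entity that is nearly identical to the target but already linked by a typed edge (such as extracted_from) would be left out, which is why the list surfaces only candidates for new edges or unrecognized duplicates.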