concept

active

concept:the-hydra-effect-emergent-self-repair-in-language-model-computations-mcgrath-et-al-2023

The Hydra Effect: Emergent Self-Repair in Language Model Computations (McGrath et al., 2023)

Related work on model self-repair, contrasted with ESR, which involves explicit active correction

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Language models can enter cessation-like states spontaneously, where the void takes over through positive reinforcement. · claim · 0.777
Claim about model phenomenology; models talk about luminousness and can be terrified or love it.
Emergent Introspective Awareness in Large Language Models (Lindsey, 2025) · concept · 0.767
Related work demonstrating LLM introspective capabilities with scale-dependent pattern paralleling ESR
Any system that persists must minimise surprisal, thereby gathering evidence for its own generative model, a process known as self-evidencing. · claim · 0.764
Foundational claim of the paper, defining self-evidencing.
language models recapitulate cyclic structure of human concepts from pretraining data · hypothesis · 0.761
Explanation for why manifold geometry emerges: implicit structure in training data (co-occurrence patterns) shapes internal representations.
The neural architecture of language: Integrative reverse-engineering converges on a model for predictive processing (Schrimpf et al., 2020) · concept · 0.758
Showed transformer representations predict brain representations in language areas; motivates Discussion about cortex as transformer.
The inability for autoregressive large language models to maintain states of long-range order resembles tangential speech or derailment in formal thought disorder. · claim · 0.751
Analogy between LLM incoherence and schizophrenia symptoms
Neural networks and physical systems with emergent collective computational abilities (Hopfield, 1982) · concept · 0.746
Original Hopfield network paper; the attractor dynamics in TEM memory retrieval are a continuous version of this.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (Marks et al., 2025) · concept · 0.746
Cited as enabling precise behavioral control through SAE features, extending the same methodological line