concept

active

concept:interpretability-in-the-wild-a-circuit-for-indirect-object-identification-in-gpt-2-small-wang-et-al-2023

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small (Wang et al., 2023)

Cited as a precedent for causal-intervention methodology, supporting the citing paper's ablation approach

Neighborhood — ranked by edge-count

Papers (1)

paper

Endogenous Resistance to Activation Steering in Language Models
cites

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

GPT's ability to simulate text automata is the source of its most surprising and pivotal implications for paths to superintelligence. · claim · 0.766
Importance of recursive generation.
GPT, insofar as it is inner-aligned, is a simulator which can simulate agentic and non-agentic simulacra. · claim · 0.758
Central thesis of the post.
GPT-2 implements at least one induction head using pointer arithmetic on positional embeddings rather than K-composition · hypothesis · 0.751
Observation of an alternative induction head implementation algorithm in larger models with positional embeddings in the residual stream
The field of interpretability has focused mainly on understanding model activations, not the computations themselves · claim · 0.746
Motivation for VPD's parameter-focused approach.
Interpretability as technical grounding: activation patching and mechanism-finding validate the reflective/care/aliveness concepts. · claim · 0.744
The key challenge for active inference is finding the generative model that best explains observable data. · claim · 0.742
Identifies an outstanding problem, Section 10.
Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information · claim · 0.741
Central thesis of the paper
Automated interpretability using LLMs can usefully score feature specificity. · claim · 0.741
Claude 3 Opus ratings aligned with human judgment of feature descriptions.
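
The selection rule described for this list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the page's actual implementation: the function name `related_by_similarity`, the dictionary of embeddings, and the edge-pair representation are all hypothetical, chosen only to show how such a filter might work.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: entities similar to `target_id` but not linked to it.

    embeddings  -- dict mapping entity id -> embedding vector (assumed layout)
    typed_edges -- iterable of (source_id, target_id) pairs (assumed layout)
    threshold   -- minimum cosine similarity; 0.65 matches the cutoff shown above
    """
    # Entities that already share a typed edge with the target, in either direction.
    linked = {b for a, b in typed_edges if a == target_id}
    linked |= {a for a, b in typed_edges if b == target_id}

    results = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(embeddings[target_id], vector)
        if score >= threshold:
            results.append((entity_id, round(score, 3)))

    # Highest similarity first, as in the list above.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, the citing paper would be excluded from the result because it already has a typed `cites` edge to this concept, while unlinked claims and hypotheses above the cutoff would be returned as candidates for new edges or duplicate checks.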