claim

active

claim:circuit-claims-are-falsifiable-if-you-understand-a-circuit-you-should-be-able-to-predict-what-changes-when-you-edit-the-weights

Circuit claims are falsifiable: if you understand a circuit, you should be able to predict what changes when you edit the weights.

Argument that the circuits methodology meets natural-science standards of falsifiability

Source paper

extracted_from

Zoom In: An Introduction to Circuits

(2020) · Chris Olah · Nick Cammarata · Ludwig Schubert · Gabriel Goh +2

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Interpretability as Natural Science
supports
Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies

Claims (2)

claim

Features are connected by weights forming circuits, and these circuits can be rigorously studied and understood as meaningful algorithms.
supports
Second of three speculative claims asserting that subgraphs of neural networks are tractable and meaningful objects of study
Circuits could act as an epistemic foundation for interpretability by breaking down model behavior into falsifiable statements about small subgraphs.
extends
Normative vision for how the circuits agenda could resolve the pre-paradigmatic state of interpretability

Methods (1)

method

Weight Editing
supports
Editing network weights to test predictions about circuit function; proposed as a falsifiability test for circuit claims
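
The weight-editing test can be illustrated on a toy network. This is a minimal sketch, not the procedure from the source paper: the two-layer ReLU network, its hand-picked weights, and the helper names (`forward`, `sensitivity`, `ablate_edge`, `check_circuit_claim`) are all assumptions made for illustration. The circuit claim under test is "input 0 influences the output only through hidden unit 0". If that is true, zeroing hidden unit 0's outgoing weight should remove the output's response to input 0 and leave its response to input 1 unchanged. The second network violates the claim on purpose, so the same edit falsifies it.

```python
import copy

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(params, x):
    """Toy two-layer ReLU network: y = W2 . relu(W1 . x)."""
    hidden = [max(0.0, h) for h in matvec(params["W1"], x)]
    return matvec(params["W2"], hidden)[0]

def sensitivity(params, x, index, delta=1.0):
    """Change in the output when input `index` is increased by `delta`."""
    bumped = list(x)
    bumped[index] += delta
    return forward(params, bumped) - forward(params, x)

def ablate_edge(params, hidden_index):
    """Return a copy of the weights with one hidden-to-output edge zeroed."""
    edited = copy.deepcopy(params)
    edited["W2"][0][hidden_index] = 0.0
    return edited

def check_circuit_claim(params, x, tol=1e-9):
    """Test the claim 'input 0 reaches the output only via hidden unit 0'.

    Prediction: after ablating hidden unit 0's output weight, the output
    stops responding to input 0, and its response to input 1 is unchanged.
    """
    edited = ablate_edge(params, hidden_index=0)
    s0_after = sensitivity(edited, x, 0)
    s1_before = sensitivity(params, x, 1)
    s1_after = sensitivity(edited, x, 1)
    survives = abs(s0_after) < tol and abs(s1_after - s1_before) < tol
    return {"input0_sensitivity_after": s0_after, "claim_survives": survives}

# Hypothetical weights where the claim holds: each input feeds one hidden unit.
clean = {"W1": [[1.0, 0.0], [0.0, 1.0]], "W2": [[2.0, 3.0]]}

# Hypothetical weights where the claim is false: input 0 also feeds hidden unit 1.
leaky = {"W1": [[1.0, 0.0], [0.5, 1.0]], "W2": [[2.0, 3.0]]}

clean_result = check_circuit_claim(clean, x=[1.0, 1.0])
leaky_result = check_circuit_claim(leaky, x=[1.0, 1.0])
print(clean_result["claim_survives"], leaky_result["claim_survives"])
```

The point of the sketch is the shape of the test: the prediction is written down before the edit, and the edited network can contradict it. In the leaky network the output still responds to input 0 after the ablation, so the claim is rejected.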

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
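
One way such a list could be produced is sketched below. This is an assumption about the mechanism, not a description of how this page is actually generated: the entity ids, the embedding vectors, and the function name `similar_without_edge` are hypothetical. The sketch keeps entities whose cosine similarity to the target is at least 0.65 and drops any that already share a typed edge with it.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities near `target_id` in embedding space with no typed edge to it.

    `embeddings` maps entity id -> vector; `typed_edges` is a set of
    (source_id, destination_id) pairs. Results are sorted by descending score.
    """
    linked = {dst for src, dst in typed_edges if src == target_id}
    linked |= {src for src, dst in typed_edges if dst == target_id}
    scored = [
        (other, cosine(embeddings[target_id], vec))
        for other, vec in embeddings.items()
        if other != target_id and other not in linked
    ]
    kept = [(other, score) for other, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical two-dimensional embeddings, for illustration only.
embeddings = {
    "falsifiability-claim": [1.0, 0.0],
    "weight-editing": [1.0, 0.1],    # very similar, but already linked
    "read-off-weights": [0.8, 0.6],  # similar and unlinked
    "unrelated-entity": [0.0, 1.0],  # below the threshold
}
typed_edges = {("weight-editing", "falsifiability-claim")}

neighbors = similar_without_edge("falsifiability-claim", embeddings, typed_edges)
print(neighbors)
```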

"[W]e must confess that perception, and what depends upon it, is inexplicable in terms of mechanical reasons... when inspecting its interior, we will find only parts that push one another, and we will never find anything to explain a perception."
quote · 0.741
Canonical illustration of the Hard Problem intuition that any functional/mechanical explanation faces an explanatory gap for perception
NLA explanations can contain claims about the target model's input context that are verifiably false, but are typically thematically faithful to the context.
claim · 0.736
Key limitation identified: NLAs hallucinate specific details while preserving thematic accuracy; informs practical usage.
NLA explanations confabulate false specifics but maintain thematic fidelity; claims repeated across tokens more likely true than isolated claims.
claim · 0.728
Core limitation and usage heuristic: read NLAs for themes rather than individual factual claims; cross-check with original context.
"[W]e must confess that perception, and what depends upon it, is inexplicable in terms of mechanical reasons... we will find only parts that push one another, and we will never find anything to explain a perception."
quote · 0.728
Load-bearing quote from Monadology §17 providing earliest clear statement of the Hard Problem
"You can literally read meaningful algorithms off of the weights."
quote · 0.724
Load-bearing claim about the tractability of circuit analysis; central thesis of Claim 2
The Elephant programmer in verifying his program need not show that the promise will be fulfilled because it was made. It is enough that he show it will be fulfilled.
claim · 0.723
Rejection of one of Dorschel's conditions for happy performance.
The sensitivity to think/don't think instructions may be achieved via a circuit that tags tokens as attention-worthy based on instructions or incentives
hypothesis · 0.722
Mechanism for how the model modulates representation strength.
Whether overall model behavior can be broken down into statements about circuits remains undemonstrated
question · 0.721
Identified gap: circuits are small-scope; linking them to model-level behavior requires future work