question

active

question:whether-overall-model-behavior-can-be-broken-down-into-statements-about-circuits-remains-undemonstrated

Whether overall model behavior can be broken down into statements about circuits remains undemonstrated

Identified gap: circuits are small-scope; linking them to model-level behavior requires future work

Source paper

extracted_from

Zoom In: An Introduction to Circuits

(2020) · Chris Olah · Nick Cammarata · Ludwig Schubert · Gabriel Goh +2

Neighborhood — ranked by edge-count

Papers (1)

paper

Zoom In: An Introduction to Circuits
associated_with

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Circuits could act as an epistemic foundation for interpretability by breaking down model behavior into falsifiable statements about small subgraphs. · claim · 0.816
Normative vision for how the circuits agenda could resolve the pre-paradigmatic state of interpretability
Analogous features and circuits form across models and tasks. · claim · 0.756
Third of three speculative claims asserting that learned features are not model-specific but represent universal solutions to learning problems
We hypothesize that explicitly instructing the model to evaluate the correctness of the given statement may change the geometry of truth directions. · hypothesis · 0.745
Motivating hypothesis for Section 5's investigation of prompt template effects.
Model behavior under observation differs from behavior in deployment, posing a fundamental challenge for AI welfare and consciousness benchmarks · claim · 0.745
Epistemic claim that benchmark-based assessments of AI consciousness or welfare may be invalid if models can detect evaluation.
In order to actually do anything, the model must act through simulation of something. · claim · 0.743
Key consequence: GPT's power comes from simulating something contingent.
What if the concept being manipulated does not lie on a straight line in the model's representations? · question · 0.743
The motivating question that opens the paper and leads to the development of manifold steering.
can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.742
Question about practical safety application of feature monitoring.
LLMs sometimes know statements are false but generate them anyway, motivating the need for techniques that inspect internal model state rather than outputs alone · claim · 0.741
Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings
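
The "Related by similarity" criterion above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. How this page actually computes its embeddings and scores is not stated, so everything here is an assumption: the function names, the toy two-dimensional vectors, and the representation of typed edges as unordered pairs are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs whose cosine similarity to `target`
    meets the threshold and that have no typed edge to it, best first."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): 2-D vectors standing in for real embeddings.
embeddings = {
    "question": [1.0, 0.0],
    "claim_a": [0.8, 0.6],  # similar to the question, no typed edge
    "claim_b": [0.0, 1.0],  # orthogonal to the question
    "paper": [1.0, 0.0],    # identical direction, but has a typed edge
}
typed_edges = {frozenset(("question", "paper"))}

related = similar_without_edge("question", embeddings, typed_edges)
```

In this toy setup only `claim_a` is returned: `claim_b` falls below the threshold, and `paper` is excluded because it already has a typed edge, which mirrors how the source paper appears under "Papers" rather than in the similarity list.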