question
active
question:whether-overall-model-behavior-can-be-broken-down-into-statements-about-circuits-remains-undemonstrated
Whether overall model behavior can be broken down into statements about circuits remains undemonstrated
Identified gap: circuits are small-scope; linking them to model-level behavior requires future work
Source paper
extracted_from · (2020) · Chris Olah · Nick Cammarata · Ludwig Schubert · Gabriel Goh +2
Neighborhood — ranked by edge-count
Papers (1)
paper
- Zoom In: An Introduction to Circuits · associated_with
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Normative vision for how the circuits agenda could resolve the pre-paradigmatic state of interpretability
- Third of three speculative claims asserting that learned features are not model-specific but represent universal solutions to learning problems
- Motivating hypothesis for Section 5's investigation of prompt template effects.
- Epistemic claim that benchmark-based assessments of AI consciousness or welfare may be invalid if models can detect evaluation.
- Key consequence: GPT's power comes from simulating something contingent.
- What if the concept being manipulated does not lie on a straight line in the model's representations? · question · 0.743 · The motivating question that opens the paper and leads to the development of manifold steering.
- can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? · question · 0.742 · Question about practical safety application of feature monitoring.
- Motivating claim supported by the CAPTCHA example and Perez et al. (2022) findings