framework
active
framework:automated-interpretability

Automated Interpretability

Method using large language models (Claude) to generate and test explanations of features at scale

Neighborhood — ranked by edge-count

Thinkers (1)

thinker
  • Developed automated interpretability approach using LLMs to explain neuron activations

Methods (1)

method
  • Importance Scoring
    associated_with
    Weighted Spearman correlation that corrects for sampling bias in automated interpretability evaluation
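
A minimal sketch of a weighted Spearman correlation, to make the Importance Scoring description concrete. It assumes the common construction of ranking both variables (ties share their mean rank) and then taking a weighted Pearson correlation of the ranks. The function names, and the reading of the weights as inverse sampling probabilities, are illustrative assumptions and are not taken from the source.

```python
from math import sqrt, nan


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def weighted_spearman(predicted, actual, weights):
    """Weighted Pearson correlation computed on the ranks of the inputs.

    `weights` is assumed to hold one non-negative importance weight per
    sample, e.g. the inverse of the probability with which it was sampled.
    """
    if not (len(predicted) == len(actual) == len(weights)):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(predicted), average_ranks(actual)
    total = sum(weights)
    mx = sum(w * a for w, a in zip(weights, rx)) / total
    my = sum(w * b for w, b in zip(weights, ry)) / total
    cov = sum(w * (a - mx) * (b - my) for w, a, b in zip(weights, rx, ry))
    vx = sum(w * (a - mx) ** 2 for w, a in zip(weights, rx))
    vy = sum(w * (b - my) ** 2 for w, b in zip(weights, ry))
    if vx == 0 or vy == 0:
        return nan  # correlation is undefined when one side has no spread
    return cov / sqrt(vx * vy)
```

With uniform weights this reduces to the ordinary Spearman correlation; up-weighting under-sampled activation ranges shifts the score toward how well an explanation predicts those ranges.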

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

  • interpretability · concept · 0.886
    The capability to explain model predictions; a central theme of the paper, with disruption profiles as vehicle.
  • Proposed paradigm for evaluating interpretability work through empirical falsifiability rather than benchmarks or user studies
  • Using Claude 3 Opus to generate feature explanations and predict held-out activations.
  • Advantage of DiffLogic CA over NCA — learned rules are pure binary logic circuits that can be visualized and analyzed
  • Cases where subspace interventions change model behaviour through parallel pathways rather than the target feature
  • Programming technique to restructure a fine-grained Linda program for efficiency by replacing live data structures with passive ones and coarser-grain processes.
  • The field aimed at understanding what neural networks have learned; characterized as pre-paradigmatic in this paper
  • An interpretability paradigm that explains computation in the model's own terms, rather than imposing top-down abstractions; VPD aims to realize this.
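
The "cosine ≥ 0.65 · no typed edge" filter used for the list above can be sketched as follows. This is an illustration under stated assumptions, not the page's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List entities similar to `target` that have no typed edge to it.

    embeddings:  dict mapping entity id -> embedding vector (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    Returns (entity id, score) pairs sorted by descending similarity.
    """
    hits = []
    for other, vector in embeddings.items():
        if other == target:
            continue
        if frozenset((target, other)) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            hits.append((other, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Each returned pair is a candidate for a new typed edge or for a duplicate check, which is how the list above is meant to be read.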