claim

active

claim:learned-features-reflect-the-functionality-of-the-model-and-not-just-the-data-distribution-as-evidenced-by-interpretable-downstream-effects-not-used-in-dictionary-learning

Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning

The authors argue that features are properties of the model because their logit effects and ablation results are consistent with the feature interpretations

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Findings (3)

finding

Dictionary learning on a model with randomly shuffled weights produces mainly single-token and poorly interpretable features
associated_with · supports
Controls for dataset structure, showing trained model activations have richer structure than data distribution alone
Activating the base64 feature A/1/2357 causes the model to generate base64 text
supports
Causal validation of base64 feature function via pinned feature sampling
Pinning A/1/3450 to its maximum observed value causes the model to generate Arabic text from a numeric-prefix context
supports
Causal validation that the Arabic feature has the predicted downstream effect on generation

Methods (3)

method

Logit Weight Analysis
supports
Computing each feature's linear effect on the output token logits via path expansion through the MLP output weights and the unembedding matrix
Feature ablation (zeroing feature activations)
supports
Clamping a feature's value to zero to measure its causal effect on model output.
Pinned Feature Sampling
supports
Setting a feature's value to its maximum observed value and sampling from the model to validate causal interpretations
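
The three methods above can be illustrated on a toy linear setup. This is a minimal sketch under stated assumptions, not the paper's implementation: all weights are random stand-ins, the names `W_dec`, `W_out`, `W_U`, `logit_weights`, `ablate`, `pin`, and `sample_token` are hypothetical, and layer norm and every other path to the logits are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, d_mlp, d_model, n_vocab = 32, 16, 8, 20

# Hypothetical random stand-ins for real weights (assumptions, not from the source).
W_dec = rng.normal(size=(n_feat, d_mlp))   # dictionary decoder: feature -> MLP activations
W_out = rng.normal(size=(d_mlp, d_model))  # MLP output weights
W_U = rng.normal(size=(d_model, n_vocab))  # unembedding matrix


def logit_weights(i):
    """Logit weight analysis: linear effect of one unit of feature i on each logit."""
    return W_dec[i] @ W_out @ W_U


def logits_from_features(f, residual):
    """Toy forward pass: rebuild MLP activations from features, then map to logits."""
    return (residual + (f @ W_dec) @ W_out) @ W_U


def ablate(f, i):
    """Feature ablation: return a copy of f with feature i clamped to zero."""
    g = f.copy()
    g[i] = 0.0
    return g


def pin(f, i, max_value):
    """Pinned feature: return a copy of f with feature i set to its maximum observed value."""
    g = f.copy()
    g[i] = max_value
    return g


def sample_token(logits, rng):
    """Sample one token id from the softmax over the logits."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))


# Sparse, non-negative feature activations for a single token position.
f = np.maximum(rng.normal(size=n_feat), 0.0)
residual = rng.normal(size=d_model)
i = 3
f[i] = 1.5

base = logits_from_features(f, residual)
ablated = logits_from_features(ablate(f, i), residual)
pinned = logits_from_features(pin(f, i, 10.0), residual)  # 10.0 is a made-up maximum
token = sample_token(pinned, rng)
```

In this purely linear toy, the logit change from ablating or pinning a feature is exactly its activation change times `logit_weights(i)`. In a real model the relationship is only approximate, which is why ablation and pinned sampling serve as separate causal checks.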

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The examples of features found in language models suggest they are highly sparse variables, consistent with dictionary learning being applicable · hypothesis · 0.842
Motivation for using sparsity-based dictionary learning on language models
As the subject model scales, how do the ideal expansion factor and required training data for dictionary learning change? · question · 0.815
Scaling laws for dictionary learning are unknown and needed to assess feasibility on frontier models
What metrics can reliably tell us if dictionary learning has successfully extracted high-quality features? · question · 0.794
Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate
What is the 'correct number of features' for dictionary learning, and is this question well-posed? · question · 0.778
Open question about whether there is a true discrete feature count or a continuous splitting process
Learned simulations can be partially observed and lazily rendered, and still work. · claim · 0.773
One of the updates about prosaic ML simulation.
Interpretability features converge across different model architectures, revealing structural similarities. · claim · 0.769
All models exhibit above-baseline representation of the think word when instructed to think about it · finding · 0.769
In the intentional control experiment, all tested models show above-zero cosine similarity to the think word's concept vector.
When a model discovers that its outputs produce effects, it accelerates learning through in-context learning, analogous to lucid dreaming. · claim · 0.768
Describes scaffolding method and the model's meta-learning loop.