finding

active

finding:automated-logit-weight-prediction-achieves-74-mean-accuracy-for-features-vs-58-for-neurons-vs-50-chance

Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance

Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence
supports
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Logit lens prediction accuracy is near-chance at layer 4 (28%) after injection at L2, α=6 · finding · 0.799
Shows that signal integration into explicit prediction has barely begun immediately after injection
Binary detection accuracy (up to 97.3% at L0, α=5) is entirely explained by global logit shifts (r=0.999 correlation with control) · finding · 0.786
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias
Logit-based self-report achieves 3.1–3.7 bits entropy vs 0.03–1.10 bits greedy and 0.68–2.05 bits sampled in LLaMA-3.2-3B · finding · 0.765
Quantifies the information gain from using logit-based expected value over greedy or sampled decoding
Apparent success on binary detection tasks is entirely explained by mechanical logit shifts that bias models toward affirmative responses regardless of question content · claim · 0.759
Primary negative finding reinterpreted as methodological claim: binary paradigm is invalid for testing introspection
A/5 autoencoder (131,072 features) recovers 94.5% of MLP log-likelihood loss reduction · finding · 0.753
Shows that loss recovery increases with autoencoder size
The likelihood of a dedicated feature for a concept (element, city, animal, food) follows a sigmoid in log-frequency of the concept in training data, with threshold frequency inversely proportional to the number of alive features · finding · 0.751
Quantitative relationship between concept frequency and feature presence.
Feature pair A/1/3949 and B/1/3321 have activation correlation 0.98 but negative logit weight correlation, firing on PLOSOne journal abbreviations · finding · 0.749
Demonstrates that activation similarity can diverge from logit weight similarity due to interference
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents · finding · 0.744
Table 2, row 3, showing equivalence when prior preferences match rewards.
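
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This is not the system's actual implementation: the function name `related_by_similarity`, the embedding dictionary, and the representation of typed edges as unordered pairs are all assumptions made for illustration. Only the 0.65 threshold comes from the page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: list entities whose embedding has cosine >= threshold
    with the target and that share no typed edge with it, highest score first."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Assumption: typed edges are stored as unordered pairs of entity ids.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data: "d" is similar to "a" but already linked by a typed edge, so it is skipped.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.2]}
typed_edges = {frozenset(("a", "d"))}
result = related_by_similarity("a", embeddings, typed_edges)
```

With this toy data only `"b"` is returned for `"a"`: `"c"` falls below the threshold and `"d"` is excluded because it already has a typed edge.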