paper:20222022. Linear adversarial concept erasure
Similar preprints — Semantic Scholar
Cited by (5)
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
- pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
pyvene is an open-source Python library that unifies intervention-based research on PyTorch neural models by treating the intervention itself—rather than model surgery code—as the primitive abstractio
- CausalGym: Benchmarking causal interpretability methods on linguistic tasks
CausalGym, a benchmark derived from SyntaxGym's 33 test suites and expanded to 29 tasks, establishes that distributed alignment search (DAS) consistently outperforms linear probing, difference-in-mean
- Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis
Applying Integrated Information Theory (IIT) versions 3.0 and 4.0 to sequences of internal representations from four open-source LLMs — LLaMA3.1-8B, LLaMA3.1-70B, Mistral-7B, and Mixtral-8x7B — across
- Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
Harness-updating capability is essentially flat across model capability tiers, while harness-benefit is non-monotonic — a decoupling with direct implications for how capability budgets should be alloc