claim

active

claim:given-the-linear-representation-hypothesis-and-binary-linguistic-features-1d-dii-is-sufficiently-expressive-for-controlling-model-behaviour-in-causalgym

Given the linear representation hypothesis and binary linguistic features, 1D DII is sufficiently expressive for controlling model behaviour in CausalGym

Theoretical justification for the methodological choice of 1D DII throughout the benchmark
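
As an illustration of what a one-dimensional distributed interchange intervention (1D DII) does, here is a minimal sketch. It assumes the rank-one form b + ((s - b) · a) a, where b is the base representation, s the source representation, and a a unit-norm direction. The function names (`dot`, `normalize`, `dii_1d`) and the toy vectors are hypothetical and are not taken from the CausalGym codebase.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def normalize(a):
    """Rescale a to unit length so it defines a single direction."""
    norm = math.sqrt(dot(a, a))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in a]

def dii_1d(base, source, direction):
    """Assumed 1D DII: replace the base's component along `direction`
    with the source's component, leaving everything else untouched."""
    a = normalize(direction)
    shift = dot(source, a) - dot(base, a)
    return [b + shift * x for b, x in zip(base, a)]

# Toy example: intervene along the first coordinate axis.
base = [1.0, 2.0, 3.0]
source = [4.0, 0.0, -1.0]
direction = [2.0, 0.0, 0.0]  # normalized internally to [1, 0, 0]
patched = dii_1d(base, source, direction)
```

Only the coordinate along the chosen direction changes, which is why a single direction can be enough when the feature of interest is binary and linearly encoded.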

Source paper

extracted_from

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

(2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts

Neighborhood — ranked by edge-count

Papers (1)

paper

CausalGym: Benchmarking causal interpretability methods on linguistic tasks
introduces

Frameworks (1)

framework

Linear Representation Hypothesis
extends
The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces. · hypothesis · 0.765
Foundation for interpreting features as linear directions.
Trainable intervention (DAS) finds sparser gender representations than linear probing, suggesting probing overestimates causal coverage · claim · 0.763
Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
Representation geometry causally shapes behavior; activation and behavior manifolds are approximately isometric. · claim · 0.760
All three Claude models show high boundary_awareness and low aesthetic_response relative to their own means, a distinctive Constitutional AI signature · finding · 0.758
Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (Marks et al., 2025) · concept · 0.757
Cited as enabling precise behavioral control through SAE features, extending the same methodological line
DAS consistently finds the most causally efficacious features across all Pythia model sizes in CausalGym · finding · 0.756
Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
We hypothesize that representation geometry drives model behavior: the geometric structure of internal representations causally shapes what models do externally. · hypothesis · 0.754
The causal hypothesis motivating the use of causality (intervention) as the lens connecting representation and behavior geometry.
Does the geometric structure of neural representations causally shape model behavior? · question · 0.754
The motivating research question of the paper