claim
active
Given the linear representation hypothesis and binary linguistic features, 1D DII is sufficiently expressive for controlling model behaviour in CausalGym
Theoretical justification for the methodological choice of 1D DII throughout the benchmark
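To make the claim concrete, here is a minimal sketch of a one-dimensional distributed interchange intervention (1D DII), assuming the common formulation in which the base activation is moved along a single unit direction until its projection matches that of the source activation. The function name `dii_1d`, the vector size, and the random inputs are illustrative assumptions, not taken from the source.

```python
import numpy as np

def dii_1d(base, source, direction):
    """Hypothetical 1D DII: swap the component of `base` along `direction`
    for the corresponding component of `source`, leaving the rest intact."""
    a = direction / np.linalg.norm(direction)  # unit feature direction
    # Shift base along a by the difference of the two projections.
    return base + (source @ a - base @ a) * a

# Illustrative random activations (assumed size, not from the source).
rng = np.random.default_rng(0)
base = rng.normal(size=8)
source = rng.normal(size=8)
direction = rng.normal(size=8)

patched = dii_1d(base, source, direction)
a = direction / np.linalg.norm(direction)
```

After the intervention, `patched` has the source's projection onto `a` while everything orthogonal to `a` is still the base's. Under the linear representation hypothesis, a binary feature would live along one such direction, which is the intuition behind one dimension being enough.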
Source paper
extracted_from (2024) · Aryaman Arora · Dan Jurafsky · Christopher Potts
Neighborhood — ranked by edge-count
Papers (1)
paper
Frameworks (1)
framework
- The hypothesis that models internalize concepts as approximately linear directions in representation space; used to interpret MDS injection behavior
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces. · hypothesis · 0.765 · Foundation for interpreting features as linear directions.
- Interpretive claim from Case Study II about the distinction between correlational probes and causal interventions
- Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
- Cited as enabling precise behavioral control through SAE features, extending the same methodological line
- DAS consistently finds the most causally-efficacious features across all Pythia model sizes in CausalGym · finding · 0.756 · Main benchmark result showing DAS superiority over probing, diff-in-means, PCA, k-means, LDA, and random
- The causal hypothesis motivating the use of causality (intervention) as the lens connecting representation and behavior geometry.
- The motivating research question of the paper