finding

active

finding:mas-iia-for-count-vs-low-cumuval-values-1-10-is-higher-than-count-vs-full-cumuval-but-still-lower-than-count-vs-rem-ops

MAS IIA for Count vs Low CumuVal (values 1-10) is higher than Count vs full CumuVal, but still lower than Count vs Rem Ops

Qualifies the arithmetic alignment results; supports the hypothesis that Arithmetic GRUs use different numeric representations than incremental counting models.

Source paper

extracted_from

Model Alignment Search

(2025) · Satchel Grant

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

GRUs trained on the Arithmetic task use different types of numeric representations than incremental counting models
supports
Interpretive hypothesis supported by the IIA between the Count and CumuVal variables, which stays lower than the IIA between Count and Rem Ops even in the restricted value range (values 1-10).

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

MAS successfully aligns the Count variable from Multi-Object GRUs with the Rem Ops variable from Arithmetic GRUs with moderate IIA · finding · 0.799
Shows MAS can compare specific numeric variables across tasks with different domains/codomains.
MAS IIA is low for GRU hidden states vs Transformer hidden states on Multi-Object task, consistent with anti-Markovian transformer solution · finding · 0.742
Validates MAS as a causal detector of representational differences invisible to correlative methods.
For small CL loss weights epsilon, IIA is maintained (potentially improved) while EMD decreases in Boundless DAS on a 7B LLM · finding · 0.739
Empirical result showing the CL loss can reduce divergence without sacrificing interpretability accuracy.
MAS successfully aligns behavior between Multi-Object GRU models in both embedding and hidden state layers with high IIA · finding · 0.732
Demonstrates MAS's ability to bidirectionally transfer behavior where RSA shows low embedding correlation.
Toxic LLMs show higher IIA when compared to other toxic models than when compared to nontoxic models using stepwise MAS · finding · 0.728
Proof-of-principle that MAS can detect model misalignment in DeepSeek-R1-Qwen-1.5B fine-tuned models.
CLMAS achieves the best IIA in the causally inaccessible (No Access) direction while matching MAS in the accessible direction · finding · 0.724
Demonstrates the value of the CL auxiliary loss for recovering causal alignments when one model cannot be intervened upon.
Card-counting heuristics suffice to outperform most LLMs tested. · claim · 0.711
TrackerAgent's second-place ranking calibrates the benchmark and highlights LLM shortcomings.
Factual tasks F0-F3 reach near-perfect AUROC in early-to-mid layers of Llama-3.1-8B; arithmetic tasks A1-A3 emerge much later; counting tasks F4-F5 emerge late similar to arithmetic. · finding · 0.710
Core empirical finding about layer-dependent truth direction emergence across task types.
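
The selection rule behind the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. This page does not say how the embeddings or scores are actually computed, so everything here is an assumption for illustration: the function names, the dictionary-based data layout, and the toy vectors are hypothetical, and only the 0.65 threshold and the "no typed edge" condition come from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Hypothetical helper: entities similar to target_id that have no typed edge to it.

    embeddings:  dict mapping entity id -> embedding vector (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already share a typed edge
    Returns (entity_id, score) pairs sorted by descending cosine score.
    """
    target_vec = embeddings[target_id]
    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue  # never relate an entity to itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed edge (e.g. "supports")
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for this example only.
toy_embeddings = {
    "finding": [1.0, 0.0],
    "hypothesis": [1.0, 0.1],  # very similar, but already linked by a typed edge
    "near": [0.8, 0.6],        # similar and unlinked
    "far": [0.0, 1.0],         # orthogonal to the target
}
toy_edges = {frozenset(("finding", "hypothesis"))}
toy_related = related_by_similarity("finding", toy_embeddings, toy_edges)
```

In this toy setup only "near" survives: "hypothesis" is excluded because it already has a typed edge to the target, and "far" falls below the threshold.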