Grid Scaling Generalization Test

Evaluation of learned circuits on grids 4x larger, run for 4x more steps, than under training conditions

Neighborhood — ranked by edge-count

paper

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
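The selection rule above (cosine ≥ 0.65 and no typed edge) can be illustrated with a minimal sketch. The page does not say how embeddings are stored, how similarity is computed, or whether edges are directed, so every name here (`cosine`, `untyped_neighbors`, `embeddings`, `typed_edges`) is hypothetical and the edges are assumed to be undirected pairs.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """List (name, score) pairs similar to `target` that share no typed edge with it.

    embeddings:  dict mapping entity name -> vector (assumed layout)
    typed_edges: set of frozenset({a, b}) pairs (assumed undirected)
    Results are sorted by score, highest first.
    """
    target_vec = embeddings[target]
    hits = []
    for name, vec in embeddings.items():
        if name == target:
            continue  # never compare the entity with itself
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target_vec, vec)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Entities returned by such a filter would then be reviewed by hand, since a high score alone does not say whether a pair deserves a new edge or is a duplicate.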

task generalization · concept · 0.763
The ability to generalize across tasks; lacking in latent methods.
Test-Time Scaling · concept · 0.756
Approach using extra compute at test time to double-check answers and improve reliability.
Maximum gradient norm scaling · concept · 0.753
Scaling aggregated gradient by the maximum gradient norm among tasks.
Cross-task generalization evaluation · method · 0.752
Measuring AUROC of a probe trained on one task when evaluated on another task to assess universality.
Generalization · concept · 0.750
Ability to apply learned solutions to novel circumstances.
Generalisation · concept · 0.747
Ability to respond appropriately to novel situations based on past regularities; fundamental to learning and intelligence.
Probe Generalization · concept · 0.746
The ability of probes trained on one dataset to transfer accurately to topically and structurally different datasets.
Out-of-Distribution Probe Generalization · concept · 0.736
The capacity of a probe trained on one true/false dataset to accurately classify statements from topically and structurally different datasets.