Cross-task generalization evaluation

Measures the AUROC of a probe that is trained on one task and evaluated on a different task, to assess how universally the probe applies across tasks.
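The evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular paper. The difference-of-means probe, the synthetic tasks, and all helper names (`auroc`, `fit_mean_diff_probe`, `probe_scores`, `make_task`) are hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
import random


def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fit_mean_diff_probe(X, y):
    """Assumed probe: direction = mean(positive features) - mean(negative features)."""
    dim = len(X[0])
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return [
        sum(x[i] for x in pos) / len(pos) - sum(x[i] for x in neg) / len(neg)
        for i in range(dim)
    ]


def probe_scores(w, X):
    """Score each example by its projection onto the probe direction."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in X]


def make_task(rng, direction, n=200, noise=1.0):
    """Synthetic task: the label shifts features along +/- direction, plus Gaussian noise."""
    X, y = [], []
    for k in range(n):
        label = k % 2  # alternate labels so any even-length slice is balanced
        sign = 1.0 if label else -1.0
        X.append([sign * d + rng.gauss(0.0, noise) for d in direction])
        y.append(label)
    return X, y


rng = random.Random(0)
Xa, ya = make_task(rng, [1.0, 0.0, 0.0, 0.0])  # task A
Xb, yb = make_task(rng, [0.8, 0.6, 0.0, 0.0])  # task B: partially shared direction

# Train on the first half of task A; hold out the second half as the in-task baseline.
w = fit_mean_diff_probe(Xa[:100], ya[:100])
in_task = auroc(probe_scores(w, Xa[100:]), ya[100:])
cross_task = auroc(probe_scores(w, Xb), yb)

print(f"in-task AUROC:    {in_task:.3f}")
print(f"cross-task AUROC: {cross_task:.3f}")
```

Comparing `cross_task` against the held-out `in_task` baseline is what makes the number interpretable: a cross-task AUROC near the in-task value suggests the probe direction transfers, while a value near 0.5 suggests it is task-specific.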

Neighborhood — ranked by edge-count

AUROC · concept · implements
Performance metric for binary classification; used to evaluate pathogenicity prediction.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

task generalization · concept · 0.852
The ability to generalize across tasks; lacking in latent methods.
Cross-Architecture Generalization · concept · 0.799
Whether learned cones transfer effectively across model families (Qwen vs Gemma) and sizes
Generalization · concept · 0.792
Ability to apply learned solutions to novel circumstances.
scope generalization · concept · 0.770
Generalization from 2-digit to 3-4 digit arithmetic; limited by mismatch dr.
Generalisation · concept · 0.762
Ability to respond appropriately to novel situations based on past regularities; fundamental to learning and intelligence.
Cross-Judge Analysis · method · 0.758
Validation of judge model robustness by regrading 1000 responses with 4 additional judge models
Grid Scaling Generalization Test · method · 0.752
Evaluation of learned circuits on grids 4x larger with 4x more steps than training conditions
Probe Generalization · concept · 0.750
The ability of probes trained on one dataset to transfer accurately to topically and structurally different datasets