question
active
question:do-distinctions-in-estimates-remain-robust-across-diverse-tom-stimuli-in-repeated-large-scale-trials
Do distinctions in Φ estimates remain robust across diverse ToM stimuli in repeated large-scale trials?
Criterion 2 operationalization: requires p<0.05 in Wilcoxon tests across score categories.
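A minimal sketch of how the Criterion 2 check could be run, under several assumptions not stated in the source: that "Wilcoxon tests" means the unpaired rank-sum variant, that every pair of score categories is compared, and that a normal approximation (without tie correction in the variance) is acceptable. The function names `midranks`, `rank_sum_p_value`, and `criterion_2_met` are hypothetical.

```python
import math
from itertools import combinations


def midranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])                   # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2         # expected rank sum under H0
    var = n1 * n2 * (n1 + n2 + 1) / 12    # variance under H0 (ties ignored)
    z = (w - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def criterion_2_met(phi_by_category, alpha=0.05):
    """Assumed reading: every pair of score categories must differ at p < alpha."""
    pairs = combinations(sorted(phi_by_category), 2)
    return all(
        rank_sum_p_value(phi_by_category[a], phi_by_category[b]) < alpha
        for a, b in pairs
    )
```

Here `phi_by_category` is assumed to map a score-category label to the list of Φ estimates observed for that category. For small samples or heavily tied data, an exact test from a statistics library would be a better choice than this approximation.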
Source paper
extracted_from (2025) · Li, Jingkai
Neighborhood — ranked by edge-count
Claims (1)
claim
- Primary conclusion of the study based on temporal permutation analysis failing all three criteria.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Criterion 1 operationalization: requires >80% 'good' cases (higher score → higher Φ) per ToM task.
- Second of three operational criteria; requires distributional significance in IIT estimates across performance levels.
- First of three operational criteria for identifying consciousness phenomena in LLM representations.
- Criterion 3 operationalization: requires IIT mean AUC to exceed Span Representation mean AUC.
- Main interpretive finding from Criterion 3 comparison showing Span Representation consistently outperforms IIT under temporal permutation.
- Specific prediction linking IIT's prediction of high Φ for good performance to the experimental design's scoring structure.
- Key limitation and open question about experimental scope.
- Even the rare cases where good > bad do not reach the 80% threshold required by Criterion 1.
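The threshold checks named in Criterion 1 and Criterion 3 above can be sketched as follows. This is an illustration only: the input shapes and the function names `criterion_1_met` and `criterion_3_met` are assumptions, and a "good" case is assumed to be a pair in which the higher-scoring response has the strictly higher Φ estimate.

```python
from statistics import mean


def criterion_1_met(cases, threshold=0.80):
    """cases: (phi_for_higher_score, phi_for_lower_score) pairs for one ToM task.

    Returns True when the share of 'good' cases strictly exceeds the threshold.
    """
    good = sum(1 for phi_high, phi_low in cases if phi_high > phi_low)
    return good / len(cases) > threshold


def criterion_3_met(iit_aucs, span_aucs):
    """True when the IIT mean AUC exceeds the Span Representation mean AUC."""
    return mean(iit_aucs) > mean(span_aucs)
```

Criterion 1 is applied per ToM task, so `criterion_1_met` would be called once per task with that task's pairs.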