concept

active

concept:criterion-2-statistically-significant-value-differences-p-0-05-across-tom-score-categories-via-wilcoxon-test

Criterion 2: Statistically significant Φ value differences (p<0.05) across ToM score categories via Wilcoxon test.

Second of three operational criteria; requires that the distributions of IIT estimates (Φ) differ significantly across ToM performance levels.
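A minimal sketch of how a Criterion 2 check could look. The source says only "Wilcoxon test", so this assumes the two-sample rank-sum variant (independent groups, one per score category) with a normal approximation and no tie correction. The function names and the two-category input layout are hypothetical, and the approximation is rough for small samples.

```python
import math

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2          # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value
    return z, p

def satisfies_criterion_2(phi_low, phi_high, alpha=0.05):
    """Assumed reading of Criterion 2: p < alpha between two score categories."""
    _, p = rank_sum_test(phi_low, phi_high)
    return p < alpha
```

With more than two score categories, one plausible extension is to run the test on each pair of categories; whether and how the source corrects for multiple comparisons is not stated here.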

Neighborhood — ranked by edge-count

Concepts (2)

concept

Criterion 1: Φ estimates must yield >80% 'good' cases (higher score → higher Φ) per ToM task to indicate potential consciousness.
associated_with
First of three operational criteria for identifying consciousness phenomena in LLM representations.
Criterion 3: IIT estimates must achieve higher mean AUC than Span Representation for ToM score classification.
associated_with
Third of three operational criteria; distinguishes consciousness from inherent LLM representational separations.
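The two neighboring criteria reduce to simple threshold comparisons, sketched below under stated assumptions. The layout of a "case" as a pair (Φ at the lower score, Φ at the higher score) and both function names are hypothetical; the source does not define how cases are enumerated or how AUC values are computed.

```python
from statistics import mean

def satisfies_criterion_1(case_pairs, threshold=0.80):
    """Criterion 1 sketch: more than 80% of cases must be 'good'.

    case_pairs: iterable of (phi_lower_score, phi_higher_score) tuples
    (assumed layout). A case is 'good' when the higher score has higher Φ.
    """
    pairs = list(case_pairs)
    good = sum(1 for low, high in pairs if high > low)
    return good / len(pairs) > threshold

def satisfies_criterion_3(iit_aucs, span_aucs):
    """Criterion 3 sketch: mean AUC of IIT estimates must exceed Span Representation's."""
    return mean(iit_aucs) > mean(span_aucs)
```

Note the strict inequality in the first check: exactly 80% 'good' cases does not pass, matching the ">80%" wording.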

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Do distinctions in Φ estimates remain robust across diverse ToM stimuli in repeated large-scale trials? · question · 0.782
Criterion 2 operationalization: requires p<0.05 in Wilcoxon tests across score categories.
Can estimates of Φ, the primary metric of IIT, robustly differentiate responses across distinct ToM performance levels? · question · 0.775
Criterion 1 operationalization: requires >80% 'good' cases (higher score → higher Φ) per ToM task.
Wilcoxon Test · method · 0.769
Non-parametric statistical test used to assess significance of Φ differences between ToM score categories.
If 'consciousness' phenomenon can be observed from ToM-related RN, higher ToM test scores should yield higher values of μΦmax (IIT 3.0) and/or μΦ (IIT 4.0). · hypothesis · 0.768
Specific prediction linking IIT's prediction of high Φ for good performance to the experimental design's scoring structure.
Variations in ToM test score categories are more likely attributed to span-level information of the LLM representation sequence rather than to a 'consciousness' phenomenon as suggested by IIT estimates. · claim · 0.751
Main interpretive finding from Criterion 3 comparison showing Span Representation consistently outperforms IIT under temporal permutation.
None of the cases identified under temporal permutation satisfy the Criterion 1 threshold of >80% 'good' cases for any ToM task. · finding · 0.745
Even the rare cases where good > bad fall short of the >80% threshold required by Criterion 1.
At layer 12 (the layer analyzed by Burger et al. 2024), tP and tG explain similar fractions of truth-related variance (~0.33 each). · finding · 0.736
Shows that Burger et al.'s layer choice corresponds to a transitional phase, not a universal property.
Mistral-7B on False Belief (IIT 4.0) is the sole case exhibiting statistically significant Φ differences between score categories under temporal permutation at the task level. · finding · 0.732
Only Criterion 2 is satisfied for this single case, at the task level (the granularity without aggregation).