concept

active

concept:criterion-3-iit-estimates-must-achieve-higher-mean-auc-than-span-representation-for-tom-score-classification

Criterion 3: IIT estimates must achieve higher mean AUC than Span Representation for ToM score classification.

Third of three operational criteria; distinguishes a consciousness phenomenon from separations already inherent in LLM representations.
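A minimal sketch of how the Criterion 3 comparison could be checked, assuming each ToM score category is reduced to a binary label and each method (IIT estimate, Span Representation) yields one scalar per response. The function names, the binary labelling, and the numbers are illustrative assumptions, not the study's actual pipeline or data.

```python
from statistics import mean


def auc(values, labels):
    """Rank-based AUC: probability that a positive case has a higher value
    than a negative case, counting ties as one half."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def criterion_3(iit_aucs, span_aucs):
    """Criterion 3 holds only if the mean IIT AUC strictly exceeds
    the mean Span Representation AUC."""
    return mean(iit_aucs) > mean(span_aucs)


# Hypothetical values for illustration only (not results from the study).
labels = [0, 0, 1, 1]                 # 0 = lower ToM score, 1 = higher ToM score
phi_values = [0.1, 0.4, 0.35, 0.8]    # assumed IIT estimates per response
span_values = [0.2, 0.3, 0.6, 0.9]    # assumed span-representation scores

iit_auc = auc(phi_values, labels)
span_auc = auc(span_values, labels)
satisfied = criterion_3([iit_auc], [span_auc])
```

In practice the two lists passed to `criterion_3` would hold one AUC per run or fold, so that the comparison is between means rather than single values.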

Neighborhood — ranked by edge-count

Frameworks (1)

framework

Span Representation Analysis
associated_with
Framework for characterizing span-level information of sequences of representations, independent of any consciousness estimate; used as a comparison baseline.

Concepts (1)

concept

Criterion 2: Statistically significant Φ value differences (p<0.05) across ToM score categories via Wilcoxon test.
associated_with
Second of three operational criteria; requires distributional significance in IIT estimates across performance levels.
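The Criterion 2 check can be sketched as follows. The entry does not say which Wilcoxon variant is used; this sketch assumes the two-sample rank-sum test with a normal approximation and no tie correction in the variance, and it compares only two score categories at a time. The function names and sample values are hypothetical.

```python
import math


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation.
    Ties receive average ranks; the variance is not tie-corrected."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


def criterion_2(phi_low, phi_high, alpha=0.05):
    """True if the Φ distributions of two score categories differ at level alpha."""
    return rank_sum_p_value(phi_low, phi_high) < alpha


# Hypothetical Φ estimates for two ToM score categories (illustration only).
separated = criterion_2(list(range(1, 11)), list(range(11, 21)))
interleaved = criterion_2([1, 3, 5, 7], [2, 4, 6, 8])
```

For small samples an exact test is preferable to the normal approximation; the approximation is used here only to keep the sketch dependency-free.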

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Can IIT estimates provide a stronger basis for interpreting variations in ToM performance than Span Representation, independent of any consciousness estimate? · question · 0.841
Criterion 3 operationalization: requires IIT mean AUC to exceed Span Representation mean AUC.
Under spatio permutation controls, IIT consciousness estimates outperform Span Representation in mean AUC in several cases (LLaMA3.1-70B on Hinting and Irony, Mistral-7B on Irony, LLaMA3.1-8B on Strange Stories). · finding · 0.814
Contrasts with temporal permutation where Span Representation dominates; suggests spatio permutation reveals different dynamics.
Variations in ToM test score categories are more likely attributable to span-level information in the LLM representation sequence than to a 'consciousness' phenomenon as suggested by IIT estimates. · claim · 0.807
Main interpretive finding from Criterion 3 comparison showing Span Representation consistently outperforms IIT under temporal permutation.
Can estimates of Φ, the primary metric of IIT, robustly differentiate responses across distinct ToM performance levels? · question · 0.794
Criterion 1 operationalization: requires >80% 'good' cases (higher score → higher Φ) per ToM task.
Criterion 1: Φ estimates must yield >80% 'good' cases (higher score → higher Φ) per ToM task to indicate potential consciousness. · concept · 0.772
First of three operational criteria for identifying consciousness phenomena in LLM representations.
None of the cases identified under temporal permutation satisfy the Criterion 1 threshold of >80% 'good' cases for any ToM task. · finding · 0.764
Even the rare cases where 'good' cases outnumber 'bad' ones do not reach the 80% threshold required by Criterion 1.
Directing response attention to complement syntax and/or mental state verbs (MSV) yields no significant alterations in IIT estimates compared to entire stimulus analysis. · finding · 0.763
Suggests LLMs do not represent complement/MSV linguistic features in the way that makes them crucial for human ToM development.
If a 'consciousness' phenomenon can be observed from ToM-related RN, higher ToM test scores should yield higher values of μΦmax (IIT 3.0) and/or μΦ (IIT 4.0). · hypothesis · 0.761
Specific prediction linking IIT's prediction of high Φ for good performance to the experimental design's scoring structure.
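The Criterion 1 threshold listed above can be sketched as a simple proportion check. This assumes each "case" is a pair of Φ estimates, one for a lower-score category and one for a higher-score category, and that a case is 'good' when the higher score has the higher Φ; the pairing scheme and the numbers are hypothetical.

```python
def good_case_fraction(cases):
    """cases: list of (phi_lower_score, phi_higher_score) pairs.
    A case is 'good' when the higher score category has the higher Φ."""
    if not cases:
        raise ValueError("need at least one case")
    good = sum(1 for low, high in cases if high > low)
    return good / len(cases)


def criterion_1(cases, threshold=0.80):
    """Criterion 1 requires strictly more than 80% 'good' cases."""
    return good_case_fraction(cases) > threshold


# Hypothetical Φ pairs for one ToM task (illustration only).
cases = [(0.1, 0.2), (0.3, 0.5), (0.4, 0.3), (0.2, 0.6), (0.5, 0.9)]
fraction = good_case_fraction(cases)
passes = criterion_1(cases)
```

Because the comparison is strict, a task with exactly 80% 'good' cases, as in this example, does not satisfy the criterion.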