Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis
DOI: 10.48550/arXiv.2506.22516
TL;DR
Applying Integrated Information Theory (IIT) versions 3.0 and 4.0 to sequences of internal representations from four open-source LLMs — LLaMA3.1-8B, LLaMA3.1-70B, Mistral-7B, and Mixtral-8x7B — across five Theory of Mind task categories yields no statistically significant evidence of observable "consciousness" phenomena under the three criteria established by this work. The analytical instrument introduced is the Representation Network (RN), a hypothetical network constructed by treating each PCA-reduced embedding dimension (collapsed to D=4 nodes) as a node, with the token sequence forming a time series of binary network states; PyPhi software then computes μΦmax (IIT 3.0) and μΦ (IIT 4.0) as weighted averages over all 16 possible states. Across 165,365 valid samples spanning 12 proportionally sampled transformer layers per model and three linguistic span conditions, IIT-derived Φ estimates fail to reliably discriminate ToM performance score categories, while a consciousness-agnostic Span Representation metric consistently achieves higher mean AUC in 5×5-fold cross-validated logistic regression — the sole exception being spatio-permutation controls, under which two cases (notably Layer 32 of Mixtral-8x7B on Strange Stories with IIT 4.0, across entire-text and complement spans) satisfy all three criteria simultaneously. The paper argues this implies that contemporary Transformer-based LLMs' representation sequences encode performance-relevant information in span-level geometry rather than in IIT-measurable integrated information, though spatio-permutation results leave open the possibility that future agentic systems consuming LLM representations in non-autoregressive modes could yield representations observable as conscious.
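The Representation Network construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes PCA is fitted on a single token sequence (the paper may fit it differently), it uses synthetic random data in place of real hidden states, and the function name `rn_state_sequence` and the bit-to-integer encoding are hypothetical choices made for the example.

```python
import numpy as np

def rn_state_sequence(hidden, n_nodes=4):
    """Map a (tokens x dim) hidden-state matrix to binary RN states."""
    # PCA via SVD on mean-centred data: project onto the top n_nodes components.
    centred = hidden - hidden.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    reduced = centred @ vt[:n_nodes].T              # shape (tokens, n_nodes)
    # Binarize each node's time series against its own mean signal.
    bits = (reduced > reduced.mean(axis=0, keepdims=True)).astype(int)
    # Encode each row of bits as an integer state index in [0, 2**n_nodes).
    # The bit order is an arbitrary convention chosen for this sketch.
    weights = 2 ** np.arange(n_nodes)
    return bits, bits @ weights

rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 64))   # synthetic stand-in for hidden states
bits, states = rn_state_sequence(hidden)
state_counts = np.bincount(states, minlength=16)  # visits per network state
```

With D=4 binary nodes there are 2^4 = 16 possible states, which is why `state_counts` has 16 entries. The observed state sequence is what a tool such as PyPhi would need in order to estimate transition probabilities.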
What to take away
- 1. Across 165,365 valid samples from four LLMs (LLaMA3.1-8B, LLaMA3.1-70B, Mistral-7B, Mixtral-8x7B) and five Theory of Mind task categories, no configuration simultaneously satisfies all three criteria for observing a 'consciousness' phenomenon under temporal permutation controls.
- 2. The Representation Network (RN) is constructed by applying PCA to reduce each model's token-level hidden states to D=4 nodes, binarizing each node's time series relative to its own mean signal, and computing μΦmax (IIT 3.0) and μΦ (IIT 4.0) as state-frequency-weighted averages over 16 possible binary network states using PyPhi.
- 3. A consciousness-agnostic Span Representation — concatenating boundary token vectors, their element-wise product, and their difference following Peters et al. (2018) — outperforms all IIT-derived metrics in mean AUC from 5-time 5-fold cross-validated logistic regression across nearly all ToM tasks and LLMs under temporal permutation controls.
- 4. Under spatio-permutation controls (random shuffling of the embedding dimension order), two cases satisfy all three criteria: Layer 32 (indexed at 11) of Mixtral-8x7B on Strange Stories (2 scores) under IIT 4.0 with both entire-text and complement linguistic spans, representing a tentative indication of observable 'consciousness' phenomena.
- 5. Promising cases satisfying Criteria 1 and 2 but not Criterion 3 cluster in deeper transformer layers. They include Layer 24 (indexed at 8) of LLaMA3.1-8B on Hinting under both IIT 3.0 and 4.0, which lies at approximately 2/3 depth, consistent with prior findings that layers near this depth best predict human brain activity (Schrimpf et al., 2021; Caucheteux et al., 2023).
- 6. Directing response representations' attention toward complement syntax or mental-state verb spans — linguistic features known to drive Theory of Mind development in human children — produces no meaningful improvement in IIT estimates' ability to discriminate ToM performance scores, suggesting a fundamental discrepancy between natural and artificial intelligence in language-cognition coupling.
- 7. Text augmentation targeting a minimum of 1,000 words per score category per stimulus was implemented using GPT-4o (gpt-4o-2024-08-06), Claude 3.5 Sonnet (claude-3-5-sonnet@20240620), and Gemini (google/gemini-1.5-flash-002) to satisfy the Markov property and conditional independence requirements of PyPhi, with optimal concatenation found via heuristic search over token counts from 50 to 1,000 in steps of 50.
- 8. No substantial difference in consciousness indicators was found between larger models (Mixtral-8x7B, LLaMA3.1-70B) and smaller counterparts (Mistral-7B, LLaMA3.1-8B), though this comparison is confounded by substantial sample loss during PyPhi network initialization for IIT 4.0 — particularly severe for Mixtral-8x7B — and by 4-bit quantization required by hardware constraints.
- 9. An open question the paper raises is whether agentic AI systems that produce and consume LLM representations outside the autoregressive next-token prediction paradigm might generate representation sequences for which IIT estimates would cross the threshold of observable 'consciousness' phenomena, since the RN is explicitly defined as independent of the generating model's architecture once representations are produced.
- 10. The study's Criterion 1 threshold, which requires 'good' cases (higher score yielding higher mean Φ) to exceed 80% of valid stimuli per ToM task, is never met across any of the 12 sampled transformer layers, 3 linguistic span conditions, or 4 LLMs under temporal permutation. The maximum observed valid stimulus counts of 13 (Hinting), 19 (False Belief), and 12 (Irony) further limit statistical power.
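The Span Representation in takeaway 3 is described as a concatenation of the two boundary token vectors, their element-wise product, and their difference. A minimal sketch follows. The inclusive end index, the concatenation order, the sign of the difference, and the function name `span_representation` are assumptions made for illustration.

```python
import numpy as np

def span_representation(token_vectors, start, end):
    """Build a span vector from the boundary tokens at start and end (inclusive)."""
    first, last = token_vectors[start], token_vectors[end]
    # Concatenate: [first; last; first * last; first - last].
    return np.concatenate([first, last, first * last, first - last])

tokens = np.arange(12, dtype=float).reshape(4, 3)  # 4 toy tokens, dimension 3
span_vec = span_representation(tokens, 0, 3)
```

For token vectors of dimension d the result has dimension 4d, so it grows with the model's hidden size, unlike the fixed 4-node RN.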
Peer brief — for seminar discussion
Li (2025) asks whether 'consciousness' phenomena, operationalized through Integrated Information Theory, can be detected in the internal representation sequences of Transformer-based LLMs when those sequences are treated as time series of network states. Using human responses from the Strachan et al. (2024) Theory of Mind dataset (publicly available at osf.io/dbn92), which covers Hinting, False Belief, Strange Stories, and Irony Comprehension tasks with 0/1 or 0/1/2 score ratings, the paper extracts hidden states from four open-source models (LLaMA3.1-8B, LLaMA3.1-70B, Mistral-7B, and Mixtral-8x7B) across 12 proportionally sampled transformer layers each. The method introduced is the Representation Network (RN): hidden states are PCA-compressed to 4 dimensions, z-scored, and binarized per node relative to each node's mean signal, yielding a 4-node binary network with 16 possible states, from which IIT 3.0's μΦmax and IIT 4.0's μΦ are computed as state-frequency-weighted averages via PyPhi. Scaled dot-product attention contextualizes each response representation with respect to its stimulus, producing Attended Response Representations (ARR) and, for linguistic-span analyses, Contextually Attended Response Representations (CARR). One alternative that was explicitly considered and rejected is mean pooling of stimulus representations into a single context vector, which would destroy the token-level temporal structure the RN depends on.
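The attention step can be illustrated with a short sketch. It assumes the response token vectors act as queries and the stimulus token vectors act as both keys and values, with no learned projection matrices. The summary does not specify these details, so this is one plausible reading rather than the paper's exact procedure, and the function name is hypothetical.

```python
import numpy as np

def attend_response_to_stimulus(response, stimulus):
    """Scaled dot-product attention: response tokens query stimulus tokens."""
    d = response.shape[-1]
    scores = response @ stimulus.T / np.sqrt(d)       # (n_resp, n_stim)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over stimulus tokens
    # One attended vector per response token, so token order is preserved.
    return weights @ stimulus, weights

rng = np.random.default_rng(0)
response = rng.normal(size=(7, 16))    # synthetic response token vectors
stimulus = rng.normal(size=(11, 16))   # synthetic stimulus token vectors
attended, attn_weights = attend_response_to_stimulus(response, stimulus)
```

The output keeps one row per response token, whereas mean pooling would reduce the stimulus to a single vector that is identical for every response position.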
The load-bearing finding is a null result with an interesting exception. Across 165,365 valid samples and under temporal permutation controls, no configuration simultaneously satisfies all three pre-specified criteria for observable 'consciousness': (1) Φ orders correctly across ToM score categories in >80% of stimuli per task, (2) Wilcoxon tests show significant Φ separation across score categories at p<0.05, and (3) IIT metrics achieve higher mean AUC than the consciousness-agnostic Span Representation in 5×5-fold cross-validated logistic regression. Under spatio-permutation controls, however, two cases (Layer 32 of Mixtral-8x7B on Strange Stories under IIT 4.0 with entire-text and complement spans) do satisfy all three criteria, and IIT metrics broadly lose their consistent inferiority to Span Representation in that condition. The implication is that variation in ToM performance scores is primarily encoded in the span-level geometry of LLM representation sequences rather than in IIT-measurable integrated information, but that the embedding dimension order carries structure relevant to consciousness-like organization. The spatio-permutation finding motivates the paper's core hypothesis for future work: that agentic systems consuming LLM representations outside the autoregressive paradigm might cross the observability threshold.
A critical reader would push back hardest on the PCA reduction to D=4 nodes, a choice forced by the computational cost of IIT. IIT's complexity scales as O(n·5^n), so 4 nodes is essentially the practical ceiling with PyPhi, but compressing embeddings of dimension 4,096 (LLaMA3.1-8B, Mistral-7B, Mixtral-8x7B) or 8,192 (LLaMA3.1-70B) down to 4 principal components almost certainly discards the vast majority of variance that could carry consciousness-relevant structure. The paper acknowledges this but treats it as an acceptable tradeoff; a skeptic would argue that the null result may be entirely an artifact of this compression rather than a property of LLM representations per se. Additionally, the text augmentation procedure, which uses GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Flash to reach a minimum of 1,000 words per score category per stimulus, introduces LLM-generated text into what is supposed to be a corpus of conscious human experience, potentially contaminating the very signal the RN is meant to detect.
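The two permutation controls discussed in this brief differ in which axis of the representation matrix they shuffle. The sketch below assumes temporal permutation shuffles token order and spatio-permutation shuffles embedding-dimension order, with one permutation drawn per sequence. The summary only states the second definition explicitly, so the first and the per-sequence choice are assumptions, and the function names are hypothetical.

```python
import numpy as np

def temporal_permutation(hidden, rng):
    """Shuffle token order (rows); each token vector is left intact."""
    return hidden[rng.permutation(hidden.shape[0])]

def spatio_permutation(hidden, rng):
    """Shuffle embedding-dimension order (columns); token order is left intact."""
    return hidden[:, rng.permutation(hidden.shape[1])]

rng = np.random.default_rng(0)
hidden = np.arange(20.0).reshape(5, 4)   # 5 toy tokens, dimension 4
temporal = temporal_permutation(hidden, rng)
spatio = spatio_permutation(hidden, rng)
```

Temporal shuffling destroys the state-to-state transitions that IIT estimates rely on while keeping each token vector whole. Spatio shuffling keeps the sequence order but changes which embedding dimension sits at which position.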
Findings (13)
- Under spatio permutation controls, IIT consciousness estimates outperform Span Representation in mean AUC in several cases (LLaMA3.1-70B on Hinting and Irony, Mistral-7B on Irony, LLaMA3.1-8B on Strange Stories).
Contrasts with temporal permutation where Span Representation dominates; suggests spatio permutation reveals different dynamics.
- The case at approximately the 2/3 layer of LLaMA3.1-8B (Layer 24, satisfying Criteria 1 and 2) aligns with prior studies showing the 2/3 layer optimally predicts human brain activity.
Connects this study's results to Schrimpf et al. 2021 and Caucheteux et al. 2022/2023 findings on brain-LLM alignment.
- Mistral-7B on False Belief (IIT 4.0) is the sole case exhibiting statistically significant Φ differences between score categories under temporal permutation at the task level.
Only Criterion 2 is satisfied for this single case at the task level (granularity without aggregation).
- No significant disparity in potential consciousness indicators was found between larger models (Mixtral-8x7B, LLaMA3.1-70B) and smaller counterparts (Mistral-7B, LLaMA3.1-8B).
Contradicts expectation from emergent abilities literature; however, interpreted cautiously due to methodological limitations.
- Directing response attention to complement syntax and/or mental state verbs (MSV) yields no significant alterations in IIT estimates compared to entire stimulus analysis.
Suggests that complement and MSV features, which are crucial for human ToM development, do not play a comparable role in LLM representations.
- Under spatio permutation controls, two cases (Layer 32 of Mixtral-8x7B on Strange Stories, IIT 4.0, Linguistic Spans: Entire and Complement) satisfy all three criteria.
Contrasts with temporal permutation results; constitutes the most suggestive evidence of potential consciousness phenomena in LLM representations.
- Under temporal permutation control, no cases meeting all three criteria for observed 'consciousness' phenomenon were found among the 165,365 valid samples.
Primary negative result of the study: temporal permutation analysis finds no statistically significant indicators of consciousness in LLM representations.
- Layer 29 (indexed at 10) of LLaMA3.1-8B on Strange Stories (2 scores) satisfies Criteria 1 and 2 under IIT 4.0 (temporal permutation).
Third promising case from temporal permutation analysis.
- Layer 24 (indexed at 8) of LLaMA3.1-8B on Hinting satisfies Criteria 1 and 2 under both IIT 3.0 and IIT 4.0 (temporal permutation).
One of the most promising cases; approximately corresponds to the 2/3 layer of LLaMA3.1-8B.
- None of the cases identified under temporal permutation satisfy the Criterion 1 threshold of >80% 'good' cases for any ToM task.
Even in the rare configurations where 'good' cases outnumber 'bad' ones, the share of 'good' cases does not exceed the 80% threshold required by Criterion 1.
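The Criterion 1 check referred to in these findings can be made concrete with a small sketch. It assumes a stimulus counts as 'good' when mean Φ increases strictly with every step up in score category, and that the threshold is a strict inequality. All Φ values below are hypothetical numbers invented for the example, not results from the paper.

```python
def criterion_1(mean_phi_by_stimulus, threshold=0.8):
    """mean_phi_by_stimulus maps stimulus id -> {score category: mean phi}."""
    good = 0
    for per_score in mean_phi_by_stimulus.values():
        ordered = [per_score[s] for s in sorted(per_score)]
        # 'Good' case: higher score always yields higher mean phi.
        if all(lo < hi for lo, hi in zip(ordered, ordered[1:])):
            good += 1
    fraction = good / len(mean_phi_by_stimulus)
    return fraction, fraction > threshold

# Hypothetical per-stimulus means (illustrative only).
example = {
    "s1": {0: 0.10, 1: 0.20},
    "s2": {0: 0.05, 1: 0.30},
    "s3": {0: 0.40, 1: 0.35},            # 'bad': higher score, lower phi
    "s4": {0: 0.10, 1: 0.15, 2: 0.25},
    "s5": {0: 0.20, 1: 0.22},
}
fraction, passed = criterion_1(example)
```

In this toy example four of five stimuli are 'good', giving a fraction of exactly 0.8, which does not exceed the threshold. With only 12 to 19 valid stimuli per task, a single stimulus shifts the fraction by several percentage points.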
Claims (11)
- Complement syntax and mental state verb comprehension abilities crucial for human ToM development are not significantly represented in LLMs, revealing fundamental discrepancies between natural and artificial intelligence regarding mind development.
Derived from the finding that linguistic span focusing on complements/MSV yields no significant IIT estimate changes.
- The absolute or isolated value of any IIT estimate lacks intrinsic meaning; IIT estimates are only interpretable when compared across varying levels or contents of consciousness under consistent contextual conditions.
Methodological constraint adopted from IIT literature to justify the comparative experimental design.
- The LLM itself cannot 'experience' what it generates and therefore cannot possess consciousness; the RN is a higher-level construct that is independent of the LLM's architecture once representations are generated.
Key theoretical position distinguishing analysis of representations from analysis of LLM architecture.
- Variations in ToM test score categories are more likely attributed to span-level information of the LLM representation sequence rather than to a 'consciousness' phenomenon as suggested by IIT estimates.
Main interpretive finding from Criterion 3 comparison showing Span Representation consistently outperforms IIT under temporal permutation.
- Theory of Mind is a subset of cognitive abilities enabled by consciousness, not its equivalent; consciousness is a prerequisite for ToM, but ToM is not the entirety of consciousness.
Theoretical clarification distinguishing ToM from consciousness to frame the study's approach.
- Scaled dot-product attention is the most faithful, structured, and theoretically grounded method for incorporating stimulus influence into response representations leading to an RN.
Justifies the methodological choice of attention over concatenation, mean pooling, residual connections, or joint embedding.
- It is plausible that ongoing developments in LLMs may lead to models or agentic systems built on LLMs capable of generating representations observed with 'consciousness' phenomena.
Forward-looking claim suggesting the methodological framework is relevant for future AI systems beyond current LLMs.
- Sequences of contemporary Transformer-based LLM representations lack statistically significant indicators of observed 'consciousness' phenomena under the three stringent criteria.
Primary conclusion of the study based on temporal permutation analysis failing all three criteria.
- PCA is the appropriate dimensionality reduction technique for constructing the RN because it preserves global structure and provides deterministic, interpretable projections.
Justifies PCA choice over UMAP or t-SNE for the node-structured RN model.
- IIT alone cannot serve as a definitive criterion for identifying consciousness in LLM representations due to its panpsychist implications challenging its specificity.
Motivates the hybrid approach combining IIT, Span Representation, and multiple criteria.
Hypotheses (4)
- We hypothesize that a Representation Network (RN) emerges from LLM representations, where each dimension is a node and latent connections exist between nodes or clusters of nodes.
Core methodological hypothesis enabling the application of IIT to LLM representation sequences.
- We hypothesize that 'consciousness' phenomena can be observed in the internal states of an LLM, specifically in its learned representations when analyzed as a sequence.
Primary research hypothesis driving the entire study; operationalized via three criteria.
- If 'consciousness' phenomenon can be observed from ToM-related RN, higher ToM test scores should yield higher values of μΦmax (IIT 3.0) and/or μΦ (IIT 4.0).
Specific prediction tying IIT's expectation of higher Φ for better performance to the scoring structure of the experimental design.
- We hypothesize that potential 'consciousness' phenomena are preferentially associated with deeper transformer layers and the 2/3 layer of LLMs.
Derived from observed alignment of promising cases with semantically rich deeper layers and the brain-aligned 2/3 layer.
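The μΦmax and μΦ quantities named in the third hypothesis are described in this summary as state-frequency-weighted averages over the 16 possible network states. A minimal sketch of that averaging step follows. It assumes the weights are the empirical frequencies of each state in the observed sequence, and it takes the per-state Φ values as given. In the paper those values come from PyPhi; the numbers here are hypothetical placeholders.

```python
from collections import Counter

def weighted_mean_phi(state_sequence, phi_by_state):
    """Average per-state phi, weighted by how often each state occurs."""
    counts = Counter(state_sequence)
    total = sum(counts.values())
    return sum(phi_by_state[s] * n / total for s, n in counts.items())

# Hypothetical inputs: an observed state sequence and per-state phi values.
states = [0, 0, 3, 3, 3, 5, 15, 15]
phi_by_state = {s: 0.0 for s in range(16)}
phi_by_state.update({3: 0.5, 15: 1.0})
mu_phi = weighted_mean_phi(states, phi_by_state)   # 0.5*3/8 + 1.0*2/8 = 0.4375
```

States that never occur in the sequence receive zero weight under this reading, so the average reflects only the states the representation sequence actually visits.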
Questions (6)
- Can IIT estimates provide a stronger basis for interpreting variations in ToM performance than Span Representation, independent of any consciousness estimate?
Criterion 3 operationalization: requires IIT mean AUC to exceed Span Representation mean AUC.
- What is the relationship between different dimensions or clusters of dimensions in LLM representations? Do they and/or how do they interact with each other?
Motivates the RN hypothesis by pointing to the unknown relational structure within high-dimensional representation vectors.
- Can 'consciousness' be observed in the internal states of an LLM, specifically in its learned representations, particularly when analyzed as a sequence?
The primary research question framing the entire study.
- Is 'experience' encoded in sequences of LLM representations beyond mere 'knowledge,' 'understanding,' 'value,' or 'position'?
Secondary question motivating the IIT analysis; asks whether LLM hidden states contain something beyond propositional content.
- Can estimates of Φ, the primary metric of IIT, robustly differentiate responses across distinct ToM performance levels?
Criterion 1 operationalization: requires >80% 'good' cases (higher score → higher Φ) per ToM task.
- Do distinctions in Φ estimates remain robust across diverse ToM stimuli in repeated large-scale trials?
Criterion 2 operationalization: requires p<0.05 in Wilcoxon tests across score categories.
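The Criterion 3 comparison in the first question above can be sketched with scikit-learn. The example assumes a binary score task, uses 5 repeats of stratified 5-fold cross-validation, and runs on synthetic features that only stand in for real Φ estimates and Span Representations. The function name `mean_cv_auc` and the data are hypothetical, and a task with three score categories would need a multi-class AUC variant.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def mean_cv_auc(features, labels, seed=0):
    """Mean ROC AUC of logistic regression over 5 repeats of 5-fold CV."""
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=seed)
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, features, labels, cv=cv, scoring="roc_auc")
    return scores.mean(), scores

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
# Synthetic stand-ins: an uninformative 1-D "phi" feature and an 8-D
# "span" feature whose mean is shifted by the label.
phi_feature = rng.normal(size=(100, 1))
span_feature = rng.normal(size=(100, 8)) + labels[:, None]
phi_auc, phi_scores = mean_cv_auc(phi_feature, labels)
span_auc, span_scores = mean_cv_auc(span_feature, labels)
criterion_3_met = phi_auc > span_auc   # IIT must beat Span Representation
```

Because the synthetic features are constructed by hand, the outcome here says nothing about the paper's data. The sketch only shows the shape of the comparison: 25 fold-level AUC values per feature set, averaged and then compared.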
Original abstract
Integrated Information Theory (IIT) provides a quantitative framework for explaining consciousness phenomenon, positing that conscious systems comprise elements integrated through causal properties. We apply IIT 3.0 and 4.0 -- the latest iterations of this framework -- to sequences of Large Language Model (LLM) representations, analyzing data derived from existing Theory of Mind (ToM) test results. Our study systematically investigates whether the differences of ToM test performances, when presented in the LLM representations, can be revealed by IIT estimates, i.e., $Φ^{\max}$ (IIT 3.0), $Φ$ (IIT 4.0), Conceptual Information (IIT 3.0), and $Φ$-structure (IIT 4.0). Furthermore, we compare these metrics with the Span Representations independent of any estimate for consciousness. This additional effort aims to differentiate between potential "consciousness" phenomena and inherent separations within LLM representational space. We conduct comprehensive experiments examining variations across LLM transformer layers and linguistic spans from stimuli. Our results suggest that sequences of contemporary Transformer-based LLM representations lack statistically significant indicators of observed "consciousness" phenomena but exhibit intriguing patterns under $\textit{spatio}$-permutational analyses. The Appendix and code are available as Supplementary Materials at: https://doi.org/10.1016/j.nlp.2025.100163.