paper
active
2025
paper:doi-10-48550-arxiv-2506-22516

Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis

TL;DR

Applying Integrated Information Theory (IIT) versions 3.0 and 4.0 to sequences of internal representations from four open-source LLMs (LLaMA3.1-8B, LLaMA3.1-70B, Mistral-7B, and Mixtral-8x7B) across five Theory of Mind task categories yields no statistically significant evidence of observable "consciousness" phenomena under the three criteria the paper establishes. The analytical instrument is the Representation Network (RN), a hypothetical network in which each PCA-reduced embedding dimension (D=4) is treated as a node and the token sequence forms a time series of binary network states; PyPhi then computes μΦmax (IIT 3.0) and μΦ (IIT 4.0) as weighted averages over all 16 possible states. Across 165,365 valid samples spanning 12 proportionally sampled transformer layers per model and three linguistic span conditions, IIT-derived Φ estimates fail to reliably discriminate ToM performance score categories, while a consciousness-agnostic Span Representation metric consistently achieves higher mean AUC in 5×5-fold cross-validated logistic regression. The sole exception arises under spatio-permutation controls, where two cases (Layer 32 of Mixtral-8x7B on Strange Stories with IIT 4.0, for the entire-text and complement spans) satisfy all three criteria simultaneously. The paper argues that the representation sequences of contemporary Transformer-based LLMs encode performance-relevant information in span-level geometry rather than in IIT-measurable integrated information, though the spatio-permutation results leave open the possibility that future agentic systems consuming LLM representations in non-autoregressive modes could yield representations observable as conscious.
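
The RN construction summarized above can be illustrated with a minimal sketch. It assumes PCA via a plain SVD, a strict "above the node's own mean" binarization rule, and an arbitrary bit ordering for encoding states; the function names are hypothetical and are not taken from the paper's code. The per-state Φ values are placeholders: in the paper they come from PyPhi, which this sketch does not call.

```python
import numpy as np

def pca_reduce(hidden, n_nodes=4):
    """Project (tokens, dim) hidden states onto the top n_nodes principal components."""
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_nodes].T

def binarize(series):
    """Set a node to 1 where its signal exceeds that node's own mean, else 0."""
    return (series > series.mean(axis=0, keepdims=True)).astype(int)

def state_frequencies(states):
    """Empirical frequency of each of the 2**n binary network states."""
    n_nodes = states.shape[1]
    # Encode each row as an integer; node 0 is the least significant bit (assumed).
    codes = states @ (1 << np.arange(n_nodes))
    counts = np.bincount(codes, minlength=2 ** n_nodes)
    return counts / counts.sum()

def weighted_phi(phi_per_state, freqs):
    """State-frequency-weighted average of per-state phi values."""
    return float(np.dot(phi_per_state, freqs))

rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 64))        # toy stand-in for (tokens, hidden_dim)
states = binarize(pca_reduce(hidden))      # (200, 4) binary time series
freqs = state_frequencies(states)          # 16 state frequencies
phi_per_state = np.linspace(0.0, 1.5, 16)  # placeholder values, NOT real PyPhi output
mu_phi = weighted_phi(phi_per_state, freqs)
```

Because the PCA projections are already mean-centered, the binarization here reduces to taking the sign of each component; a different thresholding rule would change the state frequencies.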

What to take away

  1. Across 165,365 valid samples from four LLMs (LLaMA3.1-8B, LLaMA3.1-70B, Mistral-7B, Mixtral-8x7B) and five Theory of Mind task categories, no configuration simultaneously satisfies all three criteria for observing a 'consciousness' phenomenon under temporal permutation controls.
  2. The Representation Network (RN) is constructed by applying PCA to reduce each model's token-level hidden states to D=4 nodes, binarizing each node's time series relative to its own mean signal, and computing μΦmax (IIT 3.0) and μΦ (IIT 4.0) as state-frequency-weighted averages over 16 possible binary network states using PyPhi.
  3. A consciousness-agnostic Span Representation, which concatenates boundary token vectors, their element-wise product, and their difference following Peters et al. (2018), outperforms all IIT-derived metrics in mean AUC from 5×5-fold cross-validated logistic regression across nearly all ToM tasks and LLMs under temporal permutation controls.
  4. Under spatio-permutation controls (random shuffling of the embedding dimension order), two cases satisfy all three criteria: Layer 32 (indexed at 11) of Mixtral-8x7B on Strange Stories (2 scores) under IIT 4.0 with both entire-text and complement linguistic spans, representing a tentative indication of observable 'consciousness' phenomena.
  5. Promising cases satisfying Criteria 1 and 2 but not Criterion 3 cluster in deeper transformer layers, including Layer 24 (indexed at 8) of LLaMA3.1-8B on Hinting under both IIT 3.0 and 4.0 and layers at approximately two-thirds depth, consistent with prior findings that such layers best predict human brain activity (Schrimpf et al., 2021; Caucheteux et al., 2023).
  6. Directing response representations' attention toward complement syntax or mental-state verb spans, linguistic features known to drive Theory of Mind development in human children, produces no meaningful improvement in IIT estimates' ability to discriminate ToM performance scores, suggesting a fundamental discrepancy between natural and artificial intelligence in language-cognition coupling.
  7. Text augmentation targeting a minimum of 1,000 words per score category per stimulus was implemented using GPT-4o (gpt-4o-2024-08-06), Claude 3.5 Sonnet (claude-3-5-sonnet@20240620), and Gemini (google/gemini-1.5-flash-002) to satisfy the Markov property and conditional independence requirements of PyPhi, with optimal concatenation found via heuristic search over token counts from 50 to 1,000 in steps of 50.
  8. No substantial difference in consciousness indicators was found between larger models (Mixtral-8x7B, LLaMA3.1-70B) and smaller counterparts (Mistral-7B, LLaMA3.1-8B), though this comparison is confounded by substantial sample loss during PyPhi network initialization for IIT 4.0, which is particularly severe for Mixtral-8x7B, and by the 4-bit quantization required by hardware constraints.
  9. An open question the paper raises is whether agentic AI systems that produce and consume LLM representations outside the autoregressive next-token prediction paradigm might generate representation sequences for which IIT estimates would cross the threshold of observable 'consciousness' phenomena, since the RN is explicitly defined as independent of the generating model's architecture once representations are produced.
  10. The study's Criterion 1 threshold, which requires 'good' cases (higher score yielding higher mean Φ) to exceed 80% of valid stimuli per ToM task, is never met across any of the 12 sampled transformer layers, 3 linguistic span conditions, or 4 LLMs under temporal permutation; the maximum observed valid stimulus counts of 13 (Hinting), 19 (False Belief), and 12 (Irony) further limit statistical power.
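
The Span Representation described in the list above (boundary token vectors, their element-wise product, and their difference) can be sketched as follows. The concatenation order, the direction of the difference, and the inclusive end index are assumptions for illustration; the function name is hypothetical.

```python
import numpy as np

def span_representation(hidden, start, end):
    """Fixed-size span vector built from the two boundary tokens (end is inclusive)."""
    h_start, h_end = hidden[start], hidden[end]
    # [start; end; start * end; start - end], so the output has 4 * hidden_dim entries.
    return np.concatenate([h_start, h_end, h_start * h_end, h_start - h_end])

# Toy (tokens, hidden_dim) matrix with easily checked values.
hidden = np.arange(12, dtype=float).reshape(4, 3)
span = span_representation(hidden, 1, 3)
```

The resulting vector depends only on the two boundary tokens, so it carries no estimate of integrated information; that is what makes it a consciousness-agnostic baseline for the logistic regression comparison.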

Peer brief — for seminar discussion

Li (2025) asks whether 'consciousness' phenomena, operationalized through Integrated Information Theory, can be detected in the internal representation sequences of Transformer-based LLMs when those sequences are treated as time series of network states. Using human responses from the Strachan et al. (2024) Theory of Mind dataset (publicly available at osf.io/dbn92), which covers Hinting, False Belief, Strange Stories, and Irony Comprehension tasks with 0/1 or 0/1/2 score ratings, the paper extracts hidden states from four open-source models (LLaMA3.1-8B, LLaMA3.1-70B, Mistral-7B, and Mixtral-8x7B) across 12 proportionally sampled transformer layers each. The method introduced is the Representation Network (RN): hidden states are PCA-compressed to 4 dimensions, z-scored, and binarized per node relative to each node's mean signal, yielding a 4-node binary network with 16 possible states, from which IIT 3.0's μΦmax and IIT 4.0's μΦ are computed as state-frequency-weighted averages via PyPhi. Scaled dot-product attention is used to contextualize each response representation with respect to its stimulus, producing Attended Response Representations (ARR) and, for linguistic-span analyses, Contextually Attended Response Representations (CARR). The alternative that was explicitly considered and rejected is mean pooling of the stimulus representations into a single context vector, which would destroy the token-level temporal structure the RN depends on.
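
A minimal sketch of the contrast between attention-based contextualization and mean pooling follows. It assumes response tokens act as queries and stimulus tokens as both keys and values, with no learned projection matrices; the summary above does not specify these roles, so treat them as illustrative, and the function names as hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attend(response, stimulus):
    """Scaled dot-product attention of response tokens over stimulus tokens."""
    d = response.shape[-1]
    scores = response @ stimulus.T / np.sqrt(d)  # (response_tokens, stimulus_tokens)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ stimulus, weights           # one context vector per response token

rng = np.random.default_rng(0)
response = rng.normal(size=(7, 16))   # toy (response_tokens, hidden_dim)
stimulus = rng.normal(size=(5, 16))   # toy (stimulus_tokens, hidden_dim)

attended, weights = attend(response, stimulus)  # keeps one row per response token
pooled = stimulus.mean(axis=0)                  # mean pooling: a single vector
```

The attended output keeps one vector per response token, so it can still be read as a time series, whereas mean pooling collapses the stimulus to a single vector with no token axis.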
The load-bearing finding is a null result with an interesting exception. Across 165,365 valid samples and under temporal permutation controls, no configuration satisfies all three pre-specified criteria for observable 'consciousness' simultaneously. The criteria require (1) Φ to order correctly across ToM score categories in more than 80% of stimuli per task, (2) Wilcoxon-significant Φ separation across score categories at p<0.05, and (3) IIT metrics to achieve higher mean AUC than the consciousness-agnostic Span Representation in 5×5-fold cross-validated logistic regression. Under spatio-permutation controls, however, two cases (Layer 32 of Mixtral-8x7B on Strange Stories under IIT 4.0 with entire-text and complement spans) do satisfy all three criteria, and IIT metrics broadly lose their consistent inferiority to Span Representation in that condition. This implies that variation in ToM performance scores is primarily encoded in the span-level geometry of LLM representation sequences rather than in IIT-measurable integrated information, but that the embedding dimension order carries structure relevant to consciousness-like organization. The spatio-permutation finding motivates the paper's core hypothesis for future work: that agentic systems consuming LLM representations outside the autoregressive paradigm might cross the observability threshold.
A critical reader would push back hardest on the PCA reduction to D=4 nodes, a choice forced by a computational bottleneck. The running time of PyPhi's Φ computation grows as O(n·53^n) in the number of nodes n, so 4 nodes is essentially the practical ceiling, but compressing embeddings of dimension 4,096 (LLaMA3.1-8B, Mistral-7B, Mixtral-8x7B) or 8,192 (LLaMA3.1-70B) down to 4 principal components almost certainly discards the vast majority of variance that could carry consciousness-relevant structure. The paper acknowledges this but treats it as an acceptable tradeoff; a skeptic would argue that the null result may be entirely an artifact of this compression rather than a property of LLM representations per se. Additionally, the text augmentation procedure, which uses GPT-4o, Claude 3.5 Sonnet, and Gemini to pad human responses to a minimum of 1,000 words, introduces LLM-generated text into what is supposed to be a corpus of conscious human experience, potentially contaminating the very signal the RN is meant to detect.
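
The three-criteria test discussed in this brief can be sketched as a simple conjunction. The sketch assumes that a 'good' case means mean Φ rises strictly with the score, and it takes the Wilcoxon p-value and the two mean AUCs as precomputed inputs; the function names and all numbers are hypothetical and purely illustrative, not results from the paper.

```python
def is_good_case(mean_phi_by_score):
    """True when mean phi rises strictly with the ToM score (scores in ascending order)."""
    return all(lo < hi for lo, hi in zip(mean_phi_by_score, mean_phi_by_score[1:]))

def criterion_1(per_stimulus_means, threshold=0.80):
    """Good cases must exceed the threshold fraction of valid stimuli."""
    good = sum(is_good_case(means) for means in per_stimulus_means)
    return good / len(per_stimulus_means) > threshold

def satisfies_all(per_stimulus_means, p_value, auc_iit, auc_span, alpha=0.05):
    """Conjunction of the three criteria for one model/layer/span configuration."""
    c1 = criterion_1(per_stimulus_means)  # ordering of phi across score categories
    c2 = p_value < alpha                  # significance, computed elsewhere
    c3 = auc_iit > auc_span               # IIT metric beats the Span Representation
    return c1 and c2 and c3

# Illustrative inputs: 9 of 10 stimuli have mean phi increasing with score.
stimuli = [[0.10, 0.25]] * 9 + [[0.30, 0.20]]
result = satisfies_all(stimuli, p_value=0.01, auc_iit=0.71, auc_span=0.65)

# Exactly 80% good cases does not exceed the threshold.
borderline = criterion_1([[0.10, 0.25]] * 8 + [[0.30, 0.20]] * 2)
```

With the small valid-stimulus counts reported above, a single stimulus can move the good-case fraction by several percentage points, so the strict inequality at 80% matters in practice.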

Original abstract

Integrated Information Theory (IIT) provides a quantitative framework for explaining consciousness phenomenon, positing that conscious systems comprise elements integrated through causal properties. We apply IIT 3.0 and 4.0 -- the latest iterations of this framework -- to sequences of Large Language Model (LLM) representations, analyzing data derived from existing Theory of Mind (ToM) test results. Our study systematically investigates whether the differences of ToM test performances, when presented in the LLM representations, can be revealed by IIT estimates, i.e., $Φ^{\max}$ (IIT 3.0), $Φ$ (IIT 4.0), Conceptual Information (IIT 3.0), and $Φ$-structure (IIT 4.0). Furthermore, we compare these metrics with the Span Representations independent of any estimate for consciousness. This additional effort aims to differentiate between potential "consciousness" phenomena and inherent separations within LLM representational space. We conduct comprehensive experiments examining variations across LLM transformer layers and linguistic spans from stimuli. Our results suggest that sequences of contemporary Transformer-based LLM representations lack statistically significant indicators of observed "consciousness" phenomena but exhibit intriguing patterns under $\textit{spatio}$-permutational analyses. The Appendix and code are available as Supplementary Materials at: https://doi.org/10.1016/j.nlp.2025.100163.
