Active Inference, Curiosity and Insight (doi:10.1162/neco_a_00999)
TL;DR
Minimizing expected variational free energy under a discrete-state Markov decision process generative model is sufficient to produce curiosity, epistemic learning, and insight without any additional machinery. Friston et al. (2017) demonstrate this through two linked mechanisms. First, including posterior beliefs about the likelihood parameters **A** in the expected free energy G(π) introduces a novelty term (information gain about model parameters) that drives agents to sample combinations of hidden states and outcomes they have not yet encountered, resolving ignorance rather than merely ambiguity or risk. Second, Bayesian model reduction (implemented via the spm_MDP_VB_X.m routine in SPM) allows post hoc or online pruning of redundant concentration parameters: a reduced model is accepted when ΔF ≤ −3, corresponding to a Bayes factor of approximately exp(3) ≈ 20:1 in favor of the simpler model. Simulated agents learning a 3-rule, 4-factor abstract contingency task (144 hidden-state combinations, 36 possible outcome combinations) reach near-perfect performance after roughly 14 trials under pure epistemic learning; with online Bayesian model reduction, a group of 64 simulated agents reaches it after approximately 10 trials. The sleep analog (non-REM synaptic pruning followed by REM-like belief re-evaluation) is formalized with the same free energy difference. The paper argues that aha moments are necessarily subpersonal events (optimization of the generative model itself, not modeling of that optimization), that the quality of intelligence is inversely related to the thermodynamic energy expended during convergence (via the Jarzynski equality), and that communicating reduced model priors rather than parameter posteriors gives a principled formal account of shared knowledge: consciousness in the pre-Cartesian sense of con-scire.
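The novelty term can be made concrete with a minimal sketch. The Python below scores one future time step as risk plus ambiguity minus novelty, using a closed-form approximation in which novelty is o · W s with W = (1/a − 1/a0)/2, where a holds Dirichlet counts and a0 their column sums. This form, the function name, and the toy numbers are assumptions made for illustration; they are not taken from spm_MDP_VB_X.m, and the sign conventions there may differ.

```python
import math


def expected_free_energy(a, s, log_c):
    """Score one future time step as risk + ambiguity - novelty.

    a     : Dirichlet counts for the likelihood, a[o][s] > 0 (toy values)
    s     : predicted hidden-state probabilities under a policy
    log_c : log prior preferences over outcomes
    Returns (G, novelty).
    """
    n_out, n_st = len(a), len(s)
    # Column sums a0 and expected likelihood A[o][s] = a[o][s] / a0[s]
    a0 = [sum(a[o][j] for o in range(n_out)) for j in range(n_st)]
    A = [[a[o][j] / a0[j] for j in range(n_st)] for o in range(n_out)]
    # Predicted outcomes o = A s
    o = [sum(A[i][j] * s[j] for j in range(n_st)) for i in range(n_out)]
    # Risk: divergence of predicted outcomes from preferred outcomes
    risk = sum(o[i] * (math.log(o[i]) - log_c[i]) for i in range(n_out) if o[i] > 0)
    # Ambiguity: expected entropy of the likelihood columns
    ambiguity = sum(
        -s[j] * sum(A[i][j] * math.log(A[i][j]) for i in range(n_out))
        for j in range(n_st)
    )
    # Novelty: o . W s, large where counts are small (ignorance is high)
    novelty = sum(
        o[i] * 0.5 * (1.0 / a[i][j] - 1.0 / a0[j]) * s[j]
        for i in range(n_out)
        for j in range(n_st)
    )
    return risk + ambiguity - novelty, novelty


# Two states with identical expected likelihoods but different amounts of experience
counts = [[0.5, 50.0], [0.5, 50.0]]
flat_preferences = [math.log(0.5), math.log(0.5)]
g_unfamiliar, nov_unfamiliar = expected_free_energy(counts, [1.0, 0.0], flat_preferences)
g_familiar, nov_familiar = expected_free_energy(counts, [0.0, 1.0], flat_preferences)
```

In this toy case risk and ambiguity are identical for both policies, so the policy leading to the rarely sampled state has the lower expected free energy purely because of its larger novelty term.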
What to take away
- 1. Including posterior beliefs about likelihood parameters **A** in the expected free energy G(π) introduces a novelty term—parameter information gain—that is distinct from epistemic value (state information gain) and risk, and drives policies that expose the agent to novel hidden-state/outcome combinations to resolve ignorance about contingencies.
- 2. In a 4-factor rule-learning task with 3 × 3 × 4 × 4 = 144 hidden states and 36 possible outcome combinations, a simulated active inference agent reaches correct, confident performance after approximately 14 trials without Bayesian model reduction.
- 3. Applying online Bayesian model reduction after each trial across 64 simulated agents reduces the median trial of insight to approximately 10, with nearly all agents achieving 100% accuracy by that point, compared to chance performance of roughly 30%.
- 4. Bayesian model reduction accepts a reduced (simpler) prior when the free energy difference ΔF ≤ −3. This corresponds to odds of exp(−3) ≈ 0.05 for the full model against the reduced model, that is, a Bayes factor of exp(3) ≈ 20 in favor of the reduced model.
- 5. The paper implements a sleep analog in which non-REM sleep corresponds to synaptic pruning (zeroing redundant concentration parameters) while REM sleep corresponds to re-evaluating posterior concentration parameters by replaying sampled outcomes under the reduced model, with simulations showing no systematic performance difference between exact REM re-evaluation and the computationally cheaper approximation of reassigning counts to surviving connections.
- 6. An open question the paper raises is whether agents anticipating forthcoming data should maintain over-parameterized (full) models as evolutionary endowment—analogous to infant cortical over-connectivity—that are subsequently pruned through experience-dependent plasticity, implying that the appropriate prior model complexity is itself a free parameter requiring empirical investigation.
- 7. To replicate the curiosity simulations, a researcher can run the exact belief-update routines in SPM's spm_MDP_VB_X.m (available at http://www.fil.ion.ucl.ac.uk/spm/) by typing >>DEM and selecting the rule-learning demo, with task structure specified entirely by the (A, B, C, D) parameter matrices.
- 8. After rule learning, simulated neuronal responses (interpreted as firing rates of units encoding the correct color hidden state) show discrimination onset shifted earlier by multiple saccadic epochs compared to pre-learning trials, providing a specific electrophysiological prediction consistent with ERP studies of insight such as the N380 anterior cingulate component reported by Mai et al. (2004).
- 9. Three of the 64 simulated agents exhibit superstitious Bayesian model reduction, abducing an informative color mapping in the wrong context (e.g., looking left under the center rule). This produces persistently incorrect behavior and illustrates the principled trade-off between the ampliative benefit of abduction and susceptibility to chance-consistent false models.
- 10. The paper predicts that the total thermodynamic energy expended during convergence to an approximately optimal solution is a proxy for the path integral of variational free energy via the Jarzynski equality, implying that deep reinforcement learning systems (e.g., DQN; Mnih et al., 2015) that require large compute budgets are, by this criterion, structurally incompatible with the variational principle of least action that underlies genuine machine intelligence.
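The ΔF ≤ −3 acceptance rule in the list above can be illustrated for a single column of Dirichlet concentration parameters. The sketch below uses the standard closed form for Dirichlet model reduction, ΔF = ln B(posterior) + ln B(reduced prior) − ln B(prior) − ln B(reduced posterior), where B is the multivariate beta function and the reduced posterior is the posterior plus the reduced prior minus the full prior. The function names and toy counts are hypothetical, and this is not a transcription of the SPM code.

```python
import math


def log_beta(alpha):
    """Log of the multivariate beta function for a vector of Dirichlet counts."""
    return sum(math.lgamma(x) for x in alpha) - math.lgamma(sum(alpha))


def delta_f(prior, posterior, reduced_prior):
    """Free energy difference (reduced minus full) for one Dirichlet column.

    Negative values favor the reduced model.
    """
    # Posterior under the reduced prior: same data counts, different prior counts
    reduced_posterior = [q + r - p for q, r, p in zip(posterior, reduced_prior, prior)]
    return (log_beta(posterior) + log_beta(reduced_prior)
            - log_beta(prior) - log_beta(reduced_posterior))


def accept_reduced(df, threshold=-3.0):
    """Accept the reduced model when delta F is at or below the threshold."""
    return df <= threshold


prior = [1.0, 1.0, 1.0]            # flat full prior (toy values)
reduced_prior = [1.0, 0.1, 0.1]    # prior that nearly prunes two outcomes (toy values)

consistent = [13.0, 1.0, 1.0]      # prior plus 12 observations of the first outcome
df_consistent = delta_f(prior, consistent, reduced_prior)

spread = [5.0, 5.0, 5.0]           # prior plus 12 observations spread evenly
df_spread = delta_f(prior, spread, reduced_prior)

bayes_factor_at_threshold = math.exp(3.0)  # roughly 20
```

When the 12 observations all agree with the sparse mapping, ΔF falls below −3 and the reduced prior is accepted; when they are spread evenly, ΔF is positive and the full model is retained.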
Peer brief — for seminar discussion
Friston et al. (Neural Computation 29, 2633–2683, 2017) introduce two generalizations of the active inference framework for discrete Markov decision processes and use them to give a unified computational account of curiosity, epistemic learning, and insight. The first generalization adds a novelty term to the expected free energy G(π): the information gain about the likelihood parameters **A**, which is distinct from the existing state-information and risk terms. Policies that expose the agent to novel combinations of hidden states and outcomes thereby become intrinsically attractive, which formalizes curiosity as ignorance-resolution. The second generalization is Bayesian model reduction, applied either offline (the sleep analog) or online (reflection and aha moments): after experience has accumulated, redundant concentration parameters in the likelihood array are pruned whenever the free energy difference ΔF ≤ −3 (Bayes factor ≈ 20), and the resulting simpler, less ambiguous model is adopted as the new prior. The scheme is implemented in the spm_MDP_VB_X.m routine within SPM and applied to a 4-factor abstract rule-learning task with 144 hidden-state combinations, 36 outcome combinations, and up to (4×4)^5 = 1,048,576 possible 5-step policies; at this scale the task is intentionally intractable for standard reinforcement learning or belief-state POMDP solvers. The load-bearing finding is that a simulated agent without model reduction requires approximately 14 trials to reach asymptotic performance, whereas online Bayesian model reduction across 64 agents compresses this to roughly 10 trials with near-100% accuracy; three agents show superstitious false insights, a predicted side effect of ampliative abduction. Non-REM sleep is formalized as parameter pruning and REM sleep as posterior re-evaluation via outcome replay; the simulations show no systematic performance difference between exact re-evaluation and the cheaper approximation of reassigning counts to surviving connections.
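The combinatorics quoted above can be checked directly. In this short sketch the factor sizes 3, 3, 4, 4 come from the summary's own 3 × 3 × 4 × 4 product; reading 4 × 4 as the number of action combinations available at each of 5 steps is an assumption drawn from the formula in the text, and the variable names are illustrative.

```python
from math import prod

# Hidden-state factors as quoted in the summary: 3 x 3 x 4 x 4 levels
factor_sizes = [3, 3, 4, 4]
n_hidden_states = prod(factor_sizes)

# Assumed reading of (4 x 4)^5: 16 action combinations per step, over 5 steps
actions_per_step = 4 * 4
n_steps = 5
n_policies = actions_per_step ** n_steps
```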
The paper argues that aha moments are necessarily subpersonal, because the optimization of a model cannot itself be modeled at the same level, and invokes the Jarzynski equality to argue that thermodynamic energy cost is a proxy for the path integral of variational free energy, making computational efficiency a criterion for genuine intelligence. An alternative method would have been a nonparametric Bayesian approach (e.g., IBP- or CRP-based structure learning, as in Collins & Frank, 2013) that constructs model structure from the bottom up rather than pruning a full model top-down. The paper explicitly sets this class of methods aside and acknowledges that its top-down approach thereby avoids hard questions about model-space construction. A critical reader would push back on the model-space specification: the 36 reduced models tested during online abduction were hand-crafted to contain the true model, so the paradigm demonstrates model selection within a benignly structured hypothesis space rather than establishing how such a space is generated de novo. The paper's own footnote 3 concedes this, yet the core claims about insight-as-abduction depend on it. The paper's key prediction for empirical work is that insight-dependent performance improvements should correlate with an earlier onset of discriminatory neuronal responses (shifted by at least one saccadic epoch) and that the prevalence of such improvements should roughly double following nocturnal sleep, consistent with the Wagner et al. (2004, Nature 427) data cited in support.
Findings (13)
- fMRI showed increased right hemisphere anterior superior temporal gyrus activity for insight solutions, with EEG showing a gamma-band burst in the same region beginning 0.3s prior to insight (Jung-Beeman et al., 2004).
Neural correlate of insight used to support prediction about early neural activity following structure learning
- In early trials, right and left cue locations are more attractive than the central location despite being inherently ambiguous, because the agent knows it is ignorant and can resolve this through novelty exposure.
Demonstration that ignorance-driven novelty-seeking (not ambiguity avoidance) governs early exploration
- Simulated electrophysiological responses show onset of discriminatory neural activity much earlier after rule learning than before, due solely to learned likelihood mappings enabling retrospective inference.
Predicted neural signature of insight: reduced ERP latency and increased early amplitude
- 3 of 64 simulated agents exhibited superstitious (incorrect) abduction, leading to persistently poor performance, demonstrating a trade-off between ampliative benefit and susceptibility to false insight.
Demonstration of failure mode of abductive model reduction
- In the simulations, the threshold for positive evidence in Bayesian model reduction is ΔF ≤ −3, equivalent to an odds ratio of exp(−3) ≈ 0.05 for the full model against the reduced model (the reduced model is ~20 times more likely than the full model).
Quantitative threshold used for accepting reduced models; linked to Bayes factor of ~20
- Resting EEG spectral analysis prior to anagram solving revealed right-lateralized hemispheric asymmetry predicting subsequent insight problem-solving (Kounios et al., 2008).
Evidence that pre-insight neural states (resting EEG) predict insight; supports role of reflection/mind wandering
- ERP study of Chinese riddles localized the N380 (aha effect marker) generator to the anterior cingulate cortex, associated with breaking of mental set (Mai et al., 2004).
Neural evidence linking aha moment to ACC and model restructuring
- Bayesian model reduction after 12 trials correctly removes off-diagonal (redundant) parameters from the likelihood array, recovering the true contingency structure.
Validation that BMR correctly identifies and prunes wrong connections in the likelihood mapping
- Nearly all of 64 simulated agents attain 100% performance at around the 10th trial when allowed to perform abductive Bayesian model reduction after each trial.
Group-level simulation result showing generalizability of BMR benefit across agents
- In single-agent simulation of 32 trials, performance becomes perfect after trial 14 without Bayesian model reduction, with confidence increasing progressively.
Baseline learning curve for pure epistemic learning without structure learning
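The parameter learning that precedes model reduction in these findings can be sketched as the accumulation of Dirichlet counts: each observed outcome is credited to the hidden states in proportion to the posterior belief that they were occupied. The example below is a toy illustration with one 3-level factor, a deterministic outcome, and perfectly known states. These simplifications and the function names are assumptions, not the paper's four-factor task.

```python
def update_counts(a, outcome, state_beliefs):
    """Add one observation to the Dirichlet counts a[o][s].

    The observed outcome is credited to each hidden state in proportion
    to the posterior belief that the state was occupied.
    """
    return [
        [a[o][s] + (state_beliefs[s] if o == outcome else 0.0)
         for s in range(len(state_beliefs))]
        for o in range(len(a))
    ]


def expected_likelihood(a):
    """Normalize each column of counts into P(outcome | state)."""
    n_out, n_st = len(a), len(a[0])
    totals = [sum(a[o][s] for o in range(n_out)) for s in range(n_st)]
    return [[a[o][s] / totals[s] for s in range(n_st)] for o in range(n_out)]


# Toy world: 3 colors, and the outcome always equals the hidden state
a = [[1.0 / 8] * 3 for _ in range(3)]   # weak, uninformative prior counts
for trial in range(12):
    state = trial % 3
    beliefs = [1.0 if s == state else 0.0 for s in range(3)]
    a = update_counts(a, outcome=state, state_beliefs=beliefs)

A = expected_likelihood(a)
```

After 12 trials the diagonal of the expected likelihood dominates, while small off-diagonal counts remain; these residual counts are the kind of redundant parameters that model reduction would then prune.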
Claims (10)
- A measure of the quality of intelligence is the thermodynamic efficiency with which it can be simulated; solutions requiring large computation violate Hamilton's principle and are not candidates for artificial intelligence capable of insight.
Normative criterion for artificial intelligence derived from variational free energy principle
- An aha moment is necessarily subpersonal: one can never remember or articulate abductive reasoning at the level of the model being optimized, because optimizing a model is fundamentally different from modeling an optimization.
Philosophical implication of associating insight with model-level (not parameter-level) optimization
- Insight ('aha moment') is produced by Bayesian model reduction—a qualitative change in generative model structure that is necessarily subpersonal and cannot be articulated at the level of the model.
Core claim linking insight to post hoc Bayesian model optimization
- Curiosity manifests as active sampling of the world to minimize uncertainty about hypotheses for states of the world, driven by novelty-seeking policies that resolve ignorance about contingencies.
Formal definition of curiosity within active inference framework
- The results of abductive reasoning (reduced model priors) can be communicated to other agents as prior beliefs, provided all agents share the same model lexicon or hypothesis space.
Explanation of how knowledge (not just parameters) is shared between agents; links to pre-Cartesian consciousness
- Sleep implements Bayesian model reduction: non-REM sleep performs synaptic regression (complexity reduction) and REM sleep re-evaluates posteriors under the new reduced model.
Neurobiological interpretation linking sleep stages to distinct computational roles in model optimization
- Curiosity, insight, decision-making, and diverse phenomena can all be accommodated by a single imperative: minimization of expected free energy (resolution of uncertainty).
Central thesis of the paper unifying cognitive phenomena under one objective function
- Active inference has no random or stochastic aspects; everything changes to minimize variational free energy in accord with Hamilton's principle of least action.
Ontological claim about the deterministic nature of active inference agents in these simulations
- Task instructions can be transcribed into prior beliefs of a generative model, making instruction-following an instance of prior belief specification.
Practical implication showing task instructions are equivalent to inducing prior beliefs in experimental settings
- The hallmark of mindful inference (consciousness) is the ability to represent or entertain counterfactual hypotheses within the same inference engine.
Proposed criterion distinguishing conscious from non-conscious inference processes
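For reference, the Jarzynski equality invoked in the thermodynamic-efficiency claim above relates the work W done on a system driven between two equilibrium states to the thermodynamic free energy difference ΔF between them; the inequality on the right follows from Jensen's inequality. Note that ΔF here denotes thermodynamic free energy, not the variational free energy difference used for model reduction, and the paper's argument connecting the two is not reproduced in this summary.

```latex
\left\langle e^{-\beta W} \right\rangle = e^{-\beta \Delta F},
\qquad \beta = \frac{1}{k_B T},
\qquad \text{hence} \quad \langle W \rangle \ge \Delta F .
```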
Hypotheses (3)
- We hypothesize that insight produces a profound reduction in the latency of evoked neuronal responses when subjects know the meaning of cues (have learned a rule), equivalently an increase in ERP amplitude to initial cues in a sequence.
Empirically testable neural prediction of the active inference model of insight
- Human participants in the rule-learning paradigm should acquire insight after approximately 7–8 trials, fewer than the ~14 required by Bayes-optimal inference without model reduction, suggesting they perform Bayesian model selection.
Based on informal audience experiments; implies people use prior knowledge about rule structure
- Sleep (Bayesian model reduction) should improve performance on rule-learning tasks, with the prevalence of insight-dependent performance changes roughly doubling after nocturnal sleep.
Prediction consistent with Wagner et al. (2004) finding; extended to the active inference account of sleep
Questions (7)
- Do human participants demonstrate the same insight dynamics predicted by active inference in the rule-learning paradigm (currently under investigation with eye tracking and crowd-sourced reaction times)?
Empirical gap explicitly acknowledged; experiments reportedly in progress at time of writing
- What are the neuronal mechanisms by which prior beliefs from one agent's model are received and properly implemented by a naive agent (neuronal hermeneutics)?
Open question about inter-agent communication beyond model-space assumption
- Why, out of the infinite range of knowable items in the universe, are certain pieces of knowledge more ardently sought and more readily retained than others?
Second of Berlyne's (1954) framing questions; answered by Bayesian model reduction selecting parsimonious models
- What is the nature of the shared model space or lexicon that enables naive agents to properly implement received priors from experienced agents?
Prerequisite for model-level communication; raises issues of neural hermeneutics
- What do we communicate to each other: is this knowledge or meta-knowledge conveyed in the form of prior beliefs?
Open question about inter-agent communication of model structure vs. parameters
- How are model spaces constructed and explored in the absence of a full model (bottom-up structure learning)?
Research gap explicitly identified: the paper uses top-down BMR from a full model, avoiding this challenge
- Why do human beings devote so much time and effort to the acquisition of knowledge?
First of Berlyne's (1954) framing questions; answered by curiosity as expected free energy minimization (novelty)
Original abstract
This article offers a formal account of curiosity and insight in terms of active (Bayesian) inference. It deals with the dual problem of inferring states of the world and learning its statistical structure. In contrast to current trends in machine learning (e.g., deep learning), we focus on how people attain insight and understanding using just a handful of observations, which are solicited through curious behavior. We use simulations of abstract rule learning and approximate Bayesian inference to show that minimizing (expected) variational free energy leads to active sampling of novel contingencies. This epistemic behavior closes explanatory gaps in generative models of the world, thereby reducing uncertainty and satisfying curiosity. We then move from epistemic learning to model selection or structure learning to show how abductive processes emerge when agents test plausible hypotheses about symmetries (i.e., invariances or rules) in their generative models. The ensuing Bayesian model reduction evinces mechanisms associated with sleep and has all the hallmarks of "aha" moments. This formulation moves toward a computational account of consciousness in the pre-Cartesian sense of sharable knowledge (i.e., con: "together"; scire: "to know").
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Active Inference: A Process Theory (2017; cited, in corpus) ≈ 88%
- Active inference and artificial reasoning. Karl Friston, Lancelot Da Costa, Alexander Tschantz, Conor Heins, Christopher Buckley, Tim Verbelen, Thomas Parr (2025) ≈ 88%
- Curiosity is Knowledge: Self-Consistent Learning and No-Regret Optimization with Active Inference. Yingke Li, Anjali Parashar, Enlu Zhou, Chuchu Fan (2026) ≈ 87%
- Active Inference and Epistemic Value in Graphical Models. Thijs van de Laar, Magnus Koudahl, Bart van Erp, Bert de Vries (2022) ≈ 86%
- Modeling arousal potential of epistemic emotions using Bayesian information gain: Inquiry cycle driven by free energy fluctuations. Hideyoshi Yanagisawa, Shimon Honda (2025) ≈ 86%
- Active inference and epistemic value (2015; cited) ≈ 86%
- Reframing the Expected Free Energy: Four Formulations and a Unification. Théophile Champion, Howard Bowman, Dimitrije Marković, Marek Grześ (2024) ≈ 86%
- Life as we know it (2013; cited, in corpus) ≈ 80%
- Active inference: demystified and compared (2021; in corpus) ≈ 85%
- Generative models as parsimonious descriptions of sensorimotor loops. Manuel Baltieri, Christopher L. Buckley (2019) ≈ 85%
- Online reinforcement learning with sparse rewards through an active inference capsule. Alejandro Daniel Noel, Charel van Hoof, Beren Millidge (2021) ≈ 85%
- Active inference for action-unaware agents. Filippo Torresan, Keisuke Suzuki, Ryota Kanai, Manuel Baltieri (2025) ≈ 84%
- Post hoc Bayesian model selection (2011; cited) ≈ 84%
- Active inference, Bayesian optimal design, and expected utility. Noor Sajid, Lancelot Da Costa, Thomas Parr, Karl Friston (2021) ≈ 84%
- Realising Synthetic Active Inference Agents, Part II: Variational Message Updates. Thijs van de Laar, Magnus Koudahl, Bert de Vries (2025) ≈ 84%
Cited by (1)
- Active inference on discrete state-spaces: a synthesis
Active inference on discrete state-spaces, formalized as partially observable Markov decision processes (POMDPs) with likelihood matrix A, transition matrix B, and prior D, unifies perception, plannin