paper:doi-10-1162-neco-a-00999

Active Inference, Curiosity and Insight

TL;DR

Minimizing expected variational free energy under a discrete-state Markov decision process generative model is sufficient to produce curiosity, epistemic learning, and insight without any additional machinery. Friston et al. 2017 demonstrates this across two linked mechanisms: first, including posterior beliefs about likelihood parameters **A** in expected free energy G(π) introduces a novelty term—information gain about model parameters—that drives agents to sample combinations of hidden states and outcomes they have not yet encountered, resolving ignorance rather than merely ambiguity or risk. Second, Bayesian model reduction (implemented via the spm_MDP_VB_X.m routine in SPM) allows post-hoc or online pruning of redundant concentration parameters: a reduced model is accepted when ΔF ≤ −3, corresponding to a Bayes factor of approximately 20:1 in favor of the simpler model. Simulated agents learning a 3-rule, 4-factor abstract contingency task (144 hidden-state combinations, 36 possible outcomes) reach near-perfect performance after roughly 14 trials under pure epistemic learning, dropping to approximately 10 trials when online Bayesian model reduction is applied across 64 simulated agents. The sleep analog—non-REM synaptic pruning followed by REM-like belief re-evaluation—is formalized identically through the same free energy difference equation. The paper argues this implies that aha moments are necessarily subpersonal events (optimization of the generative model itself, not modeling of that optimization), that the quality of intelligence is inversely related to the thermodynamic energy expended during convergence via the Jarzynski equality, and that communicating reduced model priors rather than parameter posteriors constitutes a principled formal account of shared knowledge—consciousness in the pre-Cartesian sense of con-scire.
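
To make the novelty term concrete, the sketch below computes a per-entry novelty weight from Dirichlet concentration parameters over **A**. It assumes the approximation W = ½(1/a − 1/a₀), where a₀ holds the column sums of a; that form is an assumption taken from common active inference software and is not stated in this summary. The names `novelty_weights` and `expected_novelty` and the toy counts are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows: outcomes, columns: hidden states


def novelty_weights(a: Matrix) -> Matrix:
    """Assumed approximation: W[i][j] = 0.5 * (1/a[i][j] - 1/a0[j]),
    where a0[j] is the sum of column j of the concentration parameters."""
    n_rows, n_cols = len(a), len(a[0])
    col_sums = [sum(a[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [
        [0.5 * (1.0 / a[i][j] - 1.0 / col_sums[j]) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def expected_novelty(a: Matrix, state: int) -> float:
    """Novelty expected when the agent is certain it occupies `state`:
    the predicted outcome distribution (normalized column) weighted by W."""
    n_rows = len(a)
    col_sum = sum(a[i][state] for i in range(n_rows))
    w = novelty_weights(a)
    return sum((a[i][state] / col_sum) * w[i][state] for i in range(n_rows))


# Toy counts (hypothetical): state 0 has been sampled often, state 1 never.
counts = [[20.5, 0.5],
          [0.5, 0.5]]
weights = novelty_weights(counts)
familiar = expected_novelty(counts, state=0)
unfamiliar = expected_novelty(counts, state=1)
print(f"familiar state: {familiar:.4f}, unfamiliar state: {unfamiliar:.4f}")
```

Under this assumed form, the weight shrinks as counts accumulate, so a policy leading to the rarely sampled state scores higher on novelty than one leading to the familiar state.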

What to take away

  1. Including posterior beliefs about likelihood parameters **A** in the expected free energy G(π) introduces a novelty term (parameter information gain) that is distinct from epistemic value (state information gain) and risk, and drives policies that expose the agent to novel hidden-state/outcome combinations to resolve ignorance about contingencies.
  2. In a 4-factor rule-learning task with 3 × 3 × 4 × 4 = 144 hidden states and 36 possible outcome combinations, a simulated active inference agent reaches correct, confident performance after approximately 14 trials without Bayesian model reduction.
  3. Applying online Bayesian model reduction after each trial across 64 simulated agents reduces the median trial of insight to approximately 10, with nearly all agents achieving 100% accuracy by that point, compared to chance performance of roughly 30%.
  4. Bayesian model reduction accepts a reduced (simpler) prior when the free energy difference ΔF ≤ −3. At that threshold the evidence ratio of the full model to the reduced model is exp(−3) ≈ 0.05, so the reduced model is approximately 20 times more likely than the full model (exp(3) ≈ 20).
  5. The paper implements a sleep analog in which non-REM sleep corresponds to synaptic pruning (zeroing redundant concentration parameters) while REM sleep corresponds to re-evaluating posterior concentration parameters by replaying sampled outcomes under the reduced model. Simulations show no systematic performance difference between exact REM re-evaluation and the computationally cheaper approximation of reassigning counts to surviving connections.
  6. An open question the paper raises is whether agents anticipating forthcoming data should maintain over-parameterized (full) models as an evolutionary endowment (analogous to infant cortical over-connectivity) that are subsequently pruned through experience-dependent plasticity, implying that the appropriate prior model complexity is itself a free parameter requiring empirical investigation.
  7. To replicate the curiosity simulations, a researcher can run the exact belief-update routines in SPM's spm_MDP_VB_X.m (available at http://www.fil.ion.ucl.ac.uk/spm/) by typing >>DEM and selecting the rule-learning demo, with task structure specified entirely by the (A, B, C, D) parameter matrices.
  8. After rule learning, simulated neuronal responses (interpreted as firing rates of units encoding the correct color hidden state) show discrimination onset shifted earlier by multiple saccadic epochs compared to pre-learning trials, providing a specific electrophysiological prediction consistent with ERP studies of insight such as the N380 anterior cingulate component reported by Mai et al. (2004).
  9. Three of the 64 simulated agents exhibit superstitious Bayesian model reduction, abducing an informative color mapping in the wrong context (e.g., looking left under the center rule). This produces persistently incorrect behavior and illustrates the principled trade-off between the ampliative benefit of abduction and susceptibility to chance-consistent false models.
  10. The paper predicts that the total thermodynamic energy expended during convergence to an approximately optimal solution is a proxy for the path integral of variational free energy via the Jarzynski equality, implying that deep reinforcement learning systems (e.g., DQN; Mnih et al., 2015) that require large compute budgets are, by this criterion, structurally incompatible with the variational principle of least action that underlies genuine machine intelligence.
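
The arithmetic behind the state count and the acceptance threshold in the takeaways can be checked in a few lines. This is a small sketch that assumes ΔF is defined as the free energy of the reduced model minus that of the full model, which is how the threshold is read here; the function name `bayes_factor_for_reduced` is hypothetical.

```python
import math

# Acceptance threshold quoted in the takeaways.
DELTA_F_THRESHOLD = -3.0


def bayes_factor_for_reduced(delta_f: float) -> float:
    """Evidence ratio p(reduced) / p(full), assuming
    delta_f = F_reduced - F_full and F = -ln(evidence)."""
    return math.exp(-delta_f)


# Hidden-state combinations: 3 x 3 x 4 x 4 factor levels.
n_hidden_states = math.prod([3, 3, 4, 4])

bf_reduced = bayes_factor_for_reduced(DELTA_F_THRESHOLD)  # about 20
ratio_full_vs_reduced = 1.0 / bf_reduced                  # about 0.05

print(n_hidden_states, round(bf_reduced, 2), round(ratio_full_vs_reduced, 3))
```

The two ratios are reciprocals of one another: 0.05 is the odds for the full model, 20 the odds for the reduced one.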

Peer brief — for seminar discussion

Friston et al. (Neural Computation 29, 2633–2683, 2017) introduce two generalizations of the active inference framework for discrete Markov decision processes and use them to give a unified computational account of curiosity, epistemic learning, and insight. The first generalization adds a novelty term to the expected free energy G(π): the information gain about likelihood parameters **A**, distinct from the existing state-information and risk terms. Policies that expose the agent to novel combinations of hidden states and outcomes thereby become intrinsically attractive, which formalizes curiosity as ignorance-resolution. The second generalization is Bayesian model reduction applied either offline (sleep analog) or online (reflection/aha moments): after accumulating experience, redundant concentration parameters in the likelihood array are pruned whenever the free energy difference ΔF ≤ −3 (Bayes factor ≈ 20 in favor of the reduced model), and the resulting simpler, less ambiguous model is adopted as the new prior. The simulations use the spm_MDP_VB_X.m routine within SPM, applied to a 4-factor abstract rule-learning task with 144 hidden-state combinations, 36 outcome combinations, and up to (4×4)^5 = 1,048,576 possible 5-step policies; the task is intentionally intractable for standard reinforcement learning or belief-state POMDP solvers at this scale. The load-bearing finding is that simulated agents without model reduction require approximately 14 trials to reach asymptotic performance, while online Bayesian model reduction across 64 agents compresses this to roughly 10 trials and near-100% accuracy, with three agents showing superstitious false insights as a predicted side-effect of ampliative abduction. Non-REM sleep is formalized as parameter pruning and REM as posterior re-evaluation via outcome replay; in simulation, exact REM re-evaluation and the cheaper approximation of reassigning counts to surviving connections show no systematic performance difference.
The paper predicts that aha moments are necessarily subpersonal—optimization of the model cannot itself be modeled at the same level—and invokes the Jarzynski equality to argue that thermodynamic energy cost is a proxy for variational free energy path integrals, making computational efficiency a criterion for genuine intelligence. An alternative method would have been a nonparametric Bayesian approach (e.g., IBP or CRP-based structure learning, as in Collins & Frank, 2013) that constructs model structure from the bottom up rather than pruning a full model top-down; the paper explicitly sets aside this class of methods, acknowledging it avoids hard questions about model-space construction. A critical reader would push back on the model-space specification: the 36 reduced models tested during online abduction were hand-crafted to contain the true model, meaning the paradigm demonstrates model selection within a benignly structured hypothesis space rather than establishing how such a space is generated de novo—the paper's own footnote 3 concedes this, yet the core claims about insight-as-abduction depend on it. The paper's key prediction for empirical work is that insight-dependent performance improvements should correlate with an earlier onset of discriminatory neuronal responses (shifted by at least one saccadic epoch) and that the prevalence of such improvements should roughly double following nocturnal sleep, consistent with Wagner et al. (2004, Nature 427) data cited in support.
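
The pruning rule discussed in this brief can be illustrated for a single column of Dirichlet concentration parameters. This is a minimal sketch, not the SPM implementation: it assumes the standard Dirichlet identity for Bayesian model reduction, ΔF = ln B(posterior) + ln B(reduced prior) − ln B(full prior) − ln B(reduced posterior), with reduced posterior = posterior + reduced prior − full prior and ΔF read as reduced minus full free energy. The function names and toy counts are hypothetical, and the reduced prior uses small positive values rather than exact zeros so the log-beta function stays finite.

```python
import math
from typing import Sequence


def log_beta(alpha: Sequence[float]) -> float:
    """Log of the multivariate beta function for concentration parameters."""
    return sum(math.lgamma(x) for x in alpha) - math.lgamma(sum(alpha))


def delta_free_energy(full_prior, full_posterior, reduced_prior) -> float:
    """Free energy of the reduced model minus that of the full model
    (assumed sign convention); negative values favor the reduced model."""
    reduced_posterior = [
        post + red - full
        for post, red, full in zip(full_posterior, reduced_prior, full_prior)
    ]
    return (log_beta(full_posterior) + log_beta(reduced_prior)
            - log_beta(full_prior) - log_beta(reduced_posterior))


def accept_reduced(delta_f: float, threshold: float = -3.0) -> bool:
    """Apply the acceptance rule quoted in the brief."""
    return delta_f <= threshold


full_prior = [1.0, 1.0, 1.0, 1.0]     # uninformative prior (hypothetical)
reduced_prior = [1.0, 0.1, 0.1, 0.1]  # hypothesis: only outcome 0 matters

# Case 1: eight observations, all of outcome 0.
peaked_posterior = [9.0, 1.0, 1.0, 1.0]
df_peaked = delta_free_energy(full_prior, peaked_posterior, reduced_prior)

# Case 2: eight observations spread evenly over the four outcomes.
flat_posterior = [3.0, 3.0, 3.0, 3.0]
df_flat = delta_free_energy(full_prior, flat_posterior, reduced_prior)

print(f"peaked: {df_peaked:.2f} accept={accept_reduced(df_peaked)}")
print(f"flat:   {df_flat:.2f} accept={accept_reduced(df_flat)}")
```

With consistent evidence the reduced model clears the threshold, while evenly spread evidence leaves the full model in place. The critique about hand-crafted model spaces applies here too: the sketch only scores a candidate `reduced_prior` that someone has already proposed.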

Original abstract

This article offers a formal account of curiosity and insight in terms of active (Bayesian) inference. It deals with the dual problem of inferring states of the world and learning its statistical structure. In contrast to current trends in machine learning (e.g., deep learning), we focus on how people attain insight and understanding using just a handful of observations, which are solicited through curious behavior. We use simulations of abstract rule learning and approximate Bayesian inference to show that minimizing (expected) variational free energy leads to active sampling of novel contingencies. This epistemic behavior closes explanatory gaps in generative models of the world, thereby reducing uncertainty and satisfying curiosity. We then move from epistemic learning to model selection or structure learning to show how abductive processes emerge when agents test plausible hypotheses about symmetries (i.e., invariances or rules) in their generative models. The ensuing Bayesian model reduction evinces mechanisms associated with sleep and has all the hallmarks of "aha" moments. This formulation moves toward a computational account of consciousness in the pre-Cartesian sense of sharable knowledge (i.e., con: "together"; scire: "to know").
