Active inference: demystified and compared

By Noor Sajid · Philip J. Ball · Thomas Parr · Karl J. Friston
Machine Learning Research Group, Department of Engineering Science, University of Oxford; Wellcome Centre for Human Neuroimaging + 2 more

DOI 10.1162/neco_a_01357 OpenAlex W3003886363

LLM Interpretability & Behavioral Analysis · LLM interpretability & self-awareness · Ambiguity Minimization · Model Predictive Control · Dyna-style planning · Bayes-optimal Behavior · Monte-Carlo reinforcement learning · Non-stationary Environment · Value Iteration · Tautology of Reinforcement Learning (rewards reinforce behaviors that secure rewards) · Temperature / Precision Parameter · Temporal Discounting

TL;DR

Active inference agents minimizing expected free energy achieve an average score of 98.90 [98.00, 99.79] in a non-stationary FrozenLake OpenAI gym environment, compared with 64.39 [60.33, 68.44] for Bayesian model-based RL with Thompson sampling and 66.08 [63.28, 68.88] for Q-learning (ε=0.1). The gap arises because active inference treats environmental change as a context-inference problem rather than a reversal-learning problem, so it recovers within a single episode after each goal-hole swap. The paper's primary expository instrument is a discrete state-space and time formulation of active inference. It decomposes expected free energy G into an epistemic value term (the mutual information between outcomes and hidden states) and an extrinsic value term (the expected log prior preference over outcomes), so that exploration and exploitation are both expressions of a single objective and need no separate engineering via ε-greedy schedules or temperature hyperparameters. In reward-free conditions, Q-learning freezes into a deterministic circular policy scoring 0.00, whereas the active inference null model (zero prior preferences) still scores 50.03 [49.70, 50.35] through pure information-seeking. Agents equipped with Dirichlet hyperpriors over outcome preferences learn stable behavioral niches, including counter-intuitive hole-seeking, without any external reward signal. The paper argues that reinforcement learning is therefore a limiting special case of active inference in which the epistemic value term is suppressed and preferences are fixed externally, and that reward-free preference learning dissolves the circularity of the reward hypothesis rather than merely circumventing it.
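For reference, the decomposition summarized above is commonly written as follows in the discrete active inference literature. The notation is an assumption about symbols rather than a quotation from the paper: Q is the approximate posterior, π a policy, τ a future time step, s_τ and o_τ the hidden state and outcome at that step, and P(o_τ) the prior preference over outcomes.

```latex
\begin{aligned}
G(\pi) &= \sum_{\tau} G(\pi, \tau), \\
G(\pi, \tau) &=
  - \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\Big[
      D_{\mathrm{KL}}\big[\, Q(s_\tau \mid o_\tau, \pi) \,\big\|\, Q(s_\tau \mid \pi) \,\big]
    \Big]}_{\text{epistemic value}}
  - \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[ \log P(o_\tau) \big]}_{\text{extrinsic value}}
\end{aligned}
```

The first expectation is the mutual information between outcomes and hidden states under the policy, so minimizing G favors policies that are both informative and likely to produce preferred outcomes.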

What to take away

1. Active inference agents scored 98.90 [98.00, 99.79] in the non-stationary FrozenLake environment versus 64.39 [60.33, 68.44] for Bayesian RL with Thompson sampling and 66.08 [63.28, 68.88] for Q-learning (ε=0.1), with the gap attributable to active inference treating context switches as planning-as-inference rather than reversal learning.
2. In the stationary FrozenLake environment all agents except the null model scored above 80 (active inference: 99.88 [99.64, 100.00]; Bayesian RL: 99.76 [99.45, 100.00]; Q-learning ε=0.1: 97.79 [97.41, 98.16]), showing that the performance advantage of active inference is specific to non-stationarity.
3. Expected free energy G decomposes into a negative mutual information term (epistemic value, driving exploration) and a negative log preference term (extrinsic value, driving exploitation), so the exploration-exploitation trade-off emerges from a single objective without requiring a separate temperature hyperparameter.
4. When all prior preferences are set to zero (the null active inference model), the agent scores 50.03 [49.70, 50.35] across both environments and produces non-overlapping exploratory trajectories covering all uncharted states, a behavior that cannot be motivated within standard RL because there is no reward signal to maximize.
5. Reward shaping experiments across 100 agents over 100 episodes show that Bayesian RL and active inference produce nearly identical average scores when prior preferences are expressed as log-probability equivalents of reward functions, demonstrating that RL is a limiting case of active inference when epistemic value is suppressed.
6. Active inference agents equipped with Dirichlet hyperpriors over outcome preferences (initialized flat at concentration parameter 1) learn stable hole-seeking or Frisbee-seeking niches within 5 episodes by accumulating experience as pseudo-counts, converging on preferences that reflect the outcomes the agent recurrently encounters rather than any externally specified reward.
7. The Bayesian RL agent uses a Dyna-style architecture with k=50 Thompson samples from Beta(α,β) priors over transition and reward models, yet fails to recover quickly in non-stationary conditions because accumulated pseudo-counts make prior reversal increasingly costly with each context switch.
8. A replicable methodology choice is to implement policy pruning by discarding any policy whose log evidence is 20 times less than the most probable policy (ζ=20), with gradient descent step size ζ=4, run for 200 trials of 500 episodes each to obtain stable 95% confidence intervals.
9. An open question is whether more complex Bayesian RL agents with explicit latent context variables, rather than implicit context encoding via reward location, would match active inference's non-stationary performance; the authors identify this as an outstanding research question.
10. The paper predicts that if an appropriate generative model can be formulated, the discrete state-space active inference scheme demonstrated on FrozenLake extends directly to continuous-state domains including robotic arm movement and Atari games, via mixed discrete-continuous or generalized-coordinates-of-motion models.
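The decomposition in takeaway 3 and the null-model behavior in takeaway 4 can be illustrated with a small numerical sketch. This is a minimal one-step example under stated assumptions, not the paper's implementation: the function name `expected_free_energy`, the two-state likelihood matrices, and the restriction to a single time step are all hypothetical choices made for illustration.

```python
import numpy as np

def expected_free_energy(A, qs, log_C):
    """One-step expected free energy for a single policy (illustrative).

    A     : (n_outcomes, n_states) likelihood P(o | s); columns sum to 1
    qs    : (n_states,) predicted state distribution Q(s | pi)
    log_C : (n_outcomes,) log prior preferences over outcomes
    Returns (G, epistemic_value, extrinsic_value).
    """
    eps = 1e-16                                  # guards against log(0)
    qo = A @ qs                                  # predicted outcomes Q(o | pi)
    joint = A * qs                               # joint Q(o, s | pi)
    # Epistemic value: mutual information between outcomes and hidden states.
    epistemic = np.sum(
        joint * (np.log(joint + eps) - np.log(np.outer(qo, qs) + eps))
    )
    # Extrinsic value: expected log prior preference over outcomes.
    extrinsic = float(qo @ log_C)
    return -epistemic - extrinsic, epistemic, extrinsic

qs = np.array([0.5, 0.5])                        # uncertain about the hidden state
log_C = np.zeros(2)                              # null model: zero prior preferences
A_informative = np.array([[0.9, 0.1], [0.1, 0.9]])  # outcomes reveal the state
A_ambiguous = np.array([[0.5, 0.5], [0.5, 0.5]])    # outcomes reveal nothing

G_inf, epi_inf, _ = expected_free_energy(A_informative, qs, log_C)
G_amb, epi_amb, _ = expected_free_energy(A_ambiguous, qs, log_C)
print(f"informative: G={G_inf:.3f}  ambiguous: G={G_amb:.3f}")
```

With zero preferences the extrinsic term vanishes, so the option with informative outcomes has the lower expected free energy. That is the sense in which a preference-free agent still has a reason to explore.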

Peer brief — for seminar discussion

Sajid et al. (2021, Neural Computation) provide a tutorial-style derivation of the discrete state-space active inference formulation and pit it against two RL baselines, Q-learning with ε-greedy exploration and a Dyna-style Bayesian RL agent using Thompson sampling, on a modified 3×3 FrozenLake OpenAI gym environment run for 200 trials of 500 episodes each. The generative model encodes 18 hidden states (9 locations × 2 contexts), 4 action states, and 2 outcome modalities (grid position and score), with utility defined as ±4 nats for rewarding and unrewarding outcomes. The key expository instrument is the expected free energy decomposition G = −(epistemic value) − (extrinsic value), which yields exploration-exploitation arbitration without a separate temperature hyperparameter; the paper calls the resulting behavior 'Bayes-optimal' in the sense of minimizing free energy over distal time horizons rather than maximizing Bellman-optimal returns. The load-bearing empirical finding is that in a non-stationary environment (goal-hole positions swapped at episodes 21, 121, 141, 251, and 451), active inference recovers to near-ceiling performance within a single episode after each switch, scoring 98.90 [98.00, 99.79], while Bayesian RL drops to 64.39 [60.33, 68.44] and Q-learning (ε=0.1) to 66.08 [63.28, 68.88]. The explanation is structural: active inference frames context changes as inference over latent context variables, whereas Bayesian RL with Beta(α,β) pseudo-counts must reverse accumulated evidence, a process that grows more expensive with each successive switch. In reward-free conditions, Q-learning produces a zero-score deterministic circular policy, while the active inference null model (zero prior preferences) scores 50.03 [49.70, 50.35] through pure epistemic foraging.
The paper further demonstrates that agents equipped with Dirichlet hyperpriors (concentration parameter 1, initialized flat) over outcome preferences learn stable behavioral niches, including hole-seeking, within approximately 5 episodes and without any external reward signal. The normative reading is that an agent's preferences describe what kind of creature it is rather than a signal from the environment. The authors argue that this makes standard RL a limiting special case of active inference: suppress the epistemic value term and fix preferences externally, and you recover reward maximization. An alternative baseline, which the paper explicitly defers, is a context-aware Bayesian RL agent with explicit latent context variables (e.g., VARIBAD-style meta-RL), which might close the non-stationary performance gap. A critical reader should press on two points: the FrozenLake environment is very low-dimensional, and the active inference generative model is hand-crafted, with precise prior concentration parameters of 100 and explicit context factors that directly encode the structure of the non-stationarity. The RL baselines are not given equivalent structural priors about context: Bayesian RL encodes context implicitly via the reward location prior, and Q-learning has no context model at all. The non-stationary advantage may therefore reflect a modeling asymmetry rather than a fundamental algorithmic superiority of free-energy minimization. The paper acknowledges this confound only briefly and leaves it as an open research question.
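The structural argument above, context inference versus reversal of accumulated pseudo-counts, can be made concrete with a toy calculation. This is a sketch under stated assumptions and does not reproduce either agent from the paper: the helper names, the two-context likelihood, and the prior values are hypothetical, and how quickly the context belief flips depends on how precise the likelihood is relative to the prior.

```python
import numpy as np

def context_posterior(prior, likelihood, outcome):
    """Bayes update over latent contexts after one observed outcome.

    prior      : (n_contexts,) belief over contexts
    likelihood : (n_outcomes, n_contexts) P(outcome | context)
    """
    post = likelihood[outcome] * prior
    return post / post.sum()

def failures_to_flip(alpha, beta):
    """Count consecutive failures needed before the mean of a
    Beta(alpha, beta + n) posterior falls below 0.5."""
    n = 0
    while alpha / (alpha + beta + n) >= 0.5:
        n += 1
    return n

# Hypothetical setup: outcome 1 is almost only seen in context 1.
likelihood = np.array([[0.999, 0.001],
                       [0.001, 0.999]])
prior = np.array([0.9, 0.1])            # agent currently believes context 0
post = context_posterior(prior, likelihood, outcome=1)

# Pseudo-count learner that has seen 20 and then 100 successes (plus a flat prior).
flip_after_20 = failures_to_flip(alpha=21, beta=1)
flip_after_100 = failures_to_flip(alpha=101, beta=1)
print(post, flip_after_20, flip_after_100)
```

A single precise observation moves most of the belief mass to the other context, while the pseudo-count learner needs a number of contrary observations that grows with the evidence it has already accumulated.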

Methods (3)

Dyna-style planning
A model-based RL architecture that interleaves direct policy learning with hypothetical roll-outs from a learned model.
Monte-Carlo reinforcement learning
Reinforcement learning methods that update parameters at the end of an episode based on sampled returns.
Value Iteration
A dynamic programming method for computing optimal value functions and policies in known MDPs.
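As a point of reference for the last definition, here is a minimal value iteration sketch on a hypothetical three-state chain. The transition tensor, reward vector, and discount factor are illustrative assumptions and are unrelated to the FrozenLake configuration used in the paper.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a known MDP.

    P : (n_actions, n_states, n_states) with P[a, s, s2] = P(s2 | s, a)
    R : (n_states,) reward received on entering state s2
    Returns the optimal state values and a greedy policy.
    """
    V = np.zeros(P.shape[1])
    while True:
        Q = P @ (R + gamma * V)          # (n_actions, n_states) action values
        V_new = Q.max(axis=0)            # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# Hypothetical chain 0 -> 1 -> 2, where state 2 is absorbing and rewarding.
P = np.zeros((2, 3, 3))
P[0] = np.eye(3)                         # action 0: stay in place
P[1, 0, 1] = P[1, 1, 2] = P[1, 2, 2] = 1.0   # action 1: move right
R = np.array([0.0, 0.0, 1.0])

V, policy = value_iteration(P, R)
print(V, policy)
```

The sketch assumes the transition model is known in advance, which is the setting the definition describes; the learning agents compared in the paper do not have that luxury.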

Frameworks (1)

Model Predictive Control
A control approach that uses a model to predict future states and optimizes control actions over a receding horizon.

Findings (12)

All three agent types (active inference, Q-learning, Bayesian RL) perform adequately in stationary FrozenLake; only active inference achieves Bayes-optimal behavior in non-stationary settings.
Key empirical result validating online planning capability of active inference.
In the absence of prior preferences, Active Inference null model and Bayesian RL maintain exploration with average scores of 44.00 and 39.94 respectively, whereas Q-learning does not explore.
Table 2 first row; reward shaping section.
Active inference agent with learnable preferences developed a strict preference for goals (score +) when the Frisbee location was encountered first, becoming a goal-seeking agent.
Figure 5.4 and text.
Under reward shaping (G=100, H=-100, F=0), Active Inference scored 99.52, Bayesian RL 99.77, Q-learning 95.56, with nearly identical behavior between belief-based agents.
Table 2, row 3, showing equivalence when prior preferences match rewards.
Active inference agents engage in information-seeking behavior in reward-free FrozenLake environments, contrasting with Q-learning but similar to Bayesian RL.
Empirical demonstration on FrozenLake; shows epistemic value drives exploration absent reward signal.
In the absence of any reward signal, Q-learning (epsilon=0.1) learns a deterministic circular policy with score 0.00 and does not explore purposefully.
Table 2 first row; reward shaping section.
Active inference recovers performance within 1 episode after context switch in non-stationary FrozenLake, while Bayesian RL requires ~40 episodes.
Figure 4 and discussion in §3.
Active Inference agent achieved average score 99.88 [99.64, 100.00] in deterministic FrozenLake environment across 200 trials of 500 episodes.
Table 1, deterministic environment row.
Active Inference null model (no prior preferences) achieved average score 50.03 [49.70, 50.35] in deterministic FrozenLake.
Table 1.
Active inference and Bayesian model-based RL learn reward-maximizing behavior in <10 episodes in deterministic FrozenLake.
Discussion of Figure 3.
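One way to see why tabular Q-learning stops exploring purposefully without a reward signal is that every temporal-difference error is zero, so the value table never changes. The sketch below is an illustration under stated assumptions, not the paper's experiment: transitions are sampled at random instead of from FrozenLake, and the learning rate and discount factor are hypothetical.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, lr=0.1, gamma=0.95):
    """Standard tabular Q-learning update, applied in place."""
    Q[s, a] += lr * (r + gamma * Q[s_next].max() - Q[s, a])

rng = np.random.default_rng(0)
n_states, n_actions = 9, 4               # 3x3 grid with four moves
Q = np.zeros((n_states, n_actions))

# Feed the learner arbitrary transitions that always carry zero reward.
for _ in range(1000):
    s = rng.integers(n_states)
    a = rng.integers(n_actions)
    s_next = rng.integers(n_states)
    q_learning_step(Q, s, a, r=0.0, s_next=s_next)

greedy_actions = Q.argmax(axis=1)        # ties resolve to the first action
print(Q.any(), greedy_actions)
```

Because the table stays at zero, the greedy choice is decided entirely by the tie-breaking rule, which yields a fixed action in every state. Only the ε fraction of random moves departs from it.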

Claims (17)

Reinforcement learning can be regarded as a limiting or special case of model-based approaches in general — or active inference in particular — when epistemic value is removed.
§3 Discussion.
There is an implicit behavioral equivalence between Bayesian model-based reinforcement learning and active inference when prior preferences are treated as a reward function.
§3, reward shaping conclusion.
The elimination of reward as a motivator of behavior with prior beliefs dissolves the tautology of reinforcement learning (rewards reinforce behaviors that secure rewards).
§4 Discussion.
Active inference agents can learn their own reward function (prior preferences) by interacting with the environment, bypassing the need for an explicit reward signal.
Abstract and §3, preference learning section.
The natural curiosity emerging in active inference contrasts with handcrafted exploration in reinforcement learning such as epsilon-greedy or ad hoc novelty bonuses.
§2, comparing exploration mechanisms.
Temporal discounting emerges naturally from active inference without an explicit discount factor, because predictions in the distant future are less precise.
§2, discussion of precision.
Active inference offers an attractive natural adaptation mechanism for non-stationary environments due to its Bayesian model updating properties.
§3, after non-stationary results.
Active inference agents can carry out epistemic exploration and account for uncertainty about their environment in a Bayes-optimal fashion.
Abstract and §1, summarizing a key property.
Active inference provides a framework (derived from first principles) for solving and understanding the behavior of autonomous agents.
Natural exploration-exploitation trade-offs emerge automatically from expected free energy minimization without hyperparameter tuning.
Active inference achieves Bayes-optimal arbitration between exploration and exploitation without handcrafted mechanisms like ε-greedy.

Hypotheses (2)

If epistemic value is removed from expected free energy, the resulting objective reduces to maximizing expected future reward (pragmatic value).
Stated as conditional statement explaining the special case whence RL emerges.
Active inference achieves Bayes-optimal behavior in non-stationary environments through online belief updating.
Tested via FrozenLake experiments; predicts superior performance when environment dynamics change.
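The first hypothesis can be checked on a toy example: when log preferences are a normalized version of a reward vector, the extrinsic value of a policy equals its expected reward minus a constant, so both criteria rank policies identically once epistemic value is dropped. The reward values, outcome labels, and predicted outcome distributions below are hypothetical.

```python
import numpy as np

def log_preferences_from_rewards(rewards):
    """Normalized log preferences: log P(o) = r(o) - logsumexp(r)."""
    r = np.asarray(rewards, dtype=float)
    log_norm = r.max() + np.log(np.sum(np.exp(r - r.max())))
    return r - log_norm

# Hypothetical outcomes: goal, hole, plain frozen tile.
rewards = np.array([4.0, -4.0, 0.0])
log_C = log_preferences_from_rewards(rewards)

# Predicted outcome distributions Q(o | pi) for three hypothetical policies.
predicted = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.2, 0.2, 0.6]])

extrinsic_value = predicted @ log_C      # expected log preference per policy
expected_reward = predicted @ rewards    # expected reward per policy
offset = extrinsic_value - expected_reward
print(np.argsort(-extrinsic_value), np.argsort(-expected_reward), offset)
```

The offset is the same for every policy because each predicted distribution sums to one, which is why the two rankings coincide.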

Questions (3)

How does active inference compare to reinforcement learning in environments with no rewards or uninformative prior preferences?
Core question addressed by the simulations when rewards are removed.
How can reward functions be meaningfully specified when the same outcome may be valuable or detrimental depending on context?
Motivates active inference's solution: learning prior preferences from interaction rather than external specification.
Can active inference agents learn their own prior preferences without explicit reward signals?
Question answered by the preference learning experiments.
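The preference learning that answers the last question can be sketched as pseudo-count accumulation under a flat Dirichlet hyperprior. This is a simplified illustration: the function name and outcome labels are hypothetical, and the update rule in the paper may differ in detail (for example, in how counts are weighted).

```python
import numpy as np

def learn_preferences(outcome_sequence, n_outcomes, concentration=1.0):
    """Accumulate observed outcomes as Dirichlet pseudo-counts.

    Returns the updated counts and the log of the expected outcome
    distribution, used here as the learned log preferences.
    """
    counts = np.full(n_outcomes, concentration, dtype=float)  # flat hyperprior
    for outcome in outcome_sequence:
        counts[outcome] += 1.0
    return counts, np.log(counts / counts.sum())

# Hypothetical outcome labels: 0 = frozen tile, 1 = hole, 2 = Frisbee.
# An agent that keeps ending up in the hole over five episodes:
counts, log_prefs = learn_preferences([1, 1, 1, 1, 1], n_outcomes=3)
print(counts, log_prefs)
```

Whatever outcome the agent happens to encounter repeatedly ends up with the highest learned preference, which is how a hole-seeking niche can emerge without any external reward.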

Original abstract

Active inference is a first principle account of how autonomous agents operate in dynamic, nonstationary environments. This problem is also considered in reinforcement learning, but limited work exists on comparing the two approaches on the same discrete-state environments. In this letter, we provide (1) an accessible overview of the discrete-state formulation of active inference, highlighting natural behaviors in active inference that are generally engineered in reinforcement learning, and (2) an explicit discrete-state comparison between active inference and reinforcement learning on an OpenAI gym baseline. We begin by providing a condensed overview of the active inference literature, in particular viewing the various natural behaviors of active inference agents through the lens of reinforcement learning. We show that by operating in a pure belief-based setting, active inference agents can carry out epistemic exploration, and account for uncertainty about their environment, in a Bayes-optimal fashion. Furthermore, we show that the reliance on an explicit reward signal in reinforcement learning is removed in active inference, where reward can simply be treated as another observation we have a preference over; even in the total absence of rewards, agent behaviors are learned through preference learning. We make these properties explicit by showing two scenarios in which active inference agents can infer behaviors in reward-free environments compared to both Q-learning and Bayesian model-based reinforcement learning agents and by placing zero prior preferences over rewards and learning the prior preferences over the observations corresponding to reward. We conclude by noting that this formalism can be applied to more complex settings (e.g., robotic arm movement, Atari games) if appropriate generative models can be formulated.
In short, we aim to demystify the behavior of active inference agents by presenting an accessible discrete state-space and time formulation and demonstrate these behaviors in an OpenAI gym environment, alongside reinforcement learning agents.
