paper
active
2021
135
paper:sajid-2021-active-inference-demystified

Active inference: demystified and compared

TL;DR

Active inference agents minimizing expected free energy achieve an average score of 98.90 [98.00, 99.79] in a non-stationary FrozenLake OpenAI gym environment, compared with 64.39 [60.33, 68.44] for Bayesian model-based RL with Thompson sampling and 66.08 [63.28, 68.88] for Q-learning (ε=0.1). The gap emerges specifically because active inference treats environmental change as a context-inference problem rather than a reversal-learning problem, recovering within a single episode after each goal-hole swap. The paper introduces a discrete state-space and time formulation of active inference as its primary expository instrument, decomposing expected free energy G into an epistemic value term (mutual information between outcomes and hidden states) and an extrinsic value term (expected log prior preference over outcomes). Both exploration and exploitation are therefore expressions of a single objective rather than requiring separate engineering via ε-greedy schedules or temperature hyperparameters. In reward-free conditions where Q-learning freezes into a deterministic circular policy scoring 0.00, the active inference null model (zero prior preferences) still scores 50.03 [49.70, 50.35] through pure information-seeking, and agents equipped with Dirichlet hyperpriors over outcome preferences learn stable behavioral niches, including counter-intuitive hole-seeking, without any external reward signal. The paper argues this implies that reinforcement learning is a limiting special case of active inference in which the epistemic value term is suppressed and preferences are fixed externally, and that reward-free preference learning dissolves the circularity of the reward hypothesis rather than merely circumventing it.
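A minimal sketch of the expected free energy decomposition may help make the single-objective claim concrete. It assumes a one-step toy model with a likelihood matrix A (outcomes by hidden states), a predicted state distribution qs under one policy, and log prior preferences log_C over outcomes; these names, the two-state example, and the helper expected_free_energy are illustrative assumptions, not the paper's implementation. Epistemic value is computed as the mutual information between outcomes and hidden states, and extrinsic value as the expected log preference.

```python
import numpy as np

def expected_free_energy(A, qs, log_C):
    """Return (G, epistemic, extrinsic) for one policy in a toy one-step model.

    A     : (n_outcomes, n_states) likelihood P(o | s), columns sum to 1
    qs    : (n_states,) predicted hidden states Q(s | pi)
    log_C : (n_outcomes,) log prior preferences over outcomes (assumed, unnormalized)
    """
    A = np.asarray(A, dtype=float)
    qs = np.asarray(qs, dtype=float)
    log_C = np.asarray(log_C, dtype=float)
    eps = 1e-16                               # avoids log(0); zero-probability terms contribute 0
    qo = A @ qs                               # predicted outcomes Q(o | pi)
    joint = A * qs                            # joint Q(o, s | pi)
    independent = np.outer(qo, qs)            # product of marginals Q(o | pi) Q(s | pi)
    # Epistemic value: mutual information between outcomes and hidden states.
    epistemic = np.sum(joint * (np.log(joint + eps) - np.log(independent + eps)))
    # Extrinsic value: expected log preference over predicted outcomes.
    extrinsic = qo @ log_C
    return -epistemic - extrinsic, epistemic, extrinsic

uniform = np.array([0.5, 0.5])
no_preferences = np.zeros(2)                  # the "null model": zero prior preferences

# With no preferences, a policy leading to informative outcomes has lower G
# than one leading to ambiguous outcomes, so the agent still explores.
G_informative, _, _ = expected_free_energy(np.eye(2), uniform, no_preferences)
G_ambiguous, _, _ = expected_free_energy(np.full((2, 2), 0.5), uniform, no_preferences)

# With preferences of +4 / -4 nats and no remaining uncertainty, G is purely extrinsic.
G_exploit, epi_exploit, ext_exploit = expected_free_energy(np.eye(2), [1.0, 0.0], [4.0, -4.0])
```

In this sketch the same function ranks policies in both regimes: when preferences are flat only the mutual information term differs between policies, and when beliefs are already precise only the preference term does.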

What to take away

  1. Active inference agents scored 98.90 [98.00, 99.79] in the non-stationary FrozenLake environment versus 64.39 [60.33, 68.44] for Bayesian RL with Thompson sampling and 66.08 [63.28, 68.88] for Q-learning (ε=0.1), with the gap attributable to active inference treating context switches as planning-as-inference rather than reversal learning.
  2. In the stationary FrozenLake environment all agents except the null model scored above 80 (active inference: 99.88 [99.64, 100.00]; Bayesian RL: 99.76 [99.45, 100.00]; Q-learning ε=0.1: 97.79 [97.41, 98.16]), showing that the performance advantage of active inference is specific to non-stationarity.
  3. Expected free energy G decomposes into a negative mutual information term (epistemic value, driving exploration) and a negative log preference term (extrinsic value, driving exploitation), so the exploration-exploitation trade-off emerges from a single objective without requiring a separate temperature hyperparameter.
  4. When all prior preferences are set to zero (the null active inference model), the agent scores 50.03 [49.70, 50.35] across both environments and produces non-overlapping exploratory trajectories covering all uncharted states, a behavior that cannot be motivated within standard RL because there is no reward signal to maximize.
  5. Reward shaping experiments across 100 agents over 100 episodes show that Bayesian RL and active inference produce nearly identical average scores when prior preferences are expressed as log-probability equivalents of reward functions, demonstrating that RL is a limiting case of active inference when epistemic value is suppressed.
  6. Active inference agents equipped with Dirichlet hyperpriors over outcome preferences (initialized flat at concentration parameter 1) learn stable hole-seeking or Frisbee-seeking niches within 5 episodes by accumulating experience as pseudo-counts, converging on preferences that reflect the outcomes the agent recurrently encounters rather than any externally specified reward.
  7. The Bayesian RL agent uses a Dyna-style architecture with k=50 Thompson samples from Beta(α,β) priors over transition and reward models, yet fails to recover quickly in non-stationary conditions because accumulated pseudo-counts make prior reversal increasingly costly with each context switch.
  8. A replicable methodological choice is policy pruning: any policy whose log evidence is 20 times less than that of the most probable policy is discarded (ζ=20). The scheme uses a gradient descent step size of 4 and is run for 200 trials of 500 episodes each to obtain stable 95% confidence intervals.
  9. An open question the authors identify is whether more complex Bayesian RL agents with explicit latent context variables, rather than implicit context encoding via reward location, would match active inference's non-stationary performance.
  10. The paper predicts that if an appropriate generative model can be formulated, the discrete state-space active inference scheme demonstrated on FrozenLake extends directly to continuous-state domains including robotic arm movement and Atari games, via mixed discrete-continuous or generalized-coordinates-of-motion models.
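The preference-learning takeaway (flat Dirichlet hyperpriors updated by pseudo-counts) can be illustrated with a short sketch. The outcome labels, the episode sequence, and the helper names accumulate and preferences are hypothetical, and the sketch uses the Dirichlet mean as a simplification rather than the paper's exact update.

```python
import numpy as np

# Hypothetical outcome labels for the score modality; not taken from the paper.
OUTCOMES = ["neutral", "hole", "frisbee"]

def accumulate(counts, observed):
    """Add one pseudo-count for each observed outcome (Dirichlet posterior update)."""
    updated = np.array(counts, dtype=float)
    for outcome in observed:
        updated[OUTCOMES.index(outcome)] += 1.0
    return updated

def preferences(counts):
    """Dirichlet mean: the learned preference distribution over outcomes."""
    return counts / counts.sum()

# Flat hyperprior: concentration parameter 1 for every outcome.
counts = np.ones(len(OUTCOMES))

# Assumed terminal outcomes of five episodes for an agent that mostly ends in the hole.
terminal_outcomes = ["hole", "hole", "frisbee", "hole", "hole"]
for outcome in terminal_outcomes:
    counts = accumulate(counts, [outcome])

learned = preferences(counts)
preferred = OUTCOMES[int(np.argmax(learned))]
```

Because the counts only record what the agent has encountered, an agent that keeps ending in the hole comes to prefer the hole, which is the hole-seeking niche described above.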

Peer brief — for seminar discussion

Sajid et al. (2021, Neural Computation) provide a tutorial-style derivation of the discrete state-space active inference formulation and pit it against two RL baselines — Q-learning with ε-greedy exploration and a Dyna-style Bayesian RL agent using Thompson sampling — on a modified 3×3 FrozenLake OpenAI gym environment run for 200 trials of 500 episodes each. The generative model encodes 18 hidden states (9 locations × 2 contexts), 4 action states, and 2 outcome modalities (grid position and score), with utility defined as ±4 nats for rewarding and unrewarding outcomes. The key expository instrument is the expected free energy decomposition G = –epistemic value – extrinsic value, which naturally yields exploration-exploitation arbitration without a separate temperature hyperparameter; the paper calls the resulting behavior 'Bayes-optimal' in the sense of minimizing free energy over distal time horizons rather than maximizing Bellman-optimal returns. The load-bearing empirical finding is that in a non-stationary environment (goal-hole positions swapped at episodes 21, 121, 141, 251, and 451), active inference recovers to near-ceiling performance within a single episode after each switch, scoring 98.90 [98.00, 99.79], while Bayesian RL collapses to 64.39 [60.33, 68.44] and Q-learning (ε=0.1) to 66.08 [63.28, 68.88]. The explanation is structural: active inference frames context changes as inference over latent context variables, whereas Bayesian RL with Beta(α,β) pseudo-counts must reverse accumulated evidence, a process that grows more expensive with each successive switch. In reward-free conditions, Q-learning produces a zero-score deterministic circular policy, while the active inference null model (zero prior preferences) scores 50.03 [49.70, 50.35] through pure epistemic foraging. 
The paper further demonstrates that agents equipped with Dirichlet hyperpriors (concentration parameter 1, initialized flat) over outcome preferences learn stable behavioral niches, including hole-seeking, within approximately 5 episodes and without any external reward signal. This supports the normative reading that an agent's preferences are a description of what kind of creature it is, not a signal from the environment. What this implies, the authors argue, is that standard RL is a limiting special case of active inference: suppress the epistemic value term and fix preferences externally, and you recover reward maximization. The alternative method the paper could have used for behavioral comparison, but explicitly defers, is a context-aware Bayesian RL agent with explicit latent context variables (e.g., VariBAD-style meta-RL), which might close the non-stationary performance gap. The contestable point a critical reader should press is the FrozenLake environment's very low dimensionality and the fact that the active inference generative model is hand-crafted with precise prior concentration parameters of 100 and explicit context factors that directly encode the non-stationarity structure. The RL baselines are not given equivalent structural priors about context: Bayesian RL encodes context implicitly via the reward location prior, and Q-learning has no context model at all. The non-stationary advantage may therefore reflect a modeling asymmetry rather than a fundamental algorithmic superiority of free-energy minimization, a confound the paper acknowledges only briefly before deferring it as an open research question.
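The structural explanation in this brief (inference over a latent context versus reversal of accumulated pseudo-counts) can be made concrete with a toy comparison. This is a sketch under stated assumptions, not a reproduction of either agent: the goal is taken to sit at location "A" in one context and "B" in the other, and the switch probability p_switch and the likelihood reliability are hypothetical values chosen for illustration.

```python
import numpy as np

def episodes_to_reverse_beta(n_before):
    """Beta(1, 1) belief that the goal is at A, after n_before confirming episodes.

    Counts how many contradicting episodes are needed before the posterior
    mean drops below 0.5.
    """
    alpha, beta = 1.0 + n_before, 1.0
    n = 0
    while alpha / (alpha + beta) >= 0.5:
        beta += 1.0                        # one more "goal was not at A" observation
        n += 1
    return n

def episodes_to_reverse_context(n_before, p_switch=0.1, reliability=0.99):
    """Categorical belief over two contexts with an assumed switch probability."""
    transition = np.array([[1 - p_switch, p_switch],
                           [p_switch, 1 - p_switch]])
    likelihood = {"A": np.array([reliability, 1 - reliability]),   # P(goal at A | context)
                  "B": np.array([1 - reliability, reliability])}   # P(goal at B | context)

    def update(belief, observation):
        prior = transition @ belief        # allow for a possible context switch
        posterior = likelihood[observation] * prior
        return posterior / posterior.sum()

    belief = np.array([0.5, 0.5])
    for _ in range(n_before):
        belief = update(belief, "A")
    n = 0
    while belief[0] >= 0.5:
        belief = update(belief, "B")
        n += 1
    return n

beta_after_20 = episodes_to_reverse_beta(20)
beta_after_100 = episodes_to_reverse_beta(100)
context_after_20 = episodes_to_reverse_context(20)
context_after_100 = episodes_to_reverse_context(100)
```

Under these assumptions the pseudo-count learner needs more contradicting episodes the longer it has lived in one context, while the context-inference learner flips after a single observation regardless of history. The sketch also illustrates the critical point above: that advantage comes from building the two-context structure into the model.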

Methods (3)

  • Dyna-style planning
    A model-based RL architecture that interleaves direct policy learning with hypothetical roll-outs from a learned model.
  • Monte-Carlo reinforcement learning
    Reinforcement learning methods that update parameters at the end of an episode based on sampled returns.
  • Value Iteration
    A dynamic programming method for computing optimal value functions and policies in known MDPs.
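As an illustration of the last entry, here is a minimal value iteration sketch on a deterministic 3×3 grid. The layout (goal at state 8, hole at state 6), the rewards of +1 and -1, and the discount factor are assumptions for illustration and are not taken from the paper's FrozenLake configuration.

```python
import numpy as np

N = 3                                        # 3x3 grid, states numbered row by row from 0 to 8
GOAL, HOLE = 8, 6                            # hypothetical layout, not the paper's
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]   # left, down, right, up as (d_row, d_col)

def step(state, action):
    """Deterministic transition; moves off the grid leave the agent in place."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    nxt = row * N + col
    reward = 1.0 if nxt == GOAL else (-1.0 if nxt == HOLE else 0.0)
    return nxt, reward

def q_values(state, V, gamma):
    """One-step lookahead value of each action from the given state."""
    values = []
    for action in range(len(MOVES)):
        nxt, reward = step(state, action)
        values.append(reward + gamma * V[nxt])
    return np.array(values)

def value_iteration(gamma=0.9, tol=1e-10):
    """Repeat Bellman optimality backups until the value function stops changing."""
    V = np.zeros(N * N)
    while True:
        V_new = np.zeros(N * N)              # terminal states keep value 0
        for s in range(N * N):
            if s not in (GOAL, HOLE):
                V_new[s] = q_values(s, V, gamma).max()
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    policy = {s: int(np.argmax(q_values(s, V, gamma)))
              for s in range(N * N) if s not in (GOAL, HOLE)}
    return V, policy

V, policy = value_iteration()
```

The resulting greedy policy steers away from the hole: from state 3, directly above it, the best action is to move right rather than down.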

Frameworks (1)

  • Model Predictive Control
    A control approach that uses a model to predict future states and optimizes control actions over a receding horizon.

Original abstract

Active inference is a first principle account of how autonomous agents operate in dynamic, nonstationary environments. This problem is also considered in reinforcement learning, but limited work exists on comparing the two approaches on the same discrete-state environments. In this letter, we provide (1) an accessible overview of the discrete-state formulation of active inference, highlighting natural behaviors in active inference that are generally engineered in reinforcement learning, and (2) an explicit discrete-state comparison between active inference and reinforcement learning on an OpenAI gym baseline. We begin by providing a condensed overview of the active inference literature, in particular viewing the various natural behaviors of active inference agents through the lens of reinforcement learning. We show that by operating in a pure belief-based setting, active inference agents can carry out epistemic exploration, and account for uncertainty about their environment, in a Bayes-optimal fashion. Furthermore, we show that the reliance on an explicit reward signal in reinforcement learning is removed in active inference, where reward can simply be treated as another observation we have a preference over; even in the total absence of rewards, agent behaviors are learned through preference learning. We make these properties explicit by showing two scenarios in which active inference agents can infer behaviors in reward-free environments compared to both Q-learning and Bayesian model-based reinforcement learning agents and by placing zero prior preferences over rewards and learning the prior preferences over the observations corresponding to reward. We conclude by noting that this formalism can be applied to more complex settings (e.g., robotic arm movement, Atari games) if appropriate generative models can be formulated. In short, we aim to demystify the behavior of active inference agents by presenting an accessible discrete state-space and time formulation and demonstrate these behaviors in an OpenAI gym environment, alongside reinforcement learning agents.

Cross-corpus bridges (1)

same_concept_as · Nomic cosine

External markdown files that talk about the same concept as this entity.

  • aboutblank_kb
    Active Inference · frameworks/active-inference.md · 0.868