Expected Free Energy: The Objective Function That Plans
Series: Active Inference Applied | Part: 3 of 10
Planning is not optional. Every system that persists across time must select actions that keep it viable. Bacteria swim toward nutrients. Predators stalk prey. You decide whether to read this sentence or scroll past it. These are not merely behavioral outputs—they are solutions to an optimization problem. The question is: what exactly is being optimized?
In classical reinforcement learning, the answer is straightforward: maximize expected reward. But this creates an immediate problem. How does the system know what's rewarding before it's experienced it? How does exploration happen at all? And why would a system ever choose an action that reduces immediate reward for the sake of information gain?
Expected free energy is the answer active inference offers. It's not just a reward function dressed up in Bayesian clothing. It's an objective that unifies exploitation and exploration, pragmatic action and epistemic curiosity, planning and learning—all in a single mathematical quantity that systems naturally minimize.
This is not abstract philosophy. This is the function that makes planning work.
What Makes a Good Objective Function?
Before we dive into expected free energy itself, we need to understand what we're asking an objective function to do. In decision theory, an objective function assigns a value to each possible action. The agent selects the action that optimizes this value—usually by maximizing expected reward or minimizing expected cost.
Classical approaches split this into two separate problems:
- Exploitation: Choose actions that maximize expected reward given current knowledge
- Exploration: Choose actions that reduce uncertainty about the world
This split creates what's known as the exploration-exploitation dilemma. Should the agent do what it knows works (exploit) or try something new to learn more (explore)? Various heuristics exist—epsilon-greedy policies, upper confidence bounds, Thompson sampling—but these are engineering workarounds, not principles.
Active inference dissolves this dilemma by recognizing that both exploitation and exploration serve the same underlying imperative: minimizing surprise over time. Expected free energy captures this formally.
An agent minimizing expected free energy doesn't need separate mechanisms for curiosity and goal-seeking. Both emerge from the same drive: keeping sensory states predictable and consistent with the agent's generative model.
Expected Free Energy: The Definition
Expected free energy (EFE) is the free energy expected under a policy—a sequence of actions—evaluated from the agent's current beliefs.
Formally, for a policy $\pi$ and future time $\tau$, expected free energy is:
$$
G(\pi, \tau) = \mathbb{E}_{Q(o_\tau, s_\tau | \pi)} \left[ \log Q(s_\tau | \pi) - \log P(o_\tau, s_\tau | C) \right]
$$
Where:
- $Q(o_\tau, s_\tau | \pi)$ is the agent's predicted distribution over observations $o$ and states $s$ under policy $\pi$
- $P(o_\tau, s_\tau | C)$ is the generative model's joint distribution, conditioned on the agent's preferences $C$
This looks dense. Let's unpack it.
The first term, $\log Q(s_\tau | \pi)$, is the log probability of the states the agent predicts under policy $\pi$; averaged under $Q$, it is the negative entropy of the predicted state distribution, a measure of how concentrated the agent's beliefs about the future are. On its own it is hard to interpret, but combined with the state-dependent part of the second term it becomes the epistemic component of the decomposition below: a preference for policies whose observations are expected to resolve uncertainty about hidden states.
The second term, $-\log P(o_\tau, s_\tau | C)$, is the negative log probability of observations and states under the generative model. This encodes the agent's preferences. If the agent prefers certain observations (encoded in $C$), minimizing this term means selecting policies that make those preferred observations likely.
Together, these terms balance two objectives:
- Minimize uncertainty about future states (epistemic value)
- Maximize alignment with preferences (pragmatic value)
This is not a hand-tuned trade-off. It's a natural consequence of minimizing variational free energy in the future.
The Two Components: Epistemic and Pragmatic Value
Expected free energy can be decomposed into two interpretable parts:
$$
G(\pi, \tau) = \underbrace{\mathbb{E}_{Q} \left[ \log Q(s_\tau | \pi) - \log Q(s_\tau | o_\tau, \pi) \right]}_{\text{Epistemic value}} + \underbrace{\mathbb{E}_{Q} \left[ \log Q(o_\tau | \pi) - \log P(o_\tau | C) \right]}_{\text{Pragmatic value}}
$$
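Where this split comes from, as a sketch: derivations of EFE typically approximate the model's joint by $P(o_\tau, s_\tau | C) \approx Q(s_\tau | o_\tau, \pi) \, P(o_\tau | C)$, treating the model's posterior over states as the agent's and letting preferences act only on observations. Substituting into the definition gives

$$
G(\pi, \tau) \approx \mathbb{E}_{Q} \left[ \log Q(s_\tau | \pi) - \log Q(s_\tau | o_\tau, \pi) \right] - \mathbb{E}_{Q} \left[ \log P(o_\tau | C) \right]
$$

which is negative expected information gain plus an expected preference cost. The pragmatic term in the decomposition above differs from $-\mathbb{E}_{Q}[\log P(o_\tau | C)]$ by the entropy of predicted observations, which is what turns it into a KL divergence; published decompositions of EFE differ by precisely such entropy terms (see Millidge et al., 2021, under Further Reading).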
Epistemic Value: The Drive to Learn
The epistemic component tracks the expected information gain about hidden states: how much observing the outcome of a policy is expected to reduce uncertainty.
Mathematically, the term above is the negative of the expected KL divergence between the posterior over states, $Q(s_\tau | o_\tau, \pi)$, and the prior, $Q(s_\tau | \pi)$.
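Written out, using the notation already in play:

$$
\mathbb{E}_{Q} \left[ \log Q(s_\tau | \pi) - \log Q(s_\tau | o_\tau, \pi) \right] = - \, \mathbb{E}_{Q(o_\tau | \pi)} \Big[ D_{\mathrm{KL}} \big[ Q(s_\tau | o_\tau, \pi) \,\|\, Q(s_\tau | \pi) \big] \Big]
$$

Minimizing expected free energy therefore maximizes expected information gain: policies whose observations are expected to move beliefs about hidden states the most make this term the most negative.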
In plain language: Will this action teach me something?
Policies that reduce ambiguity about the world have high epistemic value. This is intrinsic motivation formalized. The agent doesn't need a separate curiosity reward—it's baked into the objective function.
This is why active inference agents naturally explore. If the generative model is uncertain about how observations map to states (high ambiguity), policies that disambiguate this mapping have low expected free energy. The agent is drawn toward informative actions not because of an ad hoc bonus, but because resolving uncertainty lowers future free energy.
Pragmatic Value: The Drive to Succeed
The pragmatic component is the expected divergence between predicted and preferred observations. It measures how closely the policy's anticipated outcomes align with what the agent "wants."
This is where goals live. If the agent has a prior preference $P(o | C)$ for certain observations—low prediction error states, homeostatic ranges, goal configurations—then policies that make those observations likely have low pragmatic cost.
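In the decomposition above, this term is exactly the divergence between predicted and preferred observations:

$$
\mathbb{E}_{Q} \left[ \log Q(o_\tau | \pi) - \log P(o_\tau | C) \right] = D_{\mathrm{KL}} \big[ Q(o_\tau | \pi) \,\|\, P(o_\tau | C) \big]
$$

This quantity is often called risk: it is zero when the policy is expected to deliver exactly the observations the agent prefers, and grows as predictions and preferences come apart.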
In plain language: Will this action get me what I want?
Classical reward-based agents only have this component. Active inference agents have it plus epistemic value. This is not a trivial addition. It fundamentally changes what planning looks like.
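To make the two components concrete, here is a minimal numpy sketch of the decomposition above for a discrete model. It assumes the usual discrete-state ingredients (a likelihood matrix $A$ with $A[o, s] = P(o | s)$, a predicted state distribution under the policy, and log preferences over observations); the variable names, toy numbers, and small epsilon guards are illustrative choices, not a reference implementation.

```python
import numpy as np

def efe_components(A, q_s, log_C):
    """Epistemic and pragmatic components of expected free energy for one
    policy at a single future timestep, for a discrete model.

    A     : (n_obs, n_states) likelihood matrix, A[o, s] = P(o | s)
    q_s   : (n_states,) predicted state distribution Q(s_tau | pi)
    log_C : (n_obs,) log preferences over observations, log P(o | C)
    """
    q_o = A @ q_s                                  # predicted observations Q(o_tau | pi)

    # Posterior over states for each possible observation: Q(s | o, pi) ∝ P(o | s) Q(s)
    joint = A * q_s                                # joint[o, s] = P(o | s) Q(s)
    post = joint / (q_o[:, None] + 1e-16)

    # Epistemic term: E_Q[log Q(s|pi) - log Q(s|o,pi)]
    # = -E_o[ KL[Q(s|o,pi) || Q(s|pi)] ]  (negative expected information gain)
    info_gain = np.sum(q_o * np.sum(
        post * (np.log(post + 1e-16) - np.log(q_s + 1e-16)), axis=1))
    epistemic = -info_gain

    # Pragmatic term: E_Q[log Q(o|pi) - log P(o|C)] = KL[Q(o|pi) || P(o|C)]
    pragmatic = np.sum(q_o * (np.log(q_o + 1e-16) - log_C))

    return epistemic, pragmatic

# Toy numbers: 3 hidden states, 3 observations, a preference for observation 2
A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])                # columns are P(o | s)
q_s = np.array([0.5, 0.3, 0.2])                # predicted states under some policy
log_C = np.log(np.array([0.1, 0.1, 0.8]))      # log preferences over observations

epistemic, pragmatic = efe_components(A, q_s, log_C)
print(f"epistemic={epistemic:.3f}  pragmatic={pragmatic:.3f}  G={epistemic + pragmatic:.3f}")
```

Summing `epistemic + pragmatic` over a policy's time horizon gives the quantity the agent minimizes; changing `log_C` or the sharpness of `A` shifts the relative size of the two terms.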
Why This Matters: Balancing Exploitation and Exploration
Consider a robot trying to navigate a maze.
Under a classical reward-maximizing policy:
- The robot receives positive reward for reaching the goal
- It receives zero reward for everything else
- Early in learning, random exploration eventually stumbles on the goal
- Once the goal is found, the robot exploits the known path
- The robot has no reason to explore alternative routes unless forced by an epsilon-greedy policy
Under expected free energy minimization:
- The robot has a preference for observations consistent with reaching the goal (pragmatic value)
- The robot also has a preference for reducing uncertainty about the maze structure (epistemic value)
- If two paths both lead to the goal, but one passes through unexplored territory, the robot may prefer the unexplored path because observing new areas reduces model uncertainty
- The balance between epistemic and pragmatic value is not hand-tuned—it emerges from the precision (confidence) of the agent's beliefs
This is not just more elegant. It's more sample-efficient. Active inference agents explore where it matters—in regions where the model is uncertain and where resolving uncertainty impacts future planning.
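As a toy version of the two-corridor intuition, here is a sketch (numbers and names invented for illustration) in which the two corridors are represented by two likelihood mappings that predict exactly the same observations on average, so their pragmatic value is identical, but the unexplored corridor's observations are more informative about hidden states. The full maze story also involves uncertainty about the model's parameters, which discrete implementations typically score with an additional "novelty" information-gain term; the state-level effect below is the simpler version of the same idea.

```python
import numpy as np

def efe(A, q_s, log_C):
    # Same computation as efe_components above, condensed into a single number.
    q_o = A @ q_s
    post = (A * q_s) / (q_o[:, None] + 1e-16)
    info_gain = np.sum(q_o * np.sum(
        post * (np.log(post + 1e-16) - np.log(q_s + 1e-16)), axis=1))
    risk = np.sum(q_o * (np.log(q_o + 1e-16) - log_C))
    return risk - info_gain

log_C = np.log(np.array([0.05, 0.05, 0.9]))    # strong preference for the "goal" observation

# Explored corridor: observations barely distinguish hidden states 0 and 1.
A_known = np.array([[0.45, 0.45, 0.0],
                    [0.45, 0.45, 0.0],
                    [0.10, 0.10, 1.0]])
# Unexplored corridor: observations cleanly separate states 0 and 1.
A_new = np.array([[0.90, 0.00, 0.0],
                  [0.00, 0.90, 0.0],
                  [0.10, 0.10, 1.0]])

q_s = np.array([0.25, 0.25, 0.5])              # same predicted chance of the goal state (state 2)

print("explored corridor   G =", round(efe(A_known, q_s, log_C), 3))
print("unexplored corridor G =", round(efe(A_new, q_s, log_C), 3))
# Both corridors predict the same observations on average (identical pragmatic value),
# but the unexplored one carries more expected information gain, so its G is lower.
```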
The Role of Precision
One crucial detail: expected free energy is weighted by precision—the inverse variance of beliefs. High-precision beliefs mean the agent is confident. Low-precision beliefs mean the agent is uncertain.
This precision modulates the balance between epistemic and pragmatic components. When the agent is uncertain (low precision), epistemic value dominates—explore to learn. When the agent is confident (high precision), pragmatic value dominates—exploit what you know.
Precision plays the role of an inverse temperature in Bayesian decision theory, but it's not a hyperparameter you tune by hand. It's estimated from the data via precision-weighted prediction errors. The agent learns when to explore and when to exploit based on its own uncertainty.
Dopamine is hypothesized to encode precision in biological systems (Friston et al., 2014). When dopamine is high, the agent trusts its predictions and acts decisively. When dopamine is low, the agent is uncertain and explores. The neural mapping is still being tested, but it lines up directly with the computational role precision plays in active inference.
Planning as Policy Selection
In active inference, planning is selecting the policy $\pi$ that minimizes expected free energy:
$$
\pi^* = \arg\min_\pi \sum_{\tau} G(\pi, \tau)
$$
In practice, the agent doesn't pick the single best policy deterministically. It maintains a softmax distribution over policies, $Q(\pi) \propto \exp(-\gamma \sum_\tau G(\pi, \tau))$, where $\gamma$ is the precision of policy selection, and samples from it: policies with lower EFE are exponentially more likely to be selected, and the arg-min above is the high-precision limit.
This probabilistic policy selection has interesting consequences:
- The agent can exhibit stochastic exploration without epsilon-greedy hacks
- The agent can shift between policies smoothly as beliefs update
- The agent naturally avoids over-committing to plans when uncertainty is high
Importantly, this is not trajectory optimization in the classical sense. The agent isn't searching for a single optimal action sequence. It's maintaining a distribution over policies and selecting actions that minimize expected free energy on average.
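Concretely, the softmax is a few lines of code. A minimal sketch (the EFE values and the random seed are made up for illustration; `gamma` is the precision of policy selection discussed in the previous section):

```python
import numpy as np

def policy_posterior(G, gamma=1.0):
    """Softmax over policies given their (summed) expected free energies G.
    gamma is the precision of policy selection: higher gamma concentrates
    probability mass on low-EFE policies."""
    logits = -gamma * np.asarray(G, dtype=float)
    logits -= logits.max()                      # numerical stability
    q_pi = np.exp(logits)
    return q_pi / q_pi.sum()

G = [2.1, 1.4, 1.6]                             # made-up EFE values for three policies
print(policy_posterior(G, gamma=1.0))           # graded preference for the second policy
print(policy_posterior(G, gamma=8.0))           # near-deterministic at high precision

# Action selection: sample a policy from Q(pi), then execute its first action.
rng = np.random.default_rng(0)
chosen = rng.choice(len(G), p=policy_posterior(G, gamma=1.0))
```

At low `gamma` the distribution stays broad and the agent keeps entertaining alternative policies; at high `gamma` it collapses onto the best candidate, recovering the arg-min above.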
Expected Free Energy in the Wild: Empirical Examples
This is not just theory. Expected free energy has been implemented in working systems across domains.
Foraging Behavior
Simulated agents using EFE minimization reproduce classic foraging patterns: optimal patch-leaving times, Lévy flight search strategies, and risk-sensitive behavior (Parr & Friston, 2019). These emerge not from hand-coded heuristics but from balancing epistemic and pragmatic value.
When resources are uncertain (low precision), the agent explores widely. When resources are known (high precision), the agent exploits locally. This matches animal foraging behavior across species.
Saccadic Eye Movements
Human eye movements have been modeled as minimizing expected free energy. When viewing a scene, saccades (rapid eye movements) preferentially land on regions expected to maximize information gain: edges, high-contrast areas, salient objects (Friston et al., 2012).
This is not "attention" as a separate mechanism. It's expected free energy doing exactly what it's supposed to: directing sensors toward observations that reduce uncertainty.
Active Learning
In machine learning, active learning algorithms select which data points to label next. EFE-style active learning, which selects queries by expected information gain alongside task preferences, has been argued to improve on uncertainty sampling and query-by-committee, because it balances reducing model uncertainty with improving task performance (Baldi & Itti, 2010; Schmidhuber, 2010).
Classical active learning separates exploration (uncertainty sampling) from exploitation (label points near decision boundaries). EFE unifies them.
How This Differs from Reward-Based Planning
The difference between expected free energy and expected reward is not cosmetic. They produce qualitatively different behaviors.
| Classical RL (Expected Reward) | Active Inference (Expected Free Energy) |
|---|---|
| Maximize future reward | Minimize future surprise |
| Exploration is a problem | Exploration is intrinsic |
| Epistemic and pragmatic drives separate | Epistemic and pragmatic unified |
| Curiosity is a hack (intrinsic reward) | Curiosity is structural (information gain) |
| Precision is a hyperparameter | Precision is learned |
| Policy is deterministic (greedy) or stochastic (epsilon) | Policy is probabilistic (softmax over EFE) |
This isn't just cleaner theory. It's more biologically plausible and more sample-efficient in sparse reward environments.
The Connection to Coherence
In the AToM framework, coherence is the property of systems maintaining their boundaries over time. Expected free energy is the local mechanism that implements coherence at the decision-making scale.
By minimizing expected free energy, an agent keeps its sensory states within predictable bounds—what Friston calls the agent's characteristic states. These are the states the agent expects to occupy, encoded in its generative model.
Policies that take the agent far from characteristic states (high surprise) have high expected free energy. Policies that keep the agent within familiar, viable states have low expected free energy.
This is not homeostasis as passive regulation. This is homeostasis as active selection—choosing actions that keep the system coherent. And because the epistemic component is present, this includes learning better models of what states are characteristic, not just slavishly maintaining current set points.
Meaning, in AToM terms, is $M = C/T$—coherence sustained over time. Expected free energy is how agents implement this at the algorithmic level. They select policies that maximize the expected coherence of future states, weighted by their confidence in those predictions.
Practical Implementation: Computing EFE
How do you actually compute expected free energy in code?
The straightforward approach:
- For each candidate policy $\pi$, simulate forward in time using the generative model
- Compute the distribution $Q(o_\tau, s_\tau | \pi)$ for each future timestep $\tau$
- Evaluate epistemic and pragmatic components
- Sum over time horizon
- Select policy via softmax over negative EFE
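Here is a compact sketch of those steps for a discrete model, under the same assumptions as the earlier snippets (a likelihood matrix `A`, an action-conditioned transition tensor `B`, and log preferences `log_C`); the shapes and names are illustrative, not any particular library's API.

```python
import numpy as np

def policy_efe(policy, q_s, A, B, log_C):
    """Sum expected free energy over a policy's horizon by rolling the
    generative model forward, following the steps listed above.

    policy : sequence of action indices
    q_s    : (n_states,) current belief over hidden states
    A      : (n_obs, n_states) likelihood, A[o, s] = P(o | s)
    B      : (n_actions, n_states, n_states) transitions, B[a, s_next, s] = P(s_next | s, a)
    log_C  : (n_obs,) log preferences over observations
    """
    G = 0.0
    for a in policy:
        q_s = B[a] @ q_s                                   # predicted states at the next step
        q_o = A @ q_s                                      # predicted observations
        post = (A * q_s) / (q_o[:, None] + 1e-16)          # Q(s | o) for each possible o
        info_gain = np.sum(q_o * np.sum(
            post * (np.log(post + 1e-16) - np.log(q_s + 1e-16)), axis=1))
        risk = np.sum(q_o * (np.log(q_o + 1e-16) - log_C))
        G += risk - info_gain                              # pragmatic + epistemic, as above
    return G

def select_policy(policies, q_s, A, B, log_C, gamma=1.0):
    """Evaluate every candidate policy and return a softmax distribution over them."""
    G = np.array([policy_efe(p, q_s, A, B, log_C) for p in policies])
    logits = -gamma * G
    logits -= logits.max()
    q_pi = np.exp(logits) / np.exp(logits).sum()
    return q_pi, G
```

Libraries such as pymdp wrap roughly this loop for discrete generative models, with additional terms for learning the model's parameters.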
This is tractable for discrete state spaces and short horizons. For continuous states or long horizons, approximations are necessary:
- Monte Carlo sampling of future trajectories
- Variational approximations to the posterior over states
- Amortized inference using neural networks to predict EFE directly
PyMDP implements discrete-state EFE minimization with message passing. RxInfer.jl extends this to continuous states using reactive message passing on factor graphs.
The next article in this series will cover message passing and belief propagation—the computational machinery that makes EFE minimization efficient.
Why "Free Energy" and Not Just "Expected Utility"?
You might ask: why the terminology baggage? Why not just call this expected utility and be done with it?
Because expected free energy is not derived from utility theory. It's derived from variational inference. The agent isn't maximizing preferences—it's minimizing the divergence between its predictions and its model.
This has deep consequences:
- Preferences are not primitive—they're encoded as priors over observations
- The objective is not arbitrary—it's the unique functional that makes self-evidencing systems persist
- The mathematics connects to thermodynamics, information theory, and statistical physics in ways utility theory does not
Calling it "expected free energy" signals that this is part of a larger framework: the Free Energy Principle, which claims that all self-organizing systems minimize variational free energy. Expected free energy is the forward-looking version—what happens when you apply the same principle to future observations, not just current ones.
Challenges and Open Questions
Expected free energy is elegant, but implementing it in complex domains is non-trivial.
Computational cost: Evaluating EFE requires simulating all candidate policies forward in time. For large action spaces or long horizons, this is intractable without approximations.
Generative model specification: The agent needs a generative model $P(o, s)$. If this model is wrong or incomplete, EFE minimization can lead to suboptimal behavior. Learning accurate generative models from data is itself a hard problem.
Preference specification: What observations should the agent prefer? In some domains (homeostasis, survival), this is clear. In others (open-ended tasks, creative exploration), it's less obvious. Misspecified preferences lead to misaligned behavior.
Scaling to high dimensions: Most successful implementations use discrete state spaces or low-dimensional continuous spaces. Scaling EFE to pixel-level observations or large action spaces remains an active research area.
These are not show-stoppers. They're engineering challenges with active research programs addressing them.
Further Reading
- Friston, K., et al. (2015). "Active inference and epistemic value." Cognitive Neuroscience, 6(4), 187-214.
- Parr, T., & Friston, K. J. (2019). "Generalised free energy and active inference." Biological Cybernetics, 113(5-6), 495-513.
- Millidge, B., et al. (2021). "Whence the expected free energy?" Neural Computation, 33(2), 447-482.
- Da Costa, L., et al. (2020). "Active inference on discrete state-spaces: A synthesis." Journal of Mathematical Psychology, 99, 102447.
- Sajid, N., et al. (2021). "Active inference: Demystified and compared." Neural Computation, 33(3), 674-712.
This is Part 3 of the Active Inference Applied series, exploring how active inference moves from theory to working systems.
Previous: The Generative Model: Building World Models for Active Inference Agents
Next: Message Passing and Belief Propagation in Active Inference