Policy Gradient Methods

What We Will Build Toward
When we first learn reinforcement learning, it is tempting to imagine that the hardest part is estimating a good value function and then simply acting greedily with respect to it. That works beautifully in discrete problems with a small action set: learn $Q(s, a)$, choose $a = \arg\max_{a'} Q(s, a')$, and you have a policy. The logic is elegant because it reduces control to prediction.
But that elegance hides an important assumption: that the action space is easy to search, that one action is enough to represent the right behavior, and that the agent can truly observe the relevant state. Once any of those assumptions breaks, a value function alone becomes a brittle intermediary rather than a solution.
The greedy policy induced by a value function is typically written as
$$
\pi(s) \;=\; \arg\max_{a \in \mathcal{A}} Q(s, a).
$$
This expression is harmless when $\mathcal{A}$ is finite and small. In that case, the argmax is just a comparison over a short list. But if $\mathcal{A}$ is continuous (for example $\mathcal{A} \subseteq \mathbb{R}^n$), the expression quietly becomes a nested optimisation problem: every time the agent wants to act, it must solve a continuous maximisation over actions. For high-dimensional control, that is not a small implementation detail — it is the central computational bottleneck.
That is the first failure mode: continuous actions. Here, the issue is not that $Q$-learning is conceptually wrong, but that the greedy extraction step is no longer cheap. We are no longer choosing among a handful of actions; we are searching over an uncountable space. Unless the critic has a special structure, $\arg\max_a Q(s, a)$ is itself a hard optimisation problem, and we have merely moved the difficulty from policy learning into action selection.
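To make that concrete, here is a minimal sketch of greedy action extraction in a continuous action space, using a hypothetical toy quadratic critic (assumed here purely for illustration, not part of the lecture). The point is that $\arg\max_a Q(s, a)$ becomes an inner optimisation loop that must be run every single time the agent acts.

```python
import numpy as np

def q_value(state, action):
    # Hypothetical critic: a smooth toy function of (state, action).
    # In practice this would be a learned network; a quadratic keeps the sketch self-contained.
    return -np.sum((action - np.tanh(state[:2])) ** 2)

def greedy_action(state, action_dim=2, iters=200, lr=0.1):
    """Approximate argmax_a Q(state, a) by gradient ascent on the action itself."""
    a = np.zeros(action_dim)
    for _ in range(iters):
        # Finite-difference gradient of Q with respect to the action.
        grad = np.array([
            (q_value(state, a + 1e-4 * e) - q_value(state, a - 1e-4 * e)) / 2e-4
            for e in np.eye(action_dim)
        ])
        a = a + lr * grad          # one step of the inner maximisation
    return a

state = np.array([0.3, -1.2, 0.7])
print("greedy action:", greedy_action(state))   # this whole loop is paid at every time step
```

Even in this toy setting, acting greedily means running an optimiser per decision; with a deep critic and a high-dimensional action space, that inner loop is exactly the bottleneck described above.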
The second failure mode is more subtle: perceptual aliasing. Imagine two different underlying situations that produce the same observation. From the agent’s perspective, the states are indistinguishable, but the correct behavior may differ. A deterministic greedy policy must commit to one action per observation, yet the best response may require a mixture over actions. In that case, the policy needs to encode uncertainty or ambiguity directly:
$$
\pi_\theta(a \mid s) \;=\; \mathbb{P}\big[A_t = a \mid S_t = s;\ \theta\big].
$$
A stochastic policy can represent “sometimes go left, sometimes go right” in a principled way; a pure argmax policy cannot.
This distinction matters because the right response is not always “pick the single best action.” When two hidden states collapse to the same observation, the agent may need to randomize to average over incompatible optimal actions. In other words, the policy is not just a decision rule; it is part of the model of how the agent resolves ambiguity.
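A tiny simulation makes the aliasing argument concrete. The corridor below is a hypothetical example (the layout and reward structure are assumed for illustration): two cells emit the same observation, a deterministic observation-conditioned policy gets trapped from one of the two starting positions, while a policy that randomizes in the aliased observation reaches the goal quickly from both.

```python
import random

# Corridor: 0 -- 1 -- 2(goal) -- 3 -- 4.  Cells 1 and 3 emit the SAME observation,
# so an observation-conditioned policy cannot tell them apart.
GOAL = 2

def observe(state):
    if state == 0:
        return "left_end"
    if state == 4:
        return "right_end"
    return "corridor"                      # cells 1 and 3 are aliased

def step(state, action):
    return max(0, min(4, state + (1 if action == "right" else -1)))

def run_episode(policy, max_steps=100):
    state = random.choice([1, 3])          # start in one of the aliased cells
    for t in range(max_steps):
        if state == GOAL:
            return t                        # steps needed to reach the goal
        state = step(state, policy(observe(state)))
    return max_steps                        # never reached the goal

def deterministic(obs):
    # Commits to a single action per observation: always "right" in the aliased cells.
    return {"left_end": "right", "right_end": "left", "corridor": "right"}[obs]

def stochastic(obs):
    # Randomizes in the aliased observation: a mixed strategy.
    if obs == "corridor":
        return random.choice(["left", "right"])
    return "right" if obs == "left_end" else "left"

for name, pi in [("deterministic", deterministic), ("stochastic", stochastic)]:
    steps = [run_episode(pi) for _ in range(2000)]
    print(f"{name:13s} mean steps to goal: {sum(steps) / len(steps):.1f}")
```

Starting from cell 3, the deterministic policy bounces between cells 3 and 4 forever, so roughly half its episodes hit the step limit; the stochastic policy reaches the goal in a handful of steps from either start. No deterministic mapping from observations to actions can fix this, because the observation genuinely does not contain the information needed to choose correctly.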
The third failure mode is partial observability, where the agent does not directly observe the full Markov state at all. In a POMDP, the observation history or belief state is what matters, and deterministic policies can be fundamentally suboptimal. Stochasticity is not merely a convenience here — it can be structurally necessary, and there are settings where the best deterministic policy performs arbitrarily badly compared with a stochastic one. This is one of the most important conceptual reasons to stop treating “policy = argmax over value” as the universal template.
So the real lesson is not “value functions are useless.” Rather, value functions are often the wrong interface for control. If the action space is large or continuous, or if the environment forces the agent to randomize, then the policy itself should be optimized directly. That is exactly what policy gradient methods do: instead of first learning a value surface and then extracting a policy by maximization, we parameterize the policy and improve by following the gradient of expected return.
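In symbols, and anticipating the likelihood-ratio form derived later in the lecture, the idea is to define a parameterized stochastic policy $\pi_\theta(a \mid s)$, measure its quality by the expected return of the trajectories it generates, and ascend the gradient of that objective:
$$
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big],
\qquad
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big],
\qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta).
$$
Because the expectation is over trajectories sampled from the current policy, the gradient can be estimated from rollouts alone, with no argmax over actions and no need to differentiate through the environment.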
The visual below compresses these three failure modes into one glance. The left panel emphasizes that in continuous action spaces, the obstacle is the search problem hidden inside $\arg\max_{a} Q(s, a)$. The middle panel shows why aliased observations can demand a mixed strategy, making a stochastic policy strictly more expressive than a deterministic one. The right panel captures the POMDP intuition: when the agent does not fully observe the state, stochastic policies can outperform any fixed deterministic rule.
Taken together, the three panels justify the move that policy gradient methods make from the start: rather than asking how to recover a policy from a value function, we ask how to optimize the policy directly. That shift is what makes the rest of the lecture possible — from the likelihood-ratio derivation to REINFORCE, baselines, and actor-critic methods.





























