Policy Gradients
The core idea of reinforcement learning is to learn some kind of behavior by optimizing for rewards. The behavior learned by an agent, i.e. the scheme it follows while going through this process, is the learned policy that it uses to decide which action to take and, thus, how to transition from one state to another. One way to close the loop for the agent is to evaluate states and actions through value functions; we then measure the learned policy through these value functions, approximated by lookup tables, linear combinations, neural networks, etc. Policy gradient methods take a different approach: they bypass the need for a value function by parameterizing the policy directly. While the agent can still use a value function to learn, it need not use it for selecting actions. Policy gradient methods offer several advantages:
- Approximating the policy might be simpler than approximating action values, and a policy-based method can then typically learn faster and yield a superior asymptotic policy. A very good example that illustrates this is the work by Simsek et al. on the game of Tetris, where they showed that it is possible to choose well among actions without really evaluating them.
- Policy gradient methods can learn stochastic policies. Card games with imperfect information, like poker, are a direct example where the optimal play may be to take two different actions with certain probabilities. If we select actions by maximizing approximate action values, we have no natural way of finding such stochastic policies; policy gradient methods can represent them directly.
- Policy gradient methods offer stronger convergence guarantees, since with a continuous policy parameterization the action probabilities change smoothly as the parameters change. This is not the case with $\epsilon$-greedy action selection, where an arbitrarily small change in the estimated values can abruptly change which action is greedy, and hence the policy.
- The choice of policy parameterization is sometimes a good way of injecting prior knowledge about a desired form of the policy into the system. This is especially helpful when we look at introducing Meta-Learning strategies into Reinforcement Learning. A small sketch of one common parameterization follows this list.
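To make the last point concrete, here is a minimal sketch of one common parameterization, a softmax over linear action preferences $h(s, a, \mathbf{\theta}) = \mathbf{\theta}^\top \mathbf{x}(s, a)$. The feature vectors and parameter values are made up purely for illustration.

```python
import numpy as np

def action_probabilities(theta, features):
    """Softmax policy over linear action preferences: pi(a|s) proportional to exp(theta . x(s, a)).

    `features[a]` is the feature vector x(s, a) of action a in the current state.
    """
    prefs = features @ theta            # h(s, a, theta) = theta^T x(s, a)
    prefs = prefs - prefs.max()         # subtract max for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

# Made-up parameters and features for a state with two actions.
theta = np.array([0.2, -0.1])
features = np.array([[1.0, 0.0],        # x(s, a=0)
                     [0.5, 1.0]])       # x(s, a=1)
print(action_probabilities(theta, features))   # approx. [0.55 0.45], a genuinely mixed strategy
```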
In the following sections, I first follow the theoretical treatment in Sutton and Barto's book, since that is how I was best able to understand the essence of policy gradients. I then redo the derivation from the viewpoint of trajectories, since I find that notation more intuitive.
Policy Gradient Theorem
The issue with parameterizing the policy is that the policy affects both the action selections and the distribution of states in which those selections are made. While going from a state to an action is straightforward, going the other way round involves the environment, whose dynamics are typically unknown. With this unknown effect of policy changes on the state distribution, the difficulty lies in evaluating the gradient of the performance. This is where the policy gradient theorem comes into the picture, as it shows that the gradient of the performance w.r.t. the policy parameters does not involve the derivative of the state distribution. For episodic tasks, if we assume that every episode starts in some particular (non-random) state $s_0$, we can take the performance measure to be the value of this start state:

$$J(\mathbf{\theta}) \doteq v_{\pi_{\mathbf{\theta}}}(s_0)$$
For simplicity, we remove the explicit dependence on $\mathbf{\theta}$ from the notation: $\pi(a|s)$ stands for $\pi(a|s, \mathbf{\theta})$, and all gradients are taken with respect to $\mathbf{\theta}$. Starting from the definition of the state-value function and applying the product rule:

$$\nabla v_\pi(s) = \nabla \Big[ \sum_a \pi(a|s)\, q_\pi(s,a) \Big] = \sum_a \Big[ \nabla \pi(a|s)\, q_\pi(s,a) + \pi(a|s)\, \nabla q_\pi(s,a) \Big]$$

we now extend $q_\pi(s,a)$ using its definition in terms of the dynamics and the value of the successor state:

$$\nabla v_\pi(s) = \sum_a \Big[ \nabla \pi(a|s)\, q_\pi(s,a) + \pi(a|s)\, \nabla \sum_{s', r} p(s', r | s, a)\big(r + v_\pi(s')\big) \Big]$$

The reward is independent of the parameters $\theta$, so we can set that derivative inside the sum to 0, and so, we get:

$$\nabla v_\pi(s) = \sum_a \Big[ \nabla \pi(a|s)\, q_\pi(s,a) + \pi(a|s) \sum_{s'} p(s' | s, a)\, \nabla v_\pi(s') \Big]$$

Thus, we now have a recursive formulation of $\nabla v_\pi(s)$ in terms of the gradients at the successor states. Repeatedly unrolling this recursion gives:

$$\nabla v_\pi(s) = \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_a \nabla \pi(a|x)\, q_\pi(x,a)$$

Here, $\Pr(s \to x, k, \pi)$ is the probability of going from state $s$ to state $x$ in $k$ steps under the policy $\pi$.

In our derivation, every episode starts in the particular state $s_0$, so:

$$\nabla J(\mathbf{\theta}) = \nabla v_\pi(s_0) = \sum_s \bigg( \sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi) \bigg) \sum_a \nabla \pi(a|s)\, q_\pi(s,a) = \sum_s \eta(s) \sum_a \nabla \pi(a|s)\, q_\pi(s,a)$$

where $\eta(s)$ is the expected number of visits to state $s$ in an episode; normalizing $\eta(s)$ over all states gives the on-policy state distribution $\mu(s)$.

Thus, we get the form of the theorem as:

$$\nabla J(\mathbf{\theta}) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla \pi(a|s, \mathbf{\theta})$$
The constant of proportionality is the average length of an episode; for continuing tasks it is 1, and the relation becomes an equality. We now have a gradient of the performance with respect to the parameterized policy, which allows us to move in the direction that maximizes the performance, i.e. gradient ascent. We can estimate this gradient through different means.
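To convince myself that the state-distribution term really does drop out of the gradient, here is a minimal numerical check of the identity $\nabla v_\pi(s_0) = \sum_s \eta(s) \sum_a q_\pi(s,a)\, \nabla \pi(a|s)$ on a toy MDP. The two-state MDP, the tabular softmax policy, and all function names are my own illustrative choices; the sketch simply compares the theorem's expression against a finite-difference gradient.

```python
import numpy as np

# A tiny, made-up episodic MDP: two non-terminal states, two actions, gamma = 1.
# next_state[s, a] is the successor (-1 means terminal); reward[s, a] is the reward.
next_state = np.array([[1, -1],
                       [-1, 0]])
reward = np.array([[0.0, 1.0],
                   [2.0, 0.0]])
n_states, n_actions = reward.shape
s0 = 0  # every episode starts here

def policy(theta):
    """Tabular softmax policy: pi[s, a] = exp(theta[s, a]) / sum_b exp(theta[s, b])."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta):
    """Return v (state values), q (action values) and eta (expected visit counts from s0)."""
    pi = policy(theta)
    P = np.zeros((n_states, n_states))   # state-to-state transition matrix under pi
    r_pi = np.zeros(n_states)            # expected one-step reward under pi
    for s in range(n_states):
        for a in range(n_actions):
            r_pi[s] += pi[s, a] * reward[s, a]
            if next_state[s, a] >= 0:
                P[s, next_state[s, a]] += pi[s, a]
    v = np.linalg.solve(np.eye(n_states) - P, r_pi)             # v = r_pi + P v
    q = reward + np.where(next_state >= 0, v[next_state], 0.0)  # q(s,a) = r + v(s')
    eta = np.linalg.solve(np.eye(n_states) - P.T, np.eye(n_states)[s0])
    return v, q, eta

def grad_from_theorem(theta):
    """grad v(s0) = sum_s eta(s) * sum_a q(s, a) * grad pi(a|s)."""
    pi = policy(theta)
    _, q, eta = evaluate(theta)
    grad = np.zeros_like(theta)
    for s in range(n_states):
        for a in range(n_actions):
            # d pi(a|s) / d theta[s, b] = pi(a|s) * (1[a == b] - pi(b|s))
            dpi_dtheta_s = pi[s, a] * ((np.arange(n_actions) == a) - pi[s])
            grad[s] += eta[s] * q[s, a] * dpi_dtheta_s
    return grad

def grad_numerical(theta, eps=1e-5):
    """Central finite differences of J(theta) = v_pi(s0), for comparison."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[idx] += eps
        t_minus[idx] -= eps
        grad[idx] = (evaluate(t_plus)[0][s0] - evaluate(t_minus)[0][s0]) / (2 * eps)
    return grad

theta = np.random.randn(n_states, n_actions)
print(grad_from_theorem(theta))   # the two printed gradients should agree closely
print(grad_numerical(theta))
```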
REINFORCE: Monte-Carlo Sampling
For Monte-Carlo sampling of the policy gradient, our essential requirement is writing the gradient as an expectation over a distribution we can actually sample from. From the equation of the policy gradient theorem, since $\mu$ is the distribution of states encountered while following $\pi$, we can write the sum over states as an expectation over a sample of states $S_t$:

$$\nabla J(\mathbf{\theta}) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla \pi(a|s, \mathbf{\theta}) = \mathbb{E}_\pi \Big[ \sum_a q_\pi(S_t, a)\, \nabla \pi(a|S_t, \mathbf{\theta}) \Big]$$

The expectation above would also be an expectation over the actions if we were to include the probability of selecting the actions as the weight. Thus, we can do that to remove the sum over actions too:

$$\nabla J(\mathbf{\theta}) \propto \mathbb{E}_\pi \Big[ \sum_a \pi(a|S_t, \mathbf{\theta})\, q_\pi(S_t, a)\, \frac{\nabla \pi(a|S_t, \mathbf{\theta})}{\pi(a|S_t, \mathbf{\theta})} \Big] = \mathbb{E}_\pi \Big[ q_\pi(S_t, A_t)\, \frac{\nabla \pi(A_t|S_t, \mathbf{\theta})}{\pi(A_t|S_t, \mathbf{\theta})} \Big]$$

The expectation of the return $G_t$ conditioned on $S_t$ and $A_t$ is exactly $q_\pi(S_t, A_t)$, so we can replace the action value with the sampled return and use the identity $\nabla \ln x = \nabla x / x$:

$$\nabla J(\mathbf{\theta}) \propto \mathbb{E}_\pi \big[ G_t\, \nabla \ln \pi(A_t|S_t, \mathbf{\theta}) \big]$$

We now have a full sampling of the states and actions conditioned on our parameters inside the gradient. Each term can be computed from a sample generated by the policy, and we can update our parameters using this quantity to get our update rule:

$$\mathbf{\theta}_{t+1} \doteq \mathbf{\theta}_t + \alpha\, G_t\, \nabla \ln \pi(A_t | S_t, \mathbf{\theta}_t)$$

This is the REINFORCE algorithm! Each update is simply the learning rate $\alpha$ times the return $G_t$ times the gradient of the log-probability of the action actually taken, so actions that lead to high returns have their probabilities increased.
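A minimal sketch of this update for a tabular softmax policy is given below; it assumes an episode has already been collected as a list of $(S_t, A_t, R_{t+1})$ tuples, and the environment interaction loop is omitted. All names are illustrative.

```python
import numpy as np

def softmax_probs(theta, s):
    """Action probabilities of a tabular softmax policy: pi(a|s) = softmax(theta[s])."""
    prefs = theta[s] - theta[s].max()   # subtract max for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()

def grad_log_pi(theta, s, a):
    """Gradient of ln pi(a|s) w.r.t. theta; nonzero only in row s of theta."""
    grad = np.zeros_like(theta)
    pi_s = softmax_probs(theta, s)
    grad[s] = -pi_s
    grad[s, a] += 1.0                   # d ln pi(a|s) / d theta[s, b] = 1[a==b] - pi(b|s)
    return grad

def reinforce_update(theta, episode, alpha=0.01):
    """One REINFORCE update from a single episode given as [(s_0, a_0, r_1), ...]."""
    G, grad = 0.0, np.zeros_like(theta)
    # Walk the episode backwards so the return G_t can be accumulated incrementally.
    for s, a, r in reversed(episode):
        G = r + G                        # undiscounted return G_t
        grad += G * grad_log_pi(theta, s, a)
    return theta + alpha * grad          # gradient ascent step
```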
Looking at Trajectories
Another way to look at the above formulation is through sampled trajectories, as done in Sergey Levine's slides. While this treatment is theoretically equivalent to the one above, I just find the notation more intuitive. Recall that a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$ is a sequence of states and actions over time, and the rewards accumulated along this sequence qualify the trajectory. Thus, we can say that the utility objective is the expected sum of rewards over trajectories sampled from the policy:

$$J(\mathbf{\theta}) = \mathbb{E}_{\tau \sim p_{\mathbf{\theta}}(\tau)} \Big[ \sum_t r(s_t, a_t) \Big]$$

Let this sum of rewards be denoted by $r(\tau)$.

This expectation is essentially sampling a trajectory from the policy and weighing it with the accumulated rewards. Thus, we can write it as:

$$J(\mathbf{\theta}) = \int p_{\mathbf{\theta}}(\tau)\, r(\tau)\, d\tau$$

Now, we just differentiate this objective, and multiply and divide by a convenient policy term to make it an expectation again:

$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \int \nabla_{\mathbf{\theta}} p_{\mathbf{\theta}}(\tau)\, r(\tau)\, d\tau = \int p_{\mathbf{\theta}}(\tau)\, \frac{\nabla_{\mathbf{\theta}} p_{\mathbf{\theta}}(\tau)}{p_{\mathbf{\theta}}(\tau)}\, r(\tau)\, d\tau$$

Now, we just use the log-derivative identity

$$p_{\mathbf{\theta}}(\tau)\, \nabla_{\mathbf{\theta}} \log p_{\mathbf{\theta}}(\tau) = p_{\mathbf{\theta}}(\tau)\, \frac{\nabla_{\mathbf{\theta}} p_{\mathbf{\theta}}(\tau)}{p_{\mathbf{\theta}}(\tau)} = \nabla_{\mathbf{\theta}} p_{\mathbf{\theta}}(\tau)$$

Thus, we can write this as:

$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \int p_{\mathbf{\theta}}(\tau)\, \nabla_{\mathbf{\theta}} \log p_{\mathbf{\theta}}(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_{\mathbf{\theta}}(\tau)} \big[ \nabla_{\mathbf{\theta}} \log p_{\mathbf{\theta}}(\tau)\, r(\tau) \big]$$

Hence, we get the same final result as before: the gradient is the gradient of the log-probability under the policy, weighted by the rewards accumulated over the trajectory. Now, we can translate this to states and actions over the trajectory by simply writing out what the probability of a trajectory under the policy represents:

$$p_{\mathbf{\theta}}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\mathbf{\theta}}(a_t | s_t)\, p(s_{t+1} | s_t, a_t)$$

When we take the log of this probability and differentiate it w.r.t. $\mathbf{\theta}$, the initial-state and dynamics terms $\log p(s_1)$ and $\log p(s_{t+1} | s_t, a_t)$ do not depend on this parameter, and are set to 0. Thus, our utility gradient ends up looking like:

$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \mathbb{E}_{\tau \sim p_{\mathbf{\theta}}(\tau)} \bigg[ \Big( \sum_{t=1}^{T} \nabla_{\mathbf{\theta}} \log \pi_{\mathbf{\theta}}(a_t | s_t) \Big) \Big( \sum_{t=1}^{T} r(s_t, a_t) \Big) \bigg]$$

When we then take the average of $N$ sampled trajectories as an estimate of the expectation, we get:

$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) \approx \frac{1}{N} \sum_{i=1}^{N} \Big( \sum_{t=1}^{T} \nabla_{\mathbf{\theta}} \log \pi_{\mathbf{\theta}}(a_{i,t} | s_{i,t}) \Big) \Big( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \Big)$$

And this is the key formula behind REINFORCE again; the update can thus be written as:

$$\mathbf{\theta} \leftarrow \mathbf{\theta} + \alpha\, \nabla_{\mathbf{\theta}} J(\mathbf{\theta})$$
This is where I find it more intuitive to just think in terms of trajectories, since the whole algorithm reduces to three steps (a small sketch of these steps is given after the list):

- Sample a set of trajectories $\{\tau_i\}$ from the policy $\pi_{\mathbf{\theta}}(a_t | s_t)$ by running it in the environment
- Estimate the gradient $\nabla_{\mathbf{\theta}} J(\mathbf{\theta})$ with the sample average above
- Update the parameters: $\mathbf{\theta} \leftarrow \mathbf{\theta} + \alpha\, \nabla_{\mathbf{\theta}} J(\mathbf{\theta})$
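The inner computation of this loop might look like the sketch below. It assumes each sampled trajectory has already been stored as a list of `(grad_log_pi, reward)` pairs, with `grad_log_pi` being $\nabla_{\mathbf{\theta}} \log \pi_{\mathbf{\theta}}(a_t|s_t)$ evaluated at the current parameters; this representation is my own convention, and the environment-dependent sampling step is omitted.

```python
import numpy as np

def policy_gradient_estimate(trajectories):
    """Trajectory-view estimate of grad J:
    (1/N) * sum_i [ (sum_t grad log pi(a_t|s_t)) * (sum_t r(s_t, a_t)) ].

    Each trajectory is a list of (grad_log_pi, reward) pairs, with grad_log_pi
    already evaluated at the current parameters.
    """
    per_traj_grads = []
    for traj in trajectories:
        sum_grad_log = sum(g for g, _ in traj)   # sum_t grad log pi(a_t|s_t)
        total_reward = sum(r for _, r in traj)   # r(tau)
        per_traj_grads.append(sum_grad_log * total_reward)
    return np.mean(per_traj_grads, axis=0)

def gradient_ascent_step(theta, trajectories, alpha=0.01):
    """theta <- theta + alpha * estimated grad J."""
    return theta + alpha * policy_gradient_estimate(trajectories)
```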
Reducing Variance
The idea of parameterizing the policy and working directly with sampled trajectories has an intuitive appeal due to its clarity. However, this approach suffers from high variance. This is because when we compute the expectation over trajectories, we are essentially approximating it with a small number of samples from an enormous space of possible trajectories, and the accumulated reward that weighs each sample can differ wildly from one trajectory to the next, so the gradient estimate fluctuates a lot between batches. Two common ways of reducing this variance are discussed below.
Rewards to go
In the trajectory formulation, we are accumulating all of the rewards from $t = 1$ to $T$ and using that full sum to weight the log-policy gradient at every time step. However, the action taken at time $t$ cannot change the rewards received before $t$; only the rewards from $t$ onwards, the "rewards to go", should weight that term. This gives the estimator:

$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\mathbf{\theta}} \log \pi_{\mathbf{\theta}}(a_{i,t} | s_{i,t}) \Big( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \Big)$$
Thus, by reducing the number of reward terms that weight each log-policy gradient, we are essentially better able to control the variance, up to a certain extent.
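A small helper for these reward-to-go sums might look like the following; the optional discount factor is an extra of my own, and the undiscounted case above corresponds to `gamma=1.0`.

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Compute sum_{t'=t}^{T} gamma^(t'-t) * r_t' for every time step t."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # suffix sum, built from the back
        rtg[t] = running
    return rtg

print(rewards_to_go([1.0, 0.0, 2.0]))   # -> [3. 2. 2.] in the undiscounted case
```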
Baselining
Another way to control the variance is to realize that the spread of returns across the actions available in each state of the trajectory is what creates the variance; what matters is how much better an action is than the alternatives, not the raw magnitude of the return. However, we see that our return can be shifted by a baseline $b$ that does not depend on the action without biasing the gradient:

$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) = \mathbb{E}_{\tau \sim p_{\mathbf{\theta}}(\tau)} \big[ \nabla_{\mathbf{\theta}} \log p_{\mathbf{\theta}}(\tau)\, \big( r(\tau) - b \big) \big]$$

since

$$\mathbb{E}_{\tau \sim p_{\mathbf{\theta}}(\tau)} \big[ \nabla_{\mathbf{\theta}} \log p_{\mathbf{\theta}}(\tau)\, b \big] = \int p_{\mathbf{\theta}}(\tau)\, \nabla_{\mathbf{\theta}} \log p_{\mathbf{\theta}}(\tau)\, b\, d\tau = \int \nabla_{\mathbf{\theta}} p_{\mathbf{\theta}}(\tau)\, b\, d\tau = b\, \nabla_{\mathbf{\theta}} \int p_{\mathbf{\theta}}(\tau)\, d\tau = b\, \nabla_{\mathbf{\theta}} 1 = 0$$
Thus, we can effectively update our REINFORCE update with this baseline to get:

$$\mathbf{\theta}_{t+1} \doteq \mathbf{\theta}_t + \alpha\, \big( G_t - b(S_t) \big)\, \nabla \ln \pi(A_t | S_t, \mathbf{\theta}_t)$$

Or, in the trajectory formulation:

$$\nabla_{\mathbf{\theta}} J(\mathbf{\theta}) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\mathbf{\theta}} \log p_{\mathbf{\theta}}(\tau_i)\, \big( r(\tau_i) - b \big)$$

One good function to use for a state-dependent baseline $b(S_t)$ is a learned estimate of the state value, $\hat{v}(S_t, \mathbf{w})$; in the trajectory formulation, an even simpler choice is the average return over the sampled batch, $b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)$.
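As a small illustration of the batch-average choice, the sketch below extends the earlier trajectory-based estimator (same assumed `(grad_log_pi, reward)` representation, which is my own convention) with an average-return baseline.

```python
import numpy as np

def policy_gradient_with_baseline(trajectories):
    """Trajectory-view gradient estimate with an average-return baseline.

    Each trajectory is a list of (grad_log_pi, reward) pairs, as in the earlier sketch.
    """
    returns = np.array([sum(r for _, r in traj) for traj in trajectories])
    baseline = returns.mean()                    # b = (1/N) * sum_i r(tau_i)
    per_traj_grads = []
    for traj, ret in zip(trajectories, returns):
        sum_grad_log = sum(g for g, _ in traj)   # sum_t grad log pi(a_t|s_t)
        per_traj_grads.append(sum_grad_log * (ret - baseline))
    return np.mean(per_traj_grads, axis=0)
```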