Notes on Policy Gradients
Policy Gradients is a Reinforcement Learning optimization method in which no value functions such as \(V(s)\) or \(Q(s,a)\) are used to find a good policy indirectly. Instead, the goal is to directly find a policy \( \pi_{\theta}(u|s) \), a distribution over actions \( u \) given a state \( s \), that maximizes the expected sum of rewards.
The policy can be represented by a function approximator such as a neural network parameterized by \( \theta \). This network takes as input information about the state \( s_{t} \) and outputs a distribution over the possible actions. To act at any given time step, we sample an action \( u_{t} \) from that distribution and observe the state \( s_{t+1} \) and reward \( r_{t+1} \) returned by the environment.
As opposed to state-value function \( V(s) \) and action-value function \( Q(s,a) \) methods, \(\theta\) is not used to predict the value or quality of a given state but to parameterize the policy directly.
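As a concrete illustration, here is a minimal sketch of such a parameterized policy: a single linear layer followed by a softmax over a discrete action set. The feature and action dimensions, and the way states are encoded, are assumptions made only for the example.
```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class LinearSoftmaxPolicy:
    """pi_theta(u|s): a linear map from state features to action logits."""
    def __init__(self, n_features, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        # theta holds one row of weights per action
        self.theta = 0.01 * self.rng.standard_normal((n_actions, n_features))

    def distribution(self, s):
        # distribution over actions for state s
        return softmax(self.theta @ s)

    def sample_action(self, s):
        p = self.distribution(s)
        return self.rng.choice(len(p), p=p)

# Acting: sample u_t from pi_theta(.|s_t); the environment then
# returns s_{t+1} and r_{t+1}.
policy = LinearSoftmaxPolicy(n_features=4, n_actions=2)
s_t = np.array([0.1, -0.2, 0.3, 0.0])    # example state features
u_t = policy.sample_action(s_t)
```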
Policy optimization
As stated before, the goal is to find a policy, which outputs a distribution over actions for any given state, that maximizes the expected sum of rewards throughout a sequence of states and actions. This can be framed as an optimization problem over the parameters \(\theta\) of the neural network.
$$
\newcommand{\E}{\mathrm{E}}
\underset{\theta}{\text{max}}\quad \E \Big[ \sum_{t=0}^{H} R(s_{t})\ |\ \pi_{\theta}\ \Big]
$$
As can be read from the expression above, what is to be maximized is the expectation of the sum of rewards collected over several states under a given policy. The sum of rewards under the policy is therefore treated as a random variable.
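Because this expectation cannot in general be computed exactly, it is typically estimated by sampling rollouts under the current policy. A minimal sketch, assuming a hypothetical `env` object whose `reset` and `step` methods follow the usual state/reward/done convention:
```python
def estimate_expected_return(env, policy, horizon, n_rollouts=100):
    """Monte Carlo estimate of E[ sum_t R(s_t) | pi_theta ]."""
    returns = []
    for _ in range(n_rollouts):
        s = env.reset()                  # assumed environment interface
        total = 0.0
        for t in range(horizon):
            u = policy.sample_action(s)
            s, r, done = env.step(u)     # assumed to return (s_next, r, done)
            total += r
            if done:
                break
        returns.append(total)
    return np.mean(returns)
```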
Likelihood Ratio Policy Gradient
In order to go further with the optimization process, it is convenient to introduce some notation that allows us to represent the problem more compactly. Hence, \( \tau \) will denote a state-action sequence.
$$
\tau = s_{0},u_{0},...,s_{H},u_{H}
$$
The total reward can be expressed as the summation of the individual rewards over the entire trajectory \( \tau \).
$$
R(\tau) = \sum_{t=0}^HR(s_{t},u_{t})
$$
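In code, continuing with the snippets above, a trajectory can simply be stored as the sequences of visited states, sampled actions, and collected rewards, with \(R(\tau)\) being the summed reward; the rollout length and environment interface are again assumptions.
```python
def rollout(env, policy, horizon):
    """Collect one state-action sequence tau = s_0, u_0, ..., s_H, u_H."""
    states, actions, rewards = [], [], []
    s = env.reset()
    for t in range(horizon + 1):
        u = policy.sample_action(s)
        states.append(s)
        actions.append(u)
        s, r, done = env.step(u)
        rewards.append(r)
        if done:
            break
    return states, actions, rewards

def trajectory_reward(rewards):
    # R(tau) = sum_t R(s_t, u_t)
    return float(np.sum(rewards))
```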
Apart from the total reward of a state-action sequence, we can talk about the concept of utility, which is the expectation of the sum of individual rewards along \( \tau \) under \( \pi_{\theta} \).
$$
U(\theta) = \E \Big[ \sum_{t=0}^{H} R(s_{t},u_{t});\pi_{\theta}\ \Big]
$$
So, the utility of \( \theta \) is nothing but the expected value of the trajectory reward \( R(\tau) \) when following the parameterized policy \( \pi_{\theta} \). Since the utility is an expectation of the total reward over all possible trajectories, it can also be expressed as a sum of products, each one the reward of a trajectory times the probability of that trajectory. This holds because an expectation is a weighted sum in which each weight reflects how likely each \(R(\tau)\) is.
$$
U(\theta) = \E \Big[ \sum_{t=0}^{H} R(s_{t},u_{t});\pi_{\theta}\ \Big] =
\sum_{\tau}P(\tau;\theta)R(\tau)
$$
From the previous equation, it becomes more evident what the utility means: it is the expected reward that can be obtained across many different state-action sequences if we follow a certain policy. Since we are talking about policy optimization, the idea behind all this formulation is to improve the policy in order to obtain as much reward as possible regardless of which trajectory occurs.
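As a toy illustration (the numbers are invented purely for the example), suppose only two trajectories are possible under \( \pi_{\theta} \): \( \tau_{1} \) with probability \( 0.7 \) and reward \( 1 \), and \( \tau_{2} \) with probability \( 0.3 \) and reward \( 10 \). Then
$$
U(\theta) = P(\tau_{1};\theta)R(\tau_{1}) + P(\tau_{2};\theta)R(\tau_{2}) = 0.7 \cdot 1 + 0.3 \cdot 10 = 3.7
$$
and shifting probability mass from \( \tau_{1} \) to \( \tau_{2} \) by changing \( \theta \) would increase the utility.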
Thus, by "maximizing the utility" we mean to increase the probabilities of obtaining rewards over several state-action sequences. And, since these probabilities are parameterized by \( \theta \), it is by chainging these that it becomes possible to increase the likelihood of obtaining greater rewards.
$$
\underset{\theta}{\text{max}}\ U(\theta) =
\underset{\theta}{\text{max}}\ \sum_{\tau}P(\tau;\theta)R(\tau)
$$
The intuition from above should be clearer by now, because the utility is a function of the parameters. For a particular set of parameters \( \theta \) a certain level of reward can be attained; if a different set of parameters is fed into the utility function, a different amount of total reward will be observed.
To proceed with the optimization, the previous equation has to be differentiated: we need to calculate the gradient of the utility function with respect to \( \theta \).
$$
\begin{align*}
\nabla_{\theta}U(\theta) &= \nabla_{\theta} \sum_{\tau}P(\tau;\theta)R(\tau)
\\ &= \sum_{\tau}\nabla_{\theta} P(\tau;\theta)R(\tau) \quad \tiny\texttt{The gradient of the sum is the sum of the gradients.}
\\ &= \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)} \nabla_{\theta}P(\tau;\theta)R(\tau) \quad \tiny\texttt{A little trick.}
\\ &= \sum_{\tau} P(\tau;\theta) \frac{\nabla_{\theta}P(\tau;\theta)}{P(\tau;\theta)}R(\tau) \quad \tiny\texttt{Same value, different ratio.}
\\ &= \sum_{\tau} P(\tau;\theta) \nabla_{\theta}\log{P(\tau;\theta)} R(\tau) \quad \tiny\texttt{This is true since $
\frac{d}{dx}\log{f(x)} = \frac{f^\prime(x)}{f(x)}
$.}
\\ &= \E \Big[ \nabla_{\theta}\log{P(\tau;\theta)} R(\tau) \Big]
\end{align*}
$$
There are two good reasons for arranging the equations as shown in the previous steps. The first has to do with the final expression: it is the expectation of the gradient of the log probability of a given sequence under our policy times its reward. Because we ended up with an expectation, we now have a value that can be approximated through a sample-based estimate.
$$
\nabla U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} \log{P(\tau^{(i)};\theta)}R(\tau^{(i)})
$$
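A minimal sketch of this sample-based estimate, continuing with the earlier snippets and assuming a hypothetical `grad_log_prob_trajectory` helper that returns \( \nabla_{\theta}\log{P(\tau^{(i)};\theta)} \) for a sampled trajectory (the decomposition below will make this helper concrete):
```python
def estimate_policy_gradient(trajectories, grad_log_prob_trajectory):
    """g_hat = (1/m) * sum_i  grad_theta log P(tau_i; theta) * R(tau_i)."""
    grads = []
    for states, actions, rewards in trajectories:
        g = grad_log_prob_trajectory(states, actions)   # hypothetical helper
        grads.append(g * trajectory_reward(rewards))
    return np.mean(grads, axis=0)
```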
The second reason is that, since we now have the log probability of a sequence under the policy, we can decompose that part of the equation into an expression that explicitly reflects how the probability of a path of states and actions is composed.
$$
\begin{align*}
\nabla_{\theta} \log{P(\tau^{(i)};\theta)} &= \nabla_{\theta}
\log{\Bigg[ \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)},u_{t}^{(i)})\cdot \pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)}) \Bigg]}
\\ &= \nabla_{\theta} \Bigg[ \sum_{t=0}^{H}\log{P(s_{t+1}^{(i)}|s_{t}^{(i)},u_{t}^{(i)})} + \sum_{t=0}^{H}\log{\pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)})} \Bigg]
\end{align*}
$$
The equation above expresses the probability of a path as a product, over time steps, of the probability of each state transition given the chosen action and the probability of that action under the policy. By the rules of logarithms, the log of a product can be expressed as the sum of the logarithms, and so can our equation.
As the expression is differentiated with respect to \( \theta \), the first summation inside the brackets, which contains only the dynamics terms, does not depend on \( \theta \); its gradient is zero and it can simply be dropped. That leaves us with just the gradient of the log policy with respect to \(\theta\).
$$
\begin{align*}
\nabla_{\theta} \log{P(\tau^{(i)};\theta)} &= \nabla_{\theta} \sum_{t=0}^{H}\log{\pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)})}
\\ &= \sum_{t=0}^{H}\nabla_{\theta} \log{\pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)})}
\end{align*}
$$
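As a concrete example (not part of the original derivation), for the linear softmax policy sketched earlier, \( \pi_{\theta}(u|s) \propto \exp(\theta_{u}^{\top}s) \), each of these terms has a closed form:
$$
\nabla_{\theta_{u^{\prime}}} \log{\pi_{\theta}(u|s)} = \big( \mathbf{1}[u^{\prime}=u] - \pi_{\theta}(u^{\prime}|s) \big)\, s
$$
so the gradient of the log policy can be computed directly from the sampled action and the current action distribution.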
Now, we can arrange the gradient estimation as follows.
$$
\begin{align*}
\hat{g} &= \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H}\nabla_{\theta} \log{\pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)})}
R(\tau^{(i)})
\\ &= \frac{1}{m} \sum_{i=1}^{m} \Bigg( \sum_{t=0}^{H}\nabla_{\theta} \log{\pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)})} \Bigg) \Bigg( \sum_{t=0}^{H} R(s_{t}^{(i)},u_{t}^{(i)}) \Bigg)
\end{align*}
$$
According to the value of the total reward, the gradient of the log probability of each action given its state will be scaled up or down. In other words, the gradient of the policy is scaled in proportion to how good the reward was for the sequence in which those actions were taken.
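Putting the pieces together, here is a minimal REINFORCE-style sketch that builds on the earlier snippets (the linear softmax policy, `rollout`, and `trajectory_reward`); the learning rate, batch size, and environment interface are assumptions made for the example.
```python
def grad_log_pi(policy, s, u):
    """grad_theta log pi_theta(u|s) for the linear softmax policy:
    row u' equals (1[u'=u] - pi_theta(u'|s)) * s."""
    p = policy.distribution(s)
    g = -np.outer(p, s)
    g[u] += s
    return g

def reinforce_update(policy, env, horizon, m=20, learning_rate=0.01):
    """One gradient-ascent step on U(theta) from m sampled trajectories."""
    g_hat = np.zeros_like(policy.theta)
    for _ in range(m):
        states, actions, rewards = rollout(env, policy, horizon)
        R_tau = trajectory_reward(rewards)
        # each log-policy gradient along the trajectory is scaled by R(tau)
        for s, u in zip(states, actions):
            g_hat += grad_log_pi(policy, s, u) * R_tau
    g_hat /= m
    policy.theta += learning_rate * g_hat    # ascend the estimated gradient
```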