Notes on Policy Gradients
Policy Gradients is a Reinforcement Learning optimization method in which no value functions such as \(V(s)\) or \(Q(s,a)\) are used to find a good policy indirectly. Instead, the goal is to directly find a policy \( \pi_{\theta}(u|s) \), a distribution over actions \( u \) given a state \( s \), that maximizes the expected sum of rewards.

As the policy, a function approximator such as a neural network parameterized by \( \theta \) can be used. This network takes as input information about the state \( s_{t} \) and outputs a distribution over possible actions. To act at any given time step, we sample an action \( u_{t} \) from that distribution and observe the state \( s_{t+1} \) and reward \( r_{t+1} \) returned by the environment.
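As a minimal sketch of this setup, the snippet below uses a simple linear-softmax policy instead of a full neural network; the state features, action count, and the `policy`/`act` helper names are made up for illustration. The important part is that the parameters \( \theta \) produce a distribution from which an action is sampled.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy(theta, s):
    """pi_theta(u|s): probabilities over discrete actions for state features s.
    theta has shape (n_actions, n_features)."""
    return softmax(theta @ s)

def act(theta, s, rng):
    """Sample an action u_t from the distribution the policy outputs for s_t."""
    probs = policy(theta, s)
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 4))     # toy parameters: 2 actions, 4 state features
s_t = rng.normal(size=4)            # toy state features
u_t = act(theta, s_t, rng)          # the environment would then return s_{t+1}, r_{t+1}
```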

As opposed to state-value function \( V(s) \) and action-value function \( Q(s,a) \) methods, \(\theta\) is not used to predict the value or quality of a given state but to parameterize the policy directly.
Policy optimization
As stated before, the goal is to find a policy, which outputs a distribution over actions for any given state, that maximizes the expected sum of rewards throughout a sequence of states and actions. This can be framed as an optimization problem solved by adjusting the parameters \(\theta\) of the neural network. $$ \newcommand{\E}{\mathrm{E}} \underset{\theta}{\text{max}}\quad \E \Big[ \sum_{t=0}^{H} R(s_{t})\ |\ \pi_{\theta}\ \Big] $$ As can be read from the expression above, what is to be maximized is the expectation of the sum of rewards through several states under a given policy. Therefore, the sum of rewards given the policy has to be treated as a random variable.
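Because the sum of rewards is a random variable, in practice this expectation is typically approximated by averaging the summed rewards of several sampled trajectories. The sketch below assumes a hypothetical environment object with `reset()` and `step(u)` methods and reuses the `policy`/`act` helpers from the previous snippet.

```python
def estimate_objective(theta, env, rng, rollouts=100, horizon=50):
    """Monte Carlo estimate of E[ sum_t R(s_t) | pi_theta ]:
    average the total reward over several sampled trajectories."""
    returns = []
    for _ in range(rollouts):
        s = env.reset()                  # assumed environment interface
        total = 0.0
        for t in range(horizon):
            u = act(theta, s, rng)       # sample u_t ~ pi_theta(.|s_t)
            s, r = env.step(u)           # assumed to return (s_{t+1}, r_{t+1})
            total += r
        returns.append(total)
    return sum(returns) / len(returns)
```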
Likelihood Ratio Policy Gradient
In order to go further with the optimization process, it is convenient to introduce some notation that will allow us to represent the problem in a more compact fashion. Hence, \( \tau \) will denote a state-action sequence. $$ \tau = s_{0},u_{0},...,s_{H},u_{H} $$ The reward of a trajectory can be expressed as the summation of the individual rewards over the entire sequence \( \tau \). $$ R(\tau) = \sum_{t=0}^HR(s_{t},u_{t}) $$ Apart from the total reward of a state-action sequence, we can talk about the concept of utility, which is the expectation of the sum of individual rewards along \( \tau \) under \( \pi_{\theta} \). $$ U(\theta) = \E \Big[ \sum_{t=0}^{H} R(s_{t},u_{t});\pi_{\theta}\ \Big] $$ So, the utility of \( \theta \) is nothing but the expected value of the reward of trajectories \( R(\tau) \) when following a parameterized policy \( \pi_{\theta} \). Given that the utility is an expectation of total reward over all possible trajectories, it can also be expressed as a sum of products, where each product contains a trajectory reward times its probability. This is true because an expectation is the same as a weighted sum where each weight represents how likely each trajectory, and hence each \(R(\tau)\), is. $$ U(\theta) = \E \Big[ \sum_{t=0}^{H} R(s_{t},u_{t});\pi_{\theta}\ \Big] = \sum_{\tau}P(\tau;\theta)R(\tau) $$ From the previous equation, it becomes more evident what utility means; it is the expected reward that can be obtained throughout many different state-action sequences if we follow a certain policy. Since we are talking about policy optimization, the idea behind all this formulation is to improve the policy in order to obtain as much reward as possible regardless of which trajectory occurs.
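As a toy illustration with made-up numbers, suppose only two trajectories are possible under \( \pi_{\theta} \), with \( P(\tau_{1};\theta)=0.7 \), \( R(\tau_{1})=1 \) and \( P(\tau_{2};\theta)=0.3 \), \( R(\tau_{2})=5 \). Then $$ U(\theta) = 0.7 \cdot 1 + 0.3 \cdot 5 = 2.2, $$ which is exactly the weighted-sum interpretation described above.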

Thus, by "maximizing the utility" we mean to increase the probabilities of obtaining rewards over several state-action sequences. And, since these probabilities are parameterized by \( \theta \), it is by chainging these that it becomes possible to increase the likelihood of obtaining greater rewards. $$ \underset{\theta}{\text{max}}\ U(\theta) = \underset{\theta}{\text{max}}\ \sum_{\tau}P(\tau;\theta)R(\tau) $$ The intuition from above should be clearer by now because the utility is a function of the parameters. For a particular set of \( \theta \) a certain level of reward can be attained, if a different set of parameters is fed into the utility function then a different amount of total reward will be observed.

To proceed with the optimization, the previous expression has to be differentiated: we need to calculate the gradient of the utility function with respect to \( \theta \). $$ \begin{align*} \nabla_{\theta}U(\theta) &= \nabla_{\theta} \sum_{\tau}P(\tau;\theta)R(\tau) \\ &= \sum_{\tau}\nabla_{\theta} P(\tau;\theta)R(\tau) \quad \tiny\texttt{The gradient of the sum is the sum of the gradients.} \\ &= \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)} \nabla_{\theta}P(\tau;\theta)R(\tau) \quad \tiny\texttt{A little trick.} \\ &= \sum_{\tau} P(\tau;\theta) \frac{\nabla_{\theta}P(\tau;\theta)}{P(\tau;\theta)}R(\tau) \quad \tiny\texttt{Same value, different ratio.} \\ &= \sum_{\tau} P(\tau;\theta) \nabla_{\theta}\log{P(\tau;\theta)} R(\tau) \quad \tiny\texttt{This is true since $ \frac{d}{dx}\log{f(x)} = \frac{f^\prime(x)}{f(x)} $.} \\ &= \E \Big[ \nabla_{\theta}\log{P(\tau;\theta)} R(\tau) \Big] \end{align*} $$ There are two good reasons for arranging the equations as shown in the previous steps. The first one has to do with the last expression we ended up with: it is the expectation of the gradient of the log probability of a given sequence under our policy times its reward. Because we ended up with an expectation, we now have a value that we can approximate through a sample-based estimate. $$ \nabla_{\theta} U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} \log{P(\tau^{(i)};\theta)}R(\tau^{(i)}) $$ The second reason is that, given that we have a log probability of a sequence under a policy, we can decompose that part of the equation into an expression that explicitly reflects the probability of a path of states and actions. $$ \begin{align*} \nabla_{\theta} \log{P(\tau^{(i)};\theta)} &= \nabla_{\theta} \log{\Bigg[ \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)},u_{t}^{(i)})\cdot \pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)}) \Bigg]} \\ &= \nabla_{\theta} \Bigg[ \sum_{t=0}^{H}\log{P(s_{t+1}^{(i)}|s_{t}^{(i)},u_{t}^{(i)})} + \sum_{t=0}^{H}\log{\pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)})} \Bigg] \end{align*} $$ The equation above expresses the probability of a path as a product of the dynamics terms \( P(s_{t+1}|s_{t},u_{t}) \) and the policy terms \( \pi_{\theta}(u_{t}|s_{t}) \). Since the log of a product is the sum of the logarithms, the expression splits into two separate summations.
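The likelihood ratio (score function) trick above is not specific to trajectories; the short sketch below checks it on a toy case where the true gradient is known: \( x \sim \mathcal{N}(\theta,1) \) and \( f(x)=x^{2} \), so \( \E[f(x)] = \theta^{2}+1 \), the true gradient is \( 2\theta \), and \( \nabla_{\theta}\log p(x;\theta) = x - \theta \).

```python
import numpy as np

# Score-function (likelihood-ratio) estimator on a toy problem:
# p(x; theta) = N(theta, 1), f(x) = x^2, true gradient of E[f] is 2*theta.
rng = np.random.default_rng(0)
theta = 1.5
m = 200_000
x = rng.normal(theta, 1.0, size=m)        # samples x^(i) ~ p(.; theta)
g_hat = np.mean((x - theta) * x**2)       # (1/m) * sum_i grad log p(x_i) * f(x_i)
print(g_hat, 2 * theta)                   # the estimate should be close to 3.0
```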

As the expression is differentiated with respect to \( \theta \), the first summation inside the brackets does not depend on \( \theta \), so its gradient is zero and it can simply be dropped. That leaves us with just the gradient of the log policy with respect to \(\theta\). $$ \begin{align*} \nabla_{\theta} \log{P(\tau^{(i)};\theta)} &= \nabla_{\theta} \sum_{t=0}^{H}\log{\pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)})} \\ &= \sum_{t=0}^{H}\nabla_{\theta} \log{\pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)})} \end{align*} $$ Now, we can arrange the gradient estimate as follows. $$ \begin{align*} \hat{g} &= \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{H}\nabla_{\theta} \log{\pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)})} R(\tau^{(i)}) \\ &= \frac{1}{m} \sum_{i=1}^{m} \Bigg( \sum_{t=0}^{H}\nabla_{\theta} \log{\pi_{\theta}(u_{t}^{(i)}|s_{t}^{(i)})} \Bigg) \Bigg( \sum_{t=0}^{H} R(s_{t}^{(i)},u_{t}^{(i)}) \Bigg) \end{align*} $$ Depending on the value of the total reward, the gradient of the log probability of an action given a state will be scaled up or down. In other words, the gradient of the policy is weighted in proportion to how good the reward of a given sequence was.
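Putting the pieces together, a minimal sketch of this estimator for the linear-softmax policy from the first snippet could look as follows; trajectories are assumed to be lists of \( (s_{t}, u_{t}, r_{t}) \) tuples collected by rolling out \( \pi_{\theta} \), and for that policy \( \nabla_{\theta}\log{\pi_{\theta}(u|s)} \) has the closed form \( (\mathbf{1}_{u} - \pi_{\theta}(\cdot|s))\,s^{\top} \).

```python
def grad_log_pi(theta, s, u):
    """grad_theta log pi_theta(u|s) for the linear-softmax policy above:
    (one_hot(u) - pi(.|s)) outer s."""
    probs = policy(theta, s)
    one_hot = np.zeros_like(probs)
    one_hot[u] = 1.0
    return np.outer(one_hot - probs, s)

def policy_gradient_estimate(theta, trajectories):
    """g_hat: average over trajectories of (sum_t grad log pi) * R(tau)."""
    g = np.zeros_like(theta)
    for traj in trajectories:
        R_tau = sum(r for _, _, r in traj)                        # R(tau^(i))
        score = sum(grad_log_pi(theta, s, u) for s, u, _ in traj)  # sum_t grad log pi
        g += score * R_tau
    return g / len(trajectories)

# A gradient-ascent step on U(theta) would then be: theta = theta + alpha * g_hat
```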