Performance Difference Lemma#

Performance Difference Lemma (Discounted)#

I’ve been seeing the PDL in a lot fo papers in RL. There are pretty good visualizations on the rollout intuition but I wanted to practice writing down a rigorous proof. I sometimes feel confused about the distributions under the expectations. So here’s my attempt at being rigorous about it.

\[ \begin{align}\begin{aligned}\begin{split} J(\pi_1) - J(\pi_2) = \underset{s \sim \mu}{\mathbb{E}}[V^{\pi_1}(s) - V^{\pi_2}(s)] \\ \end{split}\\\begin{split}= \underset{s \sim \mu}{\mathbb{E}} \bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}}[Q^{\pi_1}(s,a)] - V^{\pi_2} \bigr] \\ \end{split}\end{aligned}\end{align} \]
\[ \begin{align}\begin{aligned}\begin{split} = \underset{s \sim \mu}{\mathbb{E}} \bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} [Q^{\pi_1(s,a)} - \color{green}{Q^{\pi_2}(s,a)} ] \bigr] + \underset{s \sim \mu}{\mathbb{E}} \bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} [ \color{green}{Q^{\pi_2}(s,a)} - V^{\pi_2}(s) ] \bigr] \\ \end{split}\\ = \boxed{\underset{s \sim \mu}{\mathbb{E}} \bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} [Q^{\pi_1(s,a)} - \color{green}{Q^{\pi_2}(s,a)}] \bigr]} + \underset{s, a \sim \mu . \pi_1}{\mathbb{E}}[A^{\pi_2}(s,a)] \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \end{aligned}\end{align} \]

We can now focus on the boxed term and and we will soon see a recursive expression pop up

\[ \begin{align}\begin{aligned}\begin{split} \underset{s \sim \mu}{\mathbb{E}} \bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} [Q^{\pi_1(s,a)} - Q^{\pi_2}(s,a)] \bigr] = \\ \underset{s \sim \mu}{\mathbb{E}} \Bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} \underset{s' \sim \mathbb{P}(.|s,a)}{\mathbb{E}} [ r(s,a) + \gamma V^{\pi_1}(s') - r(s,a) - V^{\pi_2}(s') ] \Bigr] \\ \end{split}\\= \underset{s \sim \mu}{\mathbb{E}} \Bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} \underset{s' \sim \mathbb{P}(.|s,a)}{\mathbb{E}} [\gamma V^{\pi_1}(s') - V^{\pi_2}(s') ] \Bigr] \end{aligned}\end{align} \]

Notice that the expectations can be interpreted as follows - Start from an initial state distribution \(\mu\). Follow \(\pi_1\). Now see what is the distribution of states at time \(t=1\). We can write it as follows

\[ = \gamma \underset{s \sim d_1^{\pi_1}}{\mathbb{E}} [ V^{\pi_1}(s) - V^{\pi_2}(s)] \]

Notice that this is very similar to the term we started with - there is an additional discount factor now. We can keep unrolling this and continue

\[ J(\pi_1) - J(\pi_2) = \sum_{h=0}^{\infty} \gamma^h \underset{s, a \sim d_h^{\pi_1} . \pi_1 }{\mathbb{E}}[A^{\pi_2}(s,a)] \]

Performance Difference Lemma (Finite Horizon)#

\[ \begin{align}\begin{aligned}\begin{split} J(\pi) - J(\pi') = \underset{s_0 \sim \mu_0}{\mathbb{E}}[V_0^{\pi}(s_0) - V_0^{\pi'}(s_0)] \\\end{split}\\\begin{split}= \underset{s_{\mu_0}}{\mathbb{E}} \bigl[ \underset{a_0 \sim \pi_0(.|s)}{\mathbb{E}} [Q_0^{\pi}(s_0, a_0)] - V_0^{\pi'}(s_0) \bigr] \\ \end{split}\\\begin{split}= \underbrace{\underset{s_0 \sim \mu_0}{\mathbb{E}} \Bigl[ \underset{a_0 \sim \pi_0(.|s)}{\mathbb{E}} [Q_0^{\pi}(s_0, a_0) - Q_0^{\pi'}(s_0, a_0)] \Bigr]}_{\text{term 1}} \\ \end{split}\\+ \underbrace{\underset{s_0 \sim \mu_0}{\mathbb{E}} \Bigl[ \underset{a_0 \sim \pi_0(.|s)}{\mathbb{E}} [Q_0^{\pi'}(s_0, a_0) - V_0^{\pi'}(s_0)] \Bigr]}_{\text{term 2}}\end{aligned}\end{align} \]
\[ \text{term 2} = \underset{s_0, a_0 \sim \mu_0, \pi_0(.|s_0)}{\mathbb{E}}[A_0^{\pi'}(s_0, a_0)] \]

Now we will expand term-1. Let’s expand the Q terms

\[ Q_0^{\pi}(s_0,a_0) - Q_0^{\pi'}(s_0,a_0) = \cancel{r(s_0, a_0)} + V_1^{\pi}(s_1) - \cancel{r(s_0, a_0)} - V_1^{\pi'}(s_1) \]
\[ \begin{align}\begin{aligned}\begin{split} \underset{s_0 \sim \mu_0}{\mathbb{E}} \Bigl[ \underset{a_0 \sim \pi_0(.|s)}{\mathbb{E}} [Q_0^{\pi}(s_0, a_0) - Q_0^{\pi'}(s_0, a_0)] \Bigr] = \underset{s_0 \sim \mu_0}{\mathbb{E}}\Bigl[ \underset{a_0 \sim \pi_0(.|s_0)}{\mathbb{E}} \bigl[ \underset{s_1 \sim P(s_1|s_0,a_0)}{\mathbb{E}}[V_1^{\pi}(s_1) - V_1^{\pi'}(s_1)] \bigr] \Bigr] \\ \end{split}\\= \underset{s_1 \sim \mu_1}{\mathbb{E}}[V_1^{\pi}(s_1) - V_1^{\pi'}(s_1)] \end{aligned}\end{align} \]

where \(\mu_1(s_1) = \sum_{s_0 \sim \mu_0} \sum_{a_0 \sim \pi_0(.|s_0)} P(s_1|s_0, a_0)\)

We are now seeing a recursive equation. This is the same as the term we started with but now we are at a new state distribution

\[ J(\pi) - J(\pi') = \sum_{h=0}^{H-1} \underset{s_h, a_h \sim \mu_h . \pi_h}{\mathbb{E}}[A_h^{\pi'}(s_h, a_h)] \]