Performance Difference Lemma (Discounted)
I’ve been seeing the performance difference lemma (PDL) in a lot of RL papers. There are pretty good visualizations of the rollout intuition, but I wanted to practice writing down a rigorous proof. I sometimes get confused about which distributions the expectations are taken over, so here’s my attempt at being careful about it.
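The symbols in the derivation are used without introduction, so here is the notation it appears to rely on. These are the usual discounted-MDP definitions, stated as assumptions: reward \(r(s,a)\), transition kernel \(\mathbb{P}(.|s,a)\), discount \(\gamma \in [0,1)\), and initial state distribution \(\mu\).
\[ \begin{aligned}
V^{\pi}(s) &= \mathbb{E}\Bigl[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \Bigm| s_0 = s,\ a_t \sim \pi(.|s_t),\ s_{t+1} \sim \mathbb{P}(.|s_t, a_t) \Bigr] \\
Q^{\pi}(s,a) &= r(s,a) + \gamma \underset{s' \sim \mathbb{P}(.|s,a)}{\mathbb{E}}[V^{\pi}(s')] \\
A^{\pi}(s,a) &= Q^{\pi}(s,a) - V^{\pi}(s) \\
J(\pi) &= \underset{s \sim \mu}{\mathbb{E}}[V^{\pi}(s)]
\end{aligned} \]
In particular, \(V^{\pi}(s) = \underset{a \sim \pi(.|s)}{\mathbb{E}}[Q^{\pi}(s,a)]\), which is the identity used in the second line of the derivation.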
\[ \begin{align}\begin{aligned}\begin{split}
J(\pi_1) - J(\pi_2) = \underset{s \sim \mu}{\mathbb{E}}[V^{\pi_1}(s) - V^{\pi_2}(s)] \\ \end{split}\\\begin{split}= \underset{s \sim \mu}{\mathbb{E}} \bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}}[Q^{\pi_1}(s,a)] - V^{\pi_2}(s) \bigr] \\ \end{split}\end{aligned}\end{align} \]
\[ \begin{align}\begin{aligned}\begin{split}
= \underset{s \sim \mu}{\mathbb{E}} \bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} [Q^{\pi_1}(s,a) - \color{green}{Q^{\pi_2}(s,a)} ] \bigr] + \underset{s \sim \mu}{\mathbb{E}} \bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} [ \color{green}{Q^{\pi_2}(s,a)} - V^{\pi_2}(s) ] \bigr] \\ \end{split}\\
= \boxed{\underset{s \sim \mu}{\mathbb{E}} \bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} [Q^{\pi_1}(s,a) - \color{green}{Q^{\pi_2}(s,a)}] \bigr]} + \underset{s, a \sim \mu . \pi_1}{\mathbb{E}}[A^{\pi_2}(s,a)]
\end{aligned}\end{align} \]
We can now focus on the boxed term, where we will soon see a recursive expression pop up.
\[ \begin{align}\begin{aligned}\begin{split}
\underset{s \sim \mu}{\mathbb{E}} \bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} [Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a)] \bigr] = \\ \underset{s \sim \mu}{\mathbb{E}} \Bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} \underset{s' \sim \mathbb{P}(.|s,a)}{\mathbb{E}} [ r(s,a) + \gamma V^{\pi_1}(s') - r(s,a) - \gamma V^{\pi_2}(s') ] \Bigr] \\ \end{split}\\= \gamma \underset{s \sim \mu}{\mathbb{E}} \Bigl[ \underset{a \sim \pi_1(.|s)}{\mathbb{E}} \underset{s' \sim \mathbb{P}(.|s,a)}{\mathbb{E}} [ V^{\pi_1}(s') - V^{\pi_2}(s') ] \Bigr]
\end{aligned}\end{align} \]
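The nested expectations can be read as follows: start from the initial state distribution \(\mu\), take one action with \(\pi_1\), and transition according to \(\mathbb{P}\). The resulting \(s'\) is distributed as the state at time \(t=1\) under \(\pi_1\), which we denote \(d_1^{\pi_1}\). The boxed term can therefore be written as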
\[
= \gamma \underset{s \sim d_1^{\pi_1}}{\mathbb{E}} [ V^{\pi_1}(s) - V^{\pi_2}(s)]
\]
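To make the distribution under the expectation explicit, here is one way to write the state distribution at time \(h\). This is a sketch for finite state and action spaces; the sum notation and the general index \(h\) are assumptions, since the text above only names \(d_1^{\pi_1}\).
\[
d_0^{\pi_1} = \mu, \qquad d_{h+1}^{\pi_1}(s') = \sum_{s} \sum_{a} d_h^{\pi_1}(s) \, \pi_1(a|s) \, \mathbb{P}(s'|s,a)
\]
Setting \(h=0\) gives exactly the law of \(s'\) in the triple expectation \(s \sim \mu\), \(a \sim \pi_1(.|s)\), \(s' \sim \mathbb{P}(.|s,a)\).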
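Notice that the rewritten boxed term looks just like the expression we started with, \(\underset{s \sim \mu}{\mathbb{E}}[V^{\pi_1}(s) - V^{\pi_2}(s)]\), except that states are drawn from \(d_1^{\pi_1}\) instead of \(\mu\) and there is an additional discount factor. Applying the same decomposition again at every time step \(h\), with \(d_h^{\pi_1}\) the state distribution at time \(h\) under \(\pi_1\), peels off one discounted advantage term per step, which yields the lemma.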
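One way to spell out the unrolling is the following sketch. The shorthand \(\Delta_h\) and the bound \(r_{\max}\) on \(|r(s,a)|\) are introduced here only for illustration. Define the value gap at time \(h\) and apply the same two steps as before (add and subtract \(Q^{\pi_2}\), then expand the \(Q\) difference one step):
\[
\Delta_h := \underset{s \sim d_h^{\pi_1}}{\mathbb{E}}[V^{\pi_1}(s) - V^{\pi_2}(s)], \qquad \Delta_h = \underset{s, a \sim d_h^{\pi_1} . \pi_1}{\mathbb{E}}[A^{\pi_2}(s,a)] + \gamma \Delta_{h+1}
\]
Since \(\Delta_0 = J(\pi_1) - J(\pi_2)\), applying the recursion \(H\) times gives
\[
J(\pi_1) - J(\pi_2) = \sum_{h=0}^{H-1} \gamma^h \underset{s, a \sim d_h^{\pi_1} . \pi_1}{\mathbb{E}}[A^{\pi_2}(s,a)] + \gamma^H \Delta_H
\]
With bounded rewards, \(|\Delta_H| \le 2 r_{\max} / (1-\gamma)\), so the remainder \(\gamma^H \Delta_H\) vanishes as \(H \to \infty\) and only the advantage sum remains: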
\[
J(\pi_1) - J(\pi_2) = \sum_{h=0}^{\infty} \gamma^h \underset{s, a \sim d_h^{\pi_1} . \pi_1 }{\mathbb{E}}[A^{\pi_2}(s,a)]
\]
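As a sanity check, the identity can be verified numerically on a small random tabular MDP. The sketch below is not part of the derivation: the helper names (`random_dist`, `q_from_v`, `evaluate`, `pdl_sides`), the problem sizes, and the use of iterative policy evaluation are all assumptions chosen for illustration. It computes both sides of the lemma, truncating the infinite sum at a large horizon.

```python
import random


def random_dist(n, rng):
    """Return a random probability vector of length n."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def q_from_v(V, P, r, gamma):
    """Q(s,a) = r(s,a) + gamma * E_{s' ~ P(.|s,a)}[V(s')]."""
    S, A = len(r), len(r[0])
    return [[r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(S))
             for a in range(A)] for s in range(S)]


def evaluate(pi, P, r, gamma, iters=2000):
    """Iterative policy evaluation; returns (V, Q) for policy pi."""
    S, A = len(r), len(r[0])
    V = [0.0] * S
    for _ in range(iters):
        Q = q_from_v(V, P, r, gamma)
        V = [sum(pi[s][a] * Q[s][a] for a in range(A)) for s in range(S)]
    return V, q_from_v(V, P, r, gamma)


def pdl_sides(S=4, A=3, gamma=0.9, horizon=2000, seed=0):
    """Return (lhs, rhs) of the lemma on a random MDP (hypothetical setup)."""
    rng = random.Random(seed)
    P = [[random_dist(S, rng) for _ in range(A)] for _ in range(S)]  # P[s][a][s']
    r = [[rng.random() for _ in range(A)] for _ in range(S)]         # r[s][a]
    mu = random_dist(S, rng)
    pi1 = [random_dist(A, rng) for _ in range(S)]
    pi2 = [random_dist(A, rng) for _ in range(S)]

    V1, _ = evaluate(pi1, P, r, gamma)
    V2, Q2 = evaluate(pi2, P, r, gamma)

    # Left-hand side: J(pi1) - J(pi2) = E_{s ~ mu}[V^{pi1}(s) - V^{pi2}(s)]
    lhs = sum(mu[s] * (V1[s] - V2[s]) for s in range(S))

    # Per-state E_{a ~ pi1(.|s)}[A^{pi2}(s,a)]
    adv = [sum(pi1[s][a] * (Q2[s][a] - V2[s]) for a in range(A))
           for s in range(S)]

    # Right-hand side: sum_h gamma^h E_{s ~ d_h^{pi1}}[adv(s)], truncated
    rhs, d, discount = 0.0, list(mu), 1.0
    for _ in range(horizon):
        rhs += discount * sum(d[s] * adv[s] for s in range(S))
        # d_{h+1}(s') = sum_s sum_a d_h(s) pi1(a|s) P(s'|s,a)
        d = [sum(d[s] * pi1[s][a] * P[s][a][t]
                 for s in range(S) for a in range(A)) for t in range(S)]
        discount *= gamma
    return lhs, rhs
```

Calling `pdl_sides()` returns the two sides as a pair; they should agree up to floating-point error, because the truncated tail is on the order of \(\gamma^{H}\) for horizon \(H\). Note that the state distribution is always rolled forward with \(\pi_1\), while the advantage is always that of \(\pi_2\), which is exactly the asymmetry in the lemma.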