2 Classic Models
2.1 DQN
2.1.1 Core Formulas
\[
\begin{aligned}
Q^*(s,a)
&=\mathbb{E}_{s'\sim P(\cdot|s,a)} \left[ r(s,a)+\gamma\cdot \max_{a'\in A}Q^*(s',a') \right]\\
&=r(s,a)+\gamma\cdot \sum_{s'\in S}P(s'|s,a)\cdot \max_{a'\in A}Q^*(s',a')\\
\end{aligned}
\]
- Temporal-difference (TD) error
\(\delta = r+\gamma\cdot \max_{a'\in A} Q(s',a')-Q(s,a)\)
2.1.2 The Overestimation Problem
- The max operator causes action values to be overestimated
- For samples \(\{x_1,\cdots,x_n\}\) of a random variable \(X\), adding zero-mean noise gives \(\{z_1,\cdots,z_n\}\); then \(\mathbb{E}\left[\max (z_1,\cdots,z_n) \right]\geq \max(x_1,\cdots,x_n)\) (a numerical sketch follows this list)
- Bootstrapping propagates the overestimation
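A quick numerical check of this effect, as referenced above: a minimal sketch using only `numpy`, where the ten equal true action values are an illustrative assumption.
```python
import numpy as np

rng = np.random.default_rng(0)

x = np.zeros(10)                    # true values of 10 "actions": max(x) = 0
trials = [
    np.max(x + rng.normal(0.0, 1.0, size=x.size))   # z_i = x_i + zero-mean noise
    for _ in range(10_000)
]
print(np.mean(trials))              # ~1.5 > 0 = max(x): the max operator systematically overestimates
```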
2.1.3 Common Variants
- Target Network:\(\delta = r+\gamma\cdot Q_{\textcolor{red}{\omega^+}}(s',\arg \max_{a'} Q_{\textcolor{red}{\omega^+}}(s',a'))-Q_{\textcolor{blue}{\omega}}(s,a)\)
- Double DQN:\(\delta = r+\gamma\cdot Q_{\textcolor{red}{\omega^+}}(s',\arg \max_{a'} Q_{\textcolor{blue}{\omega}}(s',a'))-Q_{\textcolor{blue}{\omega}}(s,a)\) (see the code sketch after this list)
- Dueling DQN:\(Q_{\omega,\alpha,\beta}(s,a) = V_{\omega, \alpha}(s)+A_{\omega,\beta}(s,a)-\max_{\hat{a}\in A} A_{\omega,\beta}(s,\hat{a})\)
- Subtracting the maximum advantage at the end keeps \(V\) and \(A\) uniquely identified, so their values cannot drift arbitrarily
- Prioritized Experience Replay (PER): build sampling priorities over the replay buffer from the absolute TD error, with two hyperparameters \(\alpha\) and \(\beta\): \(\alpha\) controls the trade-off between uniform and prioritized sampling, and \(\beta\) controls the importance-sampling correction exponent
- Multi-step TD targets
- Noisy Net: add Gaussian noise to the network parameters
- Distributional DQN: estimate a distribution over returns instead of a single action value
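A minimal sketch of the Double DQN target from the list above (the `q_online`/`q_target` callables and the batch arrays are assumed to exist; `numpy` only):
```python
import numpy as np

def double_dqn_target(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    """r + gamma * Q_target(s', argmax_a' Q_online(s', a')) for non-terminal transitions."""
    a_star = np.argmax(q_online(next_states), axis=1)     # action selection: online network
    q_next = q_target(next_states)                        # action evaluation: target network
    q_eval = q_next[np.arange(len(a_star)), a_star]
    return rewards + gamma * (1.0 - dones) * q_eval
```
Using `q_target` for the argmax as well recovers the plain target-network update from the first bullet.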
2.2 Policy Gradient
2.2.1 Core Formulas
Part 1:
\[
\begin{aligned}
\nabla_\theta P(\tau|\theta)&=P(\tau|\theta)\cdot \nabla_\theta \log P(\tau|\theta)\\
P(\tau|\theta)&=\rho_0(s_0)\prod_{t=0}^{T}P(s_{t+1}|s_t,a_t)\cdot \pi_\theta(a_t|s_t)\\
\log P(\tau|\theta)&=\log \rho_0(s_0)+\sum_{t=0}^{T}\left(\log P(s_{t+1}|s_t,a_t)+ \log \pi_\theta(a_t|s_t)\right)\\
\nabla_\theta \log P(\tau|\theta)&=\nabla_\theta \log \rho_0(s_0)+\sum_{t=0}^{T}\left(\nabla_\theta \log P(s_{t+1}|s_t,a_t)+ \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\\
&=\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t|s_t)\\
\end{aligned}
\]
Part 2:
\[
\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta \mathbb{E}_{\tau\sim \pi_\theta}[R(\tau)]\\
&=\nabla_\theta \sum_\tau P(\tau|\theta)\cdot R(\tau)\\
&=\sum_\tau\nabla_\theta P(\tau|\theta)\cdot R(\tau)\\
&=\sum_\tau \left [P(\tau|\theta)\cdot \nabla_\theta \log P(\tau|\theta)\right ]\cdot R(\tau)\\
&=\mathbb{E}_{\tau\sim\pi_\theta} \left [ \nabla_\theta \log P(\tau| \theta)\cdot R(\tau) \right ]\\
&=\mathbb{E}_{\tau\sim\pi_\theta} \left [\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t|s_t)\cdot R(\tau) \right ]\\
\end{aligned}
\]
Part 3:
\[
\begin{aligned}
\hat{g}&=\frac{1}{|D|}\sum_{\tau\in D}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\cdot R(\tau)\\
\end{aligned}
\]
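Here \(D\) is a batch of trajectories collected with the current policy. In code, \(\hat{g}\) is usually obtained by differentiating a surrogate loss; a minimal PyTorch-style sketch, where the `policy` object (assumed to return a `torch.distributions` distribution) and the `states`/`actions`/`returns` tensors are assumptions:
```python
import torch

def pg_surrogate_loss(policy, states, actions, returns):
    """Scalar loss whose gradient is -g_hat (up to a constant factor absorbed by the learning rate)."""
    dist = policy(states)                 # assumed to return a torch.distributions object
    log_probs = dist.log_prob(actions)    # log pi_theta(a_t | s_t)
    return -(log_probs * returns).mean()  # returns holds R(tau), broadcast over each trajectory's steps

# pg_surrogate_loss(policy, states, actions, returns).backward()  # accumulates the gradient estimate
```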
2.2.2 Common Variants
- Generalized form: \(\nabla_\theta J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta} \left [\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t|s_t)\cdot \Phi_t \right ]\)
- REINFORCE:\(\Phi_t=G_t=\sum_{t'=t}^T \gamma^{t'-t}\cdot r_{t'}\) (the reward-to-go; see the sketch after this list)
- REINFORCE with Baseline: \(\Phi_t=G_t-b(s_t)\), where \(b(s_t)\) must not depend on \(a\); \(V^\pi(s_t)\) is the usual choice
- Actor-Critic:\(\Phi_t=Q^\pi(s_t,a_t)\)
- A2C:\(\Phi_t=A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)=r(s_t,a_t)+\gamma\cdot V^\pi(s_{t+1})-V^\pi(s_t)\)
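A minimal sketch of the reward-to-go quantity \(G_t\) used by REINFORCE above (`numpy` only):
```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed backwards in O(T)."""
    g = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g

# rewards_to_go([1.0, 1.0, 1.0], gamma=0.5) -> array([1.75, 1.5, 1.0])
```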
2.3 TRPO
2.3.1 Core Formulas
The performance difference between the new and old policies:
\[
\begin{aligned}
\rho^\pi(s)&=\sum_{t=0}^\infty\gamma^t\cdot P^\pi_t(s)\\
\eta(\pi)&=\mathbb{E}_{\tau\sim\pi}\left[ \sum_{t=0}^\infty \gamma^t\cdot R(s_t,a_t,s_{t+1}) \right]\\
\mathbb{E}_{\tau\sim \hat{\pi}}\left[ \sum_{t=0}^\infty \gamma^t\cdot A^\pi(s_t,a_t) \right]
&=\mathbb{E}_{\tau\sim \hat{\pi}}\left[ \sum_{t=0}^\infty \gamma^t\cdot \left( R(s_t,a_t,s_{t+1})+\gamma V^\pi(s_{t+1})-V^\pi(s_t) \right) \right]\\
&=\eta(\hat{\pi})+\mathbb{E}_{\tau\sim \hat{\pi}}\left[ \sum_{t=0}^\infty \gamma^{t+1} V^\pi(s_{t+1})- \sum_{t=0}^\infty \gamma^t V^\pi(s_t) \right]\\
&=\eta(\hat{\pi})+\mathbb{E}_{\tau\sim \hat{\pi}}\left[ \sum_{t=1}^\infty \gamma^{t} V^\pi(s_{t})- \sum_{t=0}^\infty \gamma^t V^\pi(s_t) \right]\\
&=\eta(\hat{\pi})-\mathbb{E}_{\tau\sim \hat{\pi}}\left[ V^\pi(s_0) \right]\\
&=\eta(\hat{\pi})-\eta(\pi)\\
\end{aligned}
\]
Surrogate Function:
\[
\begin{aligned}
\eta(\hat{\pi})&=\eta(\pi)+\mathbb{E}_{\tau\sim \hat{\pi}}\left[ \sum_{t=0}^\infty \gamma^t\cdot A^\pi(s_t,a_t) \right]\\
&=\eta(\pi)+ \sum_{s}\rho^{\textcolor{blue}{\hat{\pi}}}(s)\sum_a \hat{\pi}(a|s)\cdot A^\pi(s,a)\\
L_\pi(\hat{\pi})&=\eta(\pi)+ \sum_{s}\rho^{\textcolor{red}{\pi}}(s)\sum_a \hat{\pi}(a|s)\cdot A^\pi(s,a)\\
\eta(\hat{\pi})& \geq L_{\pi}(\hat{\pi})-\frac{4\varepsilon\gamma}{(1-\gamma)^2}\alpha^2 \
\begin{cases}
\varepsilon=\max_{s,a}|A^\pi(s,a)|\\
\alpha = D_{TV}^{max}(\pi,\hat{\pi})\\
D_{TV}^{max}(\pi,\hat{\pi}) = \max_s D_{TV}(\pi(\cdot|s)||\hat{\pi}(\cdot|s))\\
D_{TV}(p||q)=\frac{1}{2}\sum_{i}|p_i-q_i|
\end{cases}\\
\eta(\hat{\pi})& \geq L_{\pi}(\hat{\pi})-C\cdot D_{KL}^{max}(\pi, \hat{\pi}) \
\begin{cases}
(D_{TV}(p||q))^2 \leq D_{KL}(p||q) \\
C=\frac{4\varepsilon \gamma}{(1-\gamma)^2}\\
\end{cases}\\
\end{aligned}
\]
The constrained optimization problem:
\[
\begin{aligned}
\max_\pi L_{\pi_{old}}(\pi) \ \ &\text{s.t.} \ D_{KL}^{max}(\pi_{old}, \pi)\leq \delta \\
\max_\pi L_{\pi_{old}}(\pi) \ \ &\text{s.t.} \ \textcolor{blue}{\bar{D}_{KL}} (\pi_{old}, \pi)\leq \delta \\
\max_\pi \sum_{s}\rho^{\pi_{old}}(s)\sum_a \pi(a|s)\cdot A^{\pi_{old}}(s,a) \ \ &\text{s.t.} \ \bar{D}_{KL} (\pi_{old}, \pi)\leq \delta \\
\max_\pi \mathbb{E}_{s\sim\rho^{\pi_{old}},a\sim \pi_{old}}\left[ \frac{\pi(a|s)}{\pi_{old}(a|s)}\cdot A^{\pi_{old}}(s,a) \right] \ \ &\text{s.t.} \ \bar{D}_{KL} (\pi_{old}, \pi) \leq \delta \\
\end{aligned}
\]
Solving the optimization: KKT conditions + the natural gradient method, sketched below.
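A sketch of that step (standard TRPO machinery, stated here for completeness): linearize \(L_{\pi_{old}}\) and take a second-order approximation of the averaged KL around \(\theta_{old}\); the KKT conditions then give the natural-gradient update, where \(H^{-1}g\) is computed with conjugate gradients and followed by a line search.
\[
\begin{aligned}
&\max_\theta \ g^\top(\theta-\theta_{old}) \ \ \text{s.t.} \ \frac{1}{2}(\theta-\theta_{old})^\top H(\theta-\theta_{old})\leq \delta\\
&\theta_{new}=\theta_{old}+\sqrt{\frac{2\delta}{g^\top H^{-1}g}}\cdot H^{-1}g,
\ \ g=\nabla_\theta L_{\pi_{old}}(\pi_\theta)\big|_{\theta_{old}},
\ \ H=\nabla^2_\theta \bar{D}_{KL}(\pi_{old},\pi_\theta)\big|_{\theta_{old}}\\
\end{aligned}
\]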
2.4 PPO
2.4.1 Core Formulas
PPO-Penalty:
\[
\begin{aligned}
&\max_\pi \mathbb{E}_{s\sim\rho^{\pi_{old}},a\sim \pi_{old}}\left[ \frac{\pi(a|s)}{\pi_{old}(a|s)}\cdot A^{\pi_{old}}(s,a) - \beta\cdot D_{KL} (\pi_{old}, \pi) \right] \\
&\begin{cases}
\beta_{k+1} = \beta_k/2 & D_{KL} (\pi_{old}, \pi)<\delta/1.5\\
\beta_{k+1} = \beta_k\cdot 2 & D_{KL} (\pi_{old}, \pi)>\delta\cdot 1.5\\
\beta_{k+1} = \beta_k & \text{otherwise}\\
\end{cases}
\end{aligned}
\]
PPO-Clip:
\[
\max_\pi \mathbb{E}_{s\sim\rho^{\pi_{old}},a\sim \pi_{old}}\left[ \min \left( \frac{\pi(a|s)}{\pi_{old}(a|s)}\cdot A^{\pi_{old}}(s,a), \text{clip}\left(\frac{\pi(a|s)}{\pi_{old}(a|s)},1-\varepsilon, 1+\varepsilon\right)\cdot A^{\pi_{old}}(s,a) \right )\right]
\]
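A minimal PyTorch-style sketch of the clipped objective, flipped into a loss to minimize (the `log_probs_new`/`log_probs_old`/`advantages` tensors are assumed pre-computed):
```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Negative of E[min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)]."""
    ratio = torch.exp(log_probs_new - log_probs_old.detach())   # pi(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```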
2.5 DDPG
2.5.1 Core Formulas
The deterministic policy gradient theorem:
\[
\begin{aligned}
J(\mu_\theta)
&=\mathbb{E}_{s\sim\textcolor{red}{\rho^{\hat{\pi}}}}\left[ Q_\omega(s,\mu_\theta(s)) \right]\\
\nabla_\theta J(\mu_\theta)
&= \mathbb{E}_{s\sim\rho^{\hat{\pi}}}\left[\frac{\partial Q_\omega(s,\mu_\theta(s))}{\partial \theta}\right]\\
&= \mathbb{E}_{s\sim\rho^{\hat{\pi}}}\left[\frac{\partial Q_\omega(s,\mu_\theta(s))}{\partial \mu_\theta(s)}\cdot \frac{\partial \mu_\theta(s)}{\partial \theta}\right]\\
&= \mathbb{E}_{s\sim\rho^{\hat{\pi}}}\left[\nabla_\theta \mu_\theta(s)\cdot \nabla_a Q_\omega(s,a)\big|_{a=\mu_\theta(s)}\right] \\
\end{aligned}
\]
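A minimal PyTorch-style sketch of the resulting actor update: rather than forming \(\nabla_\theta\mu_\theta\cdot\nabla_a Q_\omega\) by hand, autograd differentiates \(Q_\omega(s,\mu_\theta(s))\) with respect to \(\theta\) directly (the `actor`, `critic`, and `states` objects are assumptions):
```python
import torch

def ddpg_actor_loss(actor, critic, states):
    """Maximize E_s[Q_w(s, mu_theta(s))] by minimizing its negative."""
    actions = actor(states)                  # a = mu_theta(s), differentiable in theta
    return -critic(states, actions).mean()   # autograd applies the chain rule grad_theta(mu) * grad_a(Q)

# only the actor optimizer is stepped with this loss; the critic is trained from its own TD loss
```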
2.5.2 On Being Off-Policy
DDPG is closely related to Q-Learning and shares the same motivation: if the optimal action-value function \(Q^*(s,a)\) is known, then in any given state \(s\) the optimal action is \(a^*=\arg \max_a Q^*(s,a)\).
DDPG assumes \(Q\) is differentiable with respect to \(a\) and parameterizes the optimal action \(a^*\) as \(\mu_\theta(s)\), giving the approximation \(\max_a Q^*(s,a)\approx Q_\omega(s,\mu_\theta(s))\).
Hence, like DQN, DDPG is off-policy.
2.6 SAC
2.6.1 Core Formulas
Objective: maximum-entropy reinforcement learning:
\[
\pi^*_{MaxEnt}=\arg \max_\pi \sum_t \mathbb{E}_{(s_t,a_t)\sim \rho^\pi}\left[ r(s_t,a_t)+\alpha\cdot \textcolor{red}{H(\pi(\cdot|s_t))}\right]
\]
Soft Q-function:
\[
\begin{aligned}
Q_{soft}(s_t,a_t)&=r(s_t,a_t)+\mathbb{E}_{\tau\sim\pi}\left[\sum_{l=1}^\infty \gamma^l\cdot [r(s_{t+l},a_{t+l})+\alpha\cdot H(\pi(\cdot|s_{t+l}))]\right]\\
&\Updownarrow\\
r_{soft}(s_t,a_t)&=r(s_t,a_t)+\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1}}\left[H(\pi(\cdot|s_{t+1}))\right]\\
\end{aligned}
\]
Soft Bellman Equation:
\[
\begin{aligned}
Q_{soft}(s_t,a_t)&=\textcolor{red}{r_{soft}(s_t,a_t)}+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]+\textcolor{red}{\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1}}\left[H(\pi(\cdot|s_{t+1}))\right]}\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]+\textcolor{red}{\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[-\log \pi(a_{t+1}|s_{t+1})\right]}\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[\textcolor{green}{Q_{soft}(s_{t+1},a_{t+1})-\alpha\cdot \log(\pi(a_{t+1}|s_{t+1}))}\right] \\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1}}\left[\textcolor{blue}{V_{soft}(s_{t+1})}\right]\\
&\Downarrow\\
V_{soft}(s_t)&=\mathbb{E}_{a_t}\left[Q_{soft}(s_t,a_t)-\alpha\cdot\log\pi(a_t|s_t)\right]\\
\end{aligned}
\]
Policy Improvement:
\[
\begin{aligned}
\pi_{new}=\arg\min_{\pi'\in \Pi} \mathrm{D}_{\mathrm{KL}}\left(\pi'(\cdot|s_t)\bigg\|\frac{\exp(Q^{\pi_\mathrm{old}}(s_t,\cdot))}{Z^{\pi_\mathrm{old}}(s_t)}\right)
\end{aligned}
\]
Training Objectives:
\[
\begin{aligned}
J_{V}(\psi)&=\mathbb{E}_{s_t\sim D}\left[\frac{1}{2}\left(V_\psi(s_t)-\mathbb{E}_{a_t\sim\pi_\phi}\left[Q_\theta(s_t,a_t)-\log\pi_\phi(a_t|s_t)\right]\right)^2\right]\\
J_{Q}(\theta)&=\mathbb{E}_{(s_t,a_t)\sim D}\left[\frac{1}{2}\left(Q_\theta(s_t,a_t)-\hat{Q}_\theta(s_t,a_t)\right)^2\right]\\
J_\pi(\phi)&=\mathbb{E}_{s_t\sim D}\left[\text{D}_{\text{KL}}\left(\pi_{\phi}(\cdot|s_t)\bigg\|\frac{\exp (Q_\theta(s_t,\cdot))}{Z_\theta(s_t)}\right)\right]
\end{aligned}
\]
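A hedged PyTorch-style sketch of how these three objectives are commonly implemented with reparameterized actions (the network modules, the `policy.sample` API, the temperature `alpha`, and the replay batch are all assumptions; \(J_\pi\) is written in its equivalent form \(\mathbb{E}[\alpha\log\pi - Q]\) rather than as an explicit KL):
```python
import torch

def sac_losses(value_net, q_net, policy, batch, alpha=0.2, gamma=0.99):
    s, a, r, s2, done = batch                         # replay-buffer batch (assumed tensors)

    # J_V: V_psi(s) regresses E_{a~pi}[Q(s, a) - alpha * log pi(a|s)].
    a_pi, logp = policy.sample(s)                     # reparameterized action + log-prob (assumed API)
    v_target = (q_net(s, a_pi) - alpha * logp).detach()
    loss_v = 0.5 * (value_net(s) - v_target).pow(2).mean()

    # J_Q: soft Bellman target r + gamma * V(s'); the paper uses a slowly-updated target value network here.
    q_backup = (r + gamma * (1.0 - done) * value_net(s2)).detach()
    loss_q = 0.5 * (q_net(s, a) - q_backup).pow(2).mean()

    # J_pi: the KL projection, equivalent to minimizing E[alpha * log pi(a|s) - Q(s, a)].
    loss_pi = (alpha * logp - q_net(s, a_pi)).mean()
    return loss_v, loss_q, loss_pi
```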
2.7 TD3
*2.8 Soft Q-Learning
Objective: maximum-entropy reinforcement learning:
\[
\pi^*_{MaxEnt}=\arg \max_\pi \sum_t \mathbb{E}_{(s_t,a_t)\sim \rho^\pi}\left[ r(s_t,a_t)+\alpha\cdot \textcolor{red}{H(\pi(\cdot|s_t))}\right]
\]
The Boltzmann distribution describes the probability of a particle occupying a particular state, as a function of the state's energy and the system temperature. Let \(p_\alpha\) be the probability of the particle being in state \(\alpha\), \(\mathcal{E}_\alpha\) the energy of state \(\alpha\), \(k\) the Boltzmann constant, and \(T\) the system temperature; then:
\[
p_\alpha=\frac{1}{Z}\exp \left( \frac{-\mathcal{E}_\alpha}{k \ T} \right)
\]
Here \(\exp\left(\frac{-\mathcal{E}_\alpha}{k \ T} \right)\) is the (unnormalized) Boltzmann factor and \(Z=\sum_\alpha \exp \left( \frac{-\mathcal{E}_\alpha}{k \ T} \right)\) is the partition function (the sum over all states). Under the Boltzmann distribution, lower-energy states are occupied with higher probability.
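A small numerical illustration of the last statement, assuming \(k\,T=1\):
```python
import numpy as np

energies = np.array([0.5, 1.0, 3.0])   # E_alpha for three states
factors = np.exp(-energies)            # Boltzmann factors with k*T = 1
p = factors / factors.sum()            # normalize by the partition function Z
print(p)                               # ~[0.59, 0.36, 0.05]: lowest energy -> highest probability
```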
Soft Q-function:
\[
\begin{aligned}
Q_{soft}(s_t,a_t)&=r(s_t,a_t)+\mathbb{E}_{\tau\sim\pi}\left[\sum_{l=1}^\infty \gamma^l\cdot [r(s_{t+l},a_{t+l})+\alpha\cdot H(\pi(\cdot|s_{t+l}))]\right]\\
&\Updownarrow\\
r_{soft}(s_t,a_t)&=r(s_t,a_t)+\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1}}\left[H(\pi(\cdot|s_{t+1}))\right]\\
\end{aligned}
\]
Soft V-function:
\[
\begin{aligned}
V_{soft}(s_t)&=\alpha\cdot \log \int \exp\left(\frac{1}{\alpha}Q_{soft}(s_t,a)\right)\mathrm{d}a\\
&\overset{\alpha=1}{=}\log \int \exp (Q_{soft}(s_t,a))\mathrm{d}a \rightarrow \text{LogSumExp, Soft Maximum}
\end{aligned}
\]
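A small numerical check that \(\alpha\log\int\exp(Q/\alpha)\) behaves as a soft maximum, using a discrete action set for simplicity (assumes `scipy` is available):
```python
import numpy as np
from scipy.special import logsumexp

q = np.array([1.0, 2.0, 3.0])               # Q_soft(s, a) over three discrete actions
for alpha in (1.0, 0.1, 0.01):
    v_soft = alpha * logsumexp(q / alpha)   # alpha * log sum_a exp(Q / alpha)
    print(alpha, v_soft)                    # ~3.41, then ever closer to max(q) = 3.0 as alpha shrinks
```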
Soft Bellman Equation:
\[
\begin{aligned}
Q_{soft}(s_t,a_t)&=\textcolor{red}{r_{soft}(s_t,a_t)}+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]+\textcolor{red}{\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1}}\left[H(\pi(\cdot|s_{t+1}))\right]}\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]+\textcolor{red}{\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[-\log \pi(a_{t+1}|s_{t+1})\right]}\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[\textcolor{green}{Q_{soft}(s_{t+1},a_{t+1})-\alpha\cdot \log(\pi(a_{t+1}|s_{t+1}))}\right] \\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1}}\left[\textcolor{blue}{V_{soft}(s_{t+1})}\right]\\
&\Downarrow\\
V_{soft}(s_t)&=\mathbb{E}_{a_t}\left[Q_{soft}(s_t,a_t)-\alpha\cdot\log\pi(a_t|s_t)\right]\\
\end{aligned}
\]
Policy: intuitively, we want the policy distribution to mirror the shape of the Q-function as closely as possible, i.e. \(\pi(a|s)\propto \exp Q(s,a)\)
\[
\begin{aligned}
\mathcal{E}(s_t,a_t)&=-\frac{1}{\alpha}\cdot Q_{soft}(s_t,a_t)\\
\pi(a_t|s_t)&\propto\exp(-\mathcal{E}(s_t,a_t))\\
&\propto\exp\left(\frac{1}{\alpha}Q_{soft}(s_t,a_t)\right)\\
\pi(a_t|s_t)&=\exp\left(\frac{1}{\alpha}(Q_{soft}(s_t,a_t)-V_{soft}(s_t))\right)
\end{aligned}
\]
Soft Q-Iteration:
\[
\begin{aligned}
Q_{soft}(s_t,a_t)&\leftarrow r(s_t,a_t)+\gamma\cdot\mathbb{E}_{s_{t+1}}\left[V_{soft}(s_{t+1})\right]\\
V_{soft}(s_t)&\leftarrow \alpha\log\int\exp\left(\frac{1}{\alpha}Q_{soft}(s_t,a)\right)\mathrm{d}a\\
\end{aligned}
\]
- The \(V_{soft}\) update requires integrating over the entire action space
- Parameterize \(Q_{soft}^\theta(s_t,a_t)\) with a neural network
- Convert it into a stochastic optimization problem via importance sampling (see the sketch at the end of this subsection)
\[
\begin{aligned}
V_{soft}^\theta(s_t)&=\alpha\log\mathbb{E}_{q_{a'}}\left[\frac{\exp (\frac{1}{\alpha}Q_{soft}^\theta(s_t,a'))}{q_{a'}(a')}\right]\\
&\Downarrow\\
J_Q(\theta)&=\mathbb{E}_{s_t\sim \textcolor{red}{q_{s_t}}, a_t\sim \textcolor{blue}{q_{a_t}}}\left[\frac{1}{2}\left(\hat{Q}_{soft}^\theta(s_t,a_t)-Q_{soft}^\theta(s_t,a_t)\right)^2\right]
\end{aligned}
\]
- The optimal policy follows an energy-based distribution, which is hard to sample from directly (the Soft Q-Learning paper trains an amortized sampling network with SVGD for this)
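A minimal `numpy` sketch of the importance-sampled estimate of \(V_{soft}\) above, for a one-dimensional action with a uniform proposal \(q\) (the `q_soft` callable, the action bounds, and \(\alpha\) are assumptions):
```python
import numpy as np

def soft_value_estimate(q_soft, state, alpha=1.0, low=-1.0, high=1.0, n=256, rng=None):
    """V_soft(s) ~ alpha * log E_{a'~q}[exp(Q_soft(s, a') / alpha) / q(a')], with q = Uniform(low, high)."""
    rng = rng or np.random.default_rng()
    a = rng.uniform(low, high, size=n)                  # proposal samples a' ~ q
    q_density = 1.0 / (high - low)                      # uniform proposal density q(a')
    weights = np.exp(q_soft(state, a) / alpha) / q_density
    return alpha * np.log(weights.mean())
```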