2 Classic Models
2.1 DQN
2.1.1 Core Formulas
\[
\begin{aligned}
Q^*(s,a)
&=\mathbb{E}_{s'\sim P(\cdot|s,a)} \left[ r(s,a)+\gamma\cdot \max_{a'\in A}Q^*(s',a') \right]\\
&=r(s,a)+\gamma\cdot \sum_{s'\in S}P(s'|s,a)\cdot \max_{a'\in A}Q^*(s',a')\\
\end{aligned}
\]
- Temporal-difference (TD) error
\(\delta = r+\gamma\cdot \max_{a'\in A} Q(s',a')-Q(s,a)\)
2.1.2 The Overestimation Problem
- The max operator causes action values to be overestimated
- For samples \(\{x_1,\cdots,x_n\}\) of a random variable \(X\), adding zero-mean noise gives \(\{z_1,\cdots,z_n\}\); then \(\mathbb{E}\left[\max (z_1,\cdots,z_n) \right]\geq \max(x_1,\cdots,x_n)\) (a numerical sketch follows this list)
- Bootstrapping propagates the overestimation
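A quick numerical check of this effect, as referenced above: a minimal sketch using only `numpy`, where the ten equal true action values are an illustrative assumption.
```python
import numpy as np

rng = np.random.default_rng(0)

x = np.zeros(10)                    # true values of 10 "actions": max(x) = 0
trials = [
    np.max(x + rng.normal(0.0, 1.0, size=x.size))   # z_i = x_i + zero-mean noise
    for _ in range(10_000)
]
print(np.mean(trials))              # ~1.5 > 0 = max(x): the max operator systematically overestimates
```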
2.1.3 Common Variants
- Target Network:\(\delta = r+\gamma\cdot Q_{\textcolor{red}{\omega^+}}(s',\arg \max_{a'} Q_{\textcolor{red}{\omega^+}}(s',a'))-Q_{\textcolor{blue}{\omega}}(s,a)\)
- Double DQN:\(\delta = r+\gamma\cdot Q_{\textcolor{red}{\omega^+}}(s',\arg \max_{a'} Q_{\textcolor{blue}{\omega}}(s',a'))-Q_{\textcolor{blue}{\omega}}(s,a)\) (see the code sketch after this list)
- Dueling DQN:\(Q_{\omega,\alpha,\beta}(s,a) = V_{\omega, \alpha}(s)+A_{\omega,\beta}(s,a)-\max_{\hat{a}\in A} A_{\omega,\beta}(s,\hat{a})\)
- Subtracting the maximum advantage at the end keeps \(V\) and \(A\) uniquely identified, so their values cannot drift arbitrarily
- Prioritized Experience Replay (PER): build sampling priorities over the replay buffer from the absolute TD error, with two hyperparameters \(\alpha\) and \(\beta\): \(\alpha\) controls the trade-off between uniform and prioritized sampling, and \(\beta\) controls the importance-sampling correction exponent
- Multi-step TD targets
- Noisy Net: add Gaussian noise to the network parameters
- Distributional DQN: estimate a distribution over returns instead of a single action value
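A minimal sketch of the Double DQN target from the list above (the `q_online`/`q_target` callables and the batch arrays are assumed to exist; `numpy` only):
```python
import numpy as np

def double_dqn_target(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    """r + gamma * Q_target(s', argmax_a' Q_online(s', a')) for non-terminal transitions."""
    a_star = np.argmax(q_online(next_states), axis=1)     # action selection: online network
    q_next = q_target(next_states)                        # action evaluation: target network
    q_eval = q_next[np.arange(len(a_star)), a_star]
    return rewards + gamma * (1.0 - dones) * q_eval
```
Using `q_target` for the argmax as well recovers the plain target-network update from the first bullet.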
2.2 Policy Gradient
2.2.1 Core Formulas
Part 1:
\[
\begin{aligned}
\nabla_\theta P(\tau|\theta)&=P(\tau|\theta)\cdot \nabla_\theta \log P(\tau|\theta)\\
P(\tau|\theta)&=\rho_0(s_0)\prod_{t=0}^{T}P(s_{t+1}|s_t,a_t)\cdot \pi_\theta(a_t|s_t)\\
\log P(\tau|\theta)&=\log \rho_0(s_0)+\sum_{t=0}^{T}\left(\log P(s_{t+1}|s_t,a_t)+ \log \pi_\theta(a_t|s_t)\right)\\
\nabla_\theta \log P(\tau|\theta)&=\nabla_\theta \log \rho_0(s_0)+\sum_{t=0}^{T}\left(\nabla_\theta \log P(s_{t+1}|s_t,a_t)+ \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\\
&=\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t|s_t)\\
\end{aligned}
\]
Part 2:
\[
\begin{aligned}
\nabla_\theta J(\pi_\theta) &= \nabla_\theta \mathbb{E}_{\tau\sim \pi_\theta}[R(\tau)]\\
&=\nabla_\theta \sum_\tau P(\tau|\theta)\cdot R(\tau)\\
&=\sum_\tau\nabla_\theta P(\tau|\theta)\cdot R(\tau)\\
&=\sum_\tau \left [P(\tau|\theta)\cdot \nabla_\theta \log P(\tau|\theta)\right ]\cdot R(\tau)\\
&=\mathbb{E}_{\tau\sim\pi_\theta} \left [ \nabla_\theta \log P(\tau| \theta)\cdot R(\tau) \right ]\\
&=\mathbb{E}_{\tau\sim\pi_\theta} \left [\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t|s_t)\cdot R(\tau) \right ]\\
\end{aligned}
\]
Part 3:
\[
\begin{aligned}
\hat{g}&=\frac{1}{|D|}\sum_{\tau\in D}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\cdot R(\tau)\\
\end{aligned}
\]
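Here \(D\) is a batch of trajectories collected with the current policy. In code, \(\hat{g}\) is usually obtained by differentiating a surrogate loss; a minimal PyTorch-style sketch, where the `policy` object (assumed to return a `torch.distributions` distribution) and the `states`/`actions`/`returns` tensors are assumptions:
```python
import torch

def pg_surrogate_loss(policy, states, actions, returns):
    """Scalar loss whose gradient is -g_hat (up to a constant factor absorbed by the learning rate)."""
    dist = policy(states)                 # assumed to return a torch.distributions object
    log_probs = dist.log_prob(actions)    # log pi_theta(a_t | s_t)
    return -(log_probs * returns).mean()  # returns holds R(tau), broadcast over each trajectory's steps

# pg_surrogate_loss(policy, states, actions, returns).backward()  # accumulates the gradient estimate
```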
2.2.2 Common Variants
- Generalized form: \(\nabla_\theta J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta} \left [\sum_{t=0}^{T}\nabla_\theta \log \pi_\theta(a_t|s_t)\cdot \Phi_t \right ]\)
- REINFORCE:\(\Phi_t=G_t=\sum_{t'=t}^T \gamma^{t'-t}\cdot r_{t'}\) (the reward-to-go; see the sketch after this list)
- REINFORCE with Baseline: \(\Phi_t=G_t-b(s_t)\), where \(b(s_t)\) must not depend on \(a\); \(V^\pi(s_t)\) is the usual choice
- Actor-Critic:\(\Phi_t=Q^\pi(s_t,a_t)\)
- A2C:\(\Phi_t=A^\pi(s_t,a_t)=Q^\pi(s_t,a_t)-V^\pi(s_t)=r(s_t,a_t)+\gamma\cdot V^\pi(s_{t+1})-V^\pi(s_t)\)
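A minimal sketch of the reward-to-go quantity \(G_t\) used by REINFORCE above (`numpy` only):
```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """G_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed backwards in O(T)."""
    g = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g

# rewards_to_go([1.0, 1.0, 1.0], gamma=0.5) -> array([1.75, 1.5, 1.0])
```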
2.3 TRPO
2.3.1 Core Formulas
The performance difference between the new and old policies:
\[
\begin{aligned}
\rho^\pi(s)&=\sum_{t=0}^\infty\gamma^t\cdot P^\pi_t(s)\\
\eta(\pi)&=\mathbb{E}_{\tau\sim\pi}\left[ \sum_{t=0}^\infty \gamma^t\cdot R(s_t,a_t,s_{t+1}) \right]\\
\mathbb{E}_{\tau\sim \hat{\pi}}\left[ \sum_{t=0}^\infty \gamma^t\cdot A^\pi(s_t,a_t) \right]
&=\mathbb{E}_{\tau\sim \hat{\pi}}\left[ \sum_{t=0}^\infty \gamma^t\cdot \left( R(s_t,a_t,s_{t+1})+\gamma V^\pi(s_{t+1})-V^\pi(s_t) \right) \right]\\
&=\eta(\hat{\pi})+\mathbb{E}_{\tau\sim \hat{\pi}}\left[ \sum_{t=0}^\infty \gamma^{t+1} V^\pi(s_{t+1})- \sum_{t=0}^\infty \gamma^t V^\pi(s_t) \right]\\
&=\eta(\hat{\pi})+\mathbb{E}_{\tau\sim \hat{\pi}}\left[ \sum_{t=1}^\infty \gamma^{t} V^\pi(s_{t})- \sum_{t=0}^\infty \gamma^t V^\pi(s_t) \right]\\
&=\eta(\hat{\pi})-\mathbb{E}_{\tau\sim \hat{\pi}}\left[ V^\pi(s_0) \right]\\
&=\eta(\hat{\pi})-\eta(\pi)\\
\end{aligned}
\]
Surrogate Function:
\[
\begin{aligned}
\eta(\hat{\pi})&=\eta(\pi)+\mathbb{E}_{\tau\sim \hat{\pi}}\left[ \sum_{t=0}^\infty \gamma^t\cdot A^\pi(s_t,a_t) \right]\\
&=\eta(\pi)+ \sum_{s}\rho^{\textcolor{blue}{\hat{\pi}}}(s)\sum_a \hat{\pi}(a|s)\cdot A^\pi(s,a)\\
L_\pi(\hat{\pi})&=\eta(\pi)+ \sum_{s}\rho^{\textcolor{red}{\pi}}(s)\sum_a \hat{\pi}(a|s)\cdot A^\pi(s,a)\\
\eta(\hat{\pi})& \geq L_{\pi}(\hat{\pi})-\frac{4\varepsilon\gamma}{(1-\gamma)^2}\alpha^2 \
\begin{cases}
\varepsilon=\max_{s,a}|A^\pi(s,a)|\\
\alpha = D_{TV}^{max}(\pi,\hat{\pi})\\
D_{TV}^{max}(\pi,\hat{\pi}) = \max_s D_{TV}(\pi(\cdot|s)||\hat{\pi}(\cdot|s))\\
D_{TV}(p||q)=\frac{1}{2}\sum_{i}|p_i-q_i|
\end{cases}\\
\eta(\hat{\pi})& \geq L_{\pi}(\hat{\pi})-C\cdot D_{KL}^{max}(\pi, \hat{\pi}) \
\begin{cases}
(D_{TV}(p||q))^2 \leq D_{KL}(p||q) \\
C=\frac{4\varepsilon \gamma}{(1-\gamma)^2}\\
\end{cases}\\
\end{aligned}
\]
The constrained optimization problem:
\[
\begin{aligned}
\max_\pi L_{\pi_{old}}(\pi) \ \ &\text{s.t.} \ D_{KL}^{max}(\pi_{old}, \pi)\leq \delta \\
\max_\pi L_{\pi_{old}}(\pi) \ \ &\text{s.t.} \ \textcolor{blue}{\bar{D}_{KL}} (\pi_{old}, \pi)\leq \delta \\
\max_\pi \sum_{s}\rho^{\pi_{old}}(s)\sum_a \pi(a|s)\cdot A^{\pi_{old}}(s,a) \ \ &\text{s.t.} \ \bar{D}_{KL} (\pi_{old}, \pi)\leq \delta \\
\max_\pi \mathbb{E}_{s\sim\rho^{\pi_{old}},a\sim \pi_{old}}\left[ \frac{\pi(a|s)}{\pi_{old}(a|s)}\cdot A^{\pi_{old}}(s,a) \right] \ \ &\text{s.t.} \ \bar{D}_{KL} (\pi_{old}, \pi) \leq \delta \\
\end{aligned}
\]
Solving the optimization: KKT conditions + the natural gradient method, sketched below.
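A sketch of that step (standard TRPO machinery, stated here for completeness): linearize \(L_{\pi_{old}}\) and take a second-order approximation of the averaged KL around \(\theta_{old}\); the KKT conditions then give the natural-gradient update, where \(H^{-1}g\) is computed with conjugate gradients and followed by a line search.
\[
\begin{aligned}
&\max_\theta \ g^\top(\theta-\theta_{old}) \ \ \text{s.t.} \ \frac{1}{2}(\theta-\theta_{old})^\top H(\theta-\theta_{old})\leq \delta\\
&\theta_{new}=\theta_{old}+\sqrt{\frac{2\delta}{g^\top H^{-1}g}}\cdot H^{-1}g,
\ \ g=\nabla_\theta L_{\pi_{old}}(\pi_\theta)\big|_{\theta_{old}},
\ \ H=\nabla^2_\theta \bar{D}_{KL}(\pi_{old},\pi_\theta)\big|_{\theta_{old}}\\
\end{aligned}
\]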
2.4 PPO
2.4.1 Core Formulas
PPO-Penalty:
\[
\begin{aligned}
&\max_\pi \mathbb{E}_{s\sim\rho^{\pi_{old}},a\sim \pi_{old}}\left[ \frac{\pi(a|s)}{\pi_{old}(a|s)}\cdot A^{\pi_{old}}(s,a) - \beta\cdot D_{KL} (\pi_{old}, \pi) \right] \\
&\begin{cases}
\beta_{k+1} = \beta_k/2 & D_{KL} (\pi_{old}, \pi)<\delta/1.5\\
\beta_{k+1} = \beta_k\cdot 2 & D_{KL} (\pi_{old}, \pi)>\delta\cdot 1.5\\
\beta_{k+1} = \beta_k & \text{otherwise}\\
\end{cases}
\end{aligned}
\]
PPO-Clip:
\[
\max_\pi \mathbb{E}_{s\sim\rho^{\pi_{old}},a\sim \pi_{old}}\left[ \min \left( \frac{\pi(a|s)}{\pi_{old}(a|s)}\cdot A^{\pi_{old}}(s,a), \text{clip}\left(\frac{\pi(a|s)}{\pi_{old}(a|s)},1-\varepsilon, 1+\varepsilon\right)\cdot A^{\pi_{old}}(s,a) \right )\right]
\]
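A minimal PyTorch-style sketch of the clipped objective, flipped into a loss to minimize (the `log_probs_new`/`log_probs_old`/`advantages` tensors are assumed pre-computed):
```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Negative of E[min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)]."""
    ratio = torch.exp(log_probs_new - log_probs_old.detach())   # pi(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```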
2.5 DDPG
2.5.1 Core Formulas
The deterministic policy gradient theorem:
\[
\begin{aligned}
J(\mu_\theta)
&=\mathbb{E}_{s\sim\textcolor{red}{\rho^{\hat{\pi}}}}\left[ Q_\omega(s,\mu_\theta(s)) \right]\\
\nabla_\theta J(\mu_\theta)
&= \mathbb{E}_{s\sim\rho^{\hat{\pi}}}\left[\frac{\partial Q_\omega(s,\mu_\theta(s))}{\partial \theta}\right]\\
&= \mathbb{E}_{s\sim\rho^{\hat{\pi}}}\left[\frac{\partial Q_\omega(s,\mu_\theta(s))}{\partial \mu_\theta(s)}\cdot \frac{\partial \mu_\theta(s)}{\partial \theta}\right]\\
&= \mathbb{E}_{s\sim\rho^{\hat{\pi}}}\left[\nabla_\theta \mu_\theta(s)\cdot \nabla_a Q_\omega(s,a)\big|_{a=\mu_\theta(s)}\right] \\
\end{aligned}
\]
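A minimal PyTorch-style sketch of the resulting actor update: rather than forming \(\nabla_\theta\mu_\theta\cdot\nabla_a Q_\omega\) by hand, autograd differentiates \(Q_\omega(s,\mu_\theta(s))\) with respect to \(\theta\) directly (the `actor`, `critic`, and `states` objects are assumptions):
```python
import torch

def ddpg_actor_loss(actor, critic, states):
    """Maximize E_s[Q_w(s, mu_theta(s))] by minimizing its negative."""
    actions = actor(states)                  # a = mu_theta(s), differentiable in theta
    return -critic(states, actions).mean()   # autograd applies the chain rule grad_theta(mu) * grad_a(Q)

# only the actor optimizer is stepped with this loss; the critic is trained from its own TD loss
```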
2.5.2 On Being Off-Policy
DDPG is closely related to Q-Learning and shares the same motivation: if the optimal action-value function \(Q^*(s,a)\) is known, then in any given state \(s\) the optimal action is \(a^*=\arg \max_a Q^*(s,a)\).
DDPG assumes \(Q\) is differentiable with respect to \(a\) and parameterizes the optimal action \(a^*\) as \(\mu_\theta(s)\), giving the approximation \(\max_a Q^*(s,a)\approx Q_\omega(s,\mu_\theta(s))\).
Hence, like DQN, DDPG is off-policy.
2.6 SAC
2.6.1 Core Formulas
Objective: maximum-entropy reinforcement learning:
\[
\pi^*_{MaxEnt}=\arg \max_\pi \sum_t \mathbb{E}_{(s_t,a_t)\sim \rho^\pi}\left[ r(s_t,a_t)+\alpha\cdot \textcolor{red}{H(\pi(\cdot|s_t))}\right]
\]
Soft Q-function:
\[
\begin{aligned}
Q_{soft}(s_t,a_t)&=r(s_t,a_t)+\mathbb{E}_{\tau\sim\pi}\left[\sum_{l=1}^\infty \gamma^l\cdot [r(s_{t+l},a_{t+l})+\alpha\cdot H(\pi(\cdot|s_{t+l}))]\right]\\
&\Updownarrow\\
r_{soft}(s_t,a_t)&=r(s_t,a_t)+\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1}}\left[H(\pi(\cdot|s_{t+1}))\right]\\
\end{aligned}
\]
Soft Bellman Equation:
\[
\begin{aligned}
Q_{soft}(s_t,a_t)&=\textcolor{red}{r_{soft}(s_t,a_t)}+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]+\textcolor{red}{\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1}}\left[H(\pi(\cdot|s_{t+1}))\right]}\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]+\textcolor{red}{\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[-\log \pi(a_{t+1}|s_{t+1})\right]}\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[\textcolor{green}{Q_{soft}(s_{t+1},a_{t+1})-\alpha\cdot \log(\pi(a_{t+1}|s_{t+1}))}\right] \\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1}}\left[\textcolor{blue}{V_{soft}(s_{t+1})}\right]\\
&\Downarrow\\
V_{soft}(s_t)&=\mathbb{E}_{a_t}\left[Q_{soft}(s_t,a_t)-\alpha\cdot\log\pi(a_t|s_t)\right]\\
\end{aligned}
\]
Policy Improvement:
\[
\begin{aligned}
\pi_{new}=\arg\min_{\pi'\in \Pi} \mathrm{D}_{\mathrm{KL}}\left(\pi'(\cdot|s_t)\bigg\|\frac{\exp(Q^{\pi_\mathrm{old}}(s_t,\cdot))}{Z^{\pi_\mathrm{old}}(s_t)}\right)
\end{aligned}
\]
Training Objectives:
\[
\begin{aligned}
J_{V}(\psi)&=\mathbb{E}_{s_t\sim D}\left[\frac{1}{2}\left(V_\psi(s_t)-\mathbb{E}_{a_t\sim\pi_\phi}\left[Q_\theta(s_t,a_t)-\log\pi_\phi(a_t|s_t)\right]\right)^2\right]\\
J_{Q}(\theta)&=\mathbb{E}_{(s_t,a_t)\sim D}\left[\frac{1}{2}\left(Q_\theta(s_t,a_t)-\hat{Q}_\theta(s_t,a_t)\right)^2\right]\\
J_\pi(\phi)&=\mathbb{E}_{s_t\sim D}\left[\text{D}_{\text{KL}}\left(\pi_{\phi}(\cdot|s_t)\bigg\|\frac{\exp (Q_\theta(s_t,\cdot))}{Z_\theta(s_t)}\right)\right]
\end{aligned}
\]
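A hedged PyTorch-style sketch of how these three objectives are commonly implemented with reparameterized actions (the network modules, the `policy.sample` API, the temperature `alpha`, and the replay batch are all assumptions; \(J_\pi\) is written in its equivalent form \(\mathbb{E}[\alpha\log\pi - Q]\) rather than as an explicit KL):
```python
import torch

def sac_losses(value_net, q_net, policy, batch, alpha=0.2, gamma=0.99):
    s, a, r, s2, done = batch                         # replay-buffer batch (assumed tensors)

    # J_V: V_psi(s) regresses E_{a~pi}[Q(s, a) - alpha * log pi(a|s)].
    a_pi, logp = policy.sample(s)                     # reparameterized action + log-prob (assumed API)
    v_target = (q_net(s, a_pi) - alpha * logp).detach()
    loss_v = 0.5 * (value_net(s) - v_target).pow(2).mean()

    # J_Q: soft Bellman target r + gamma * V(s'); the paper uses a slowly-updated target value network here.
    q_backup = (r + gamma * (1.0 - done) * value_net(s2)).detach()
    loss_q = 0.5 * (q_net(s, a) - q_backup).pow(2).mean()

    # J_pi: the KL projection, equivalent to minimizing E[alpha * log pi(a|s) - Q(s, a)].
    loss_pi = (alpha * logp - q_net(s, a_pi)).mean()
    return loss_v, loss_q, loss_pi
```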
2.7 TD3
*2.8 Soft Q-Learning
Objective: maximum-entropy reinforcement learning:
\[
\pi^*_{MaxEnt}=\arg \max_\pi \sum_t \mathbb{E}_{(s_t,a_t)\sim \rho^\pi}\left[ r(s_t,a_t)+\alpha\cdot \textcolor{red}{H(\pi(\cdot|s_t))}\right]
\]
The Boltzmann distribution describes the probability of a particle occupying a particular state, as a function of the state's energy and the system temperature. Let \(p_\alpha\) be the probability of the particle being in state \(\alpha\), \(\mathcal{E}_\alpha\) the energy of state \(\alpha\), \(k\) the Boltzmann constant, and \(T\) the system temperature; then:
\[
p_\alpha=\frac{1}{Z}\exp \left( \frac{-\mathcal{E}_\alpha}{k \ T} \right)
\]
Here \(\exp\left(\frac{-\mathcal{E}_\alpha}{k \ T} \right)\) is the (unnormalized) Boltzmann factor and \(Z=\sum_\alpha \exp \left( \frac{-\mathcal{E}_\alpha}{k \ T} \right)\) is the partition function (the sum over all states). Under the Boltzmann distribution, lower-energy states are occupied with higher probability.
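A small numerical illustration of the last statement, assuming \(k\,T=1\):
```python
import numpy as np

energies = np.array([0.5, 1.0, 3.0])   # E_alpha for three states
factors = np.exp(-energies)            # Boltzmann factors with k*T = 1
p = factors / factors.sum()            # normalize by the partition function Z
print(p)                               # ~[0.59, 0.36, 0.05]: lowest energy -> highest probability
```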
Soft Q-function:
\[
\begin{aligned}
Q_{soft}(s_t,a_t)&=r(s_t,a_t)+\mathbb{E}_{\tau\sim\pi}\left[\sum_{l=1}^\infty \gamma^l\cdot [r(s_{t+l},a_{t+l})+\alpha\cdot H(\pi(\cdot|s_{t+l}))]\right]\\
&\Updownarrow\\
r_{soft}(s_t,a_t)&=r(s_t,a_t)+\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1}}\left[H(\pi(\cdot|s_{t+1}))\right]\\
\end{aligned}
\]
Soft V-function:
\[
\begin{aligned}
V_{soft}(s_t)&=\alpha\cdot \log \int \exp\left(\frac{1}{\alpha}Q_{soft}(s_t,a)\right)\mathrm{d}a\\
&\overset{\alpha=1}{=}\log \int \exp (Q_{soft}(s_t,a))\mathrm{d}a \rightarrow \text{LogSumExp, Soft Maximum}
\end{aligned}
\]
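A small numerical check that \(\alpha\log\int\exp(Q/\alpha)\) behaves as a soft maximum, using a discrete action set for simplicity (assumes `scipy` is available):
```python
import numpy as np
from scipy.special import logsumexp

q = np.array([1.0, 2.0, 3.0])               # Q_soft(s, a) over three discrete actions
for alpha in (1.0, 0.1, 0.01):
    v_soft = alpha * logsumexp(q / alpha)   # alpha * log sum_a exp(Q / alpha)
    print(alpha, v_soft)                    # ~3.41, then ever closer to max(q) = 3.0 as alpha shrinks
```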
Soft Bellman Equation:
\[
\begin{aligned}
Q_{soft}(s_t,a_t)&=\textcolor{red}{r_{soft}(s_t,a_t)}+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]+\textcolor{red}{\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1}}\left[H(\pi(\cdot|s_{t+1}))\right]}\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[Q_{soft}(s_{t+1},a_{t+1})\right]+\textcolor{red}{\gamma\cdot \alpha \cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[-\log \pi(a_{t+1}|s_{t+1})\right]}\\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1},a_{t+1}}\left[\textcolor{green}{Q_{soft}(s_{t+1},a_{t+1})-\alpha\cdot \log(\pi(a_{t+1}|s_{t+1}))}\right] \\
&=r(s_t,a_t)+\gamma\cdot \mathbb{E}_{s_{t+1}}\left[\textcolor{blue}{V_{soft}(s_{t+1})}\right]\\
&\Downarrow\\
V_{soft}(s_t)&=\mathbb{E}_{a_t}\left[Q_{soft}(s_t,a_t)-\alpha\cdot\log\pi(a_t|s_t)\right]\\
\end{aligned}
\]
Policy: intuitively, we want the policy distribution to mirror the shape of the Q-function as closely as possible, i.e. \(\pi(a|s)\propto \exp Q(s,a)\)
\[
\begin{aligned}
\mathcal{E}(s_t,a_t)&=-\frac{1}{\alpha}\cdot Q_{soft}(s_t,a_t)\\
\pi(a_t|s_t)&\propto\exp(-\mathcal{E}(s_t,a_t))\\
&\propto\exp\left(\frac{1}{\alpha}Q_{soft}(s_t,a_t)\right)\\
\pi(a_t|s_t)&=\exp\left(\frac{1}{\alpha}(Q_{soft}(s_t,a_t)-V_{soft}(s_t))\right)
\end{aligned}
\]
Soft Q-Iteration:
\[
\begin{aligned}
Q_{soft}(s_t,a_t)&\leftarrow r(s_t,a_t)+\gamma\cdot\mathbb{E}_{s_{t+1}}\left[V_{soft}(s_{t+1})\right]\\
V_{soft}(s_t)&\leftarrow \alpha\log\int\exp\left(\frac{1}{\alpha}Q_{soft}(s_t,a)\right)\mathrm{d}a\\
\end{aligned}
\]
- The \(V_{soft}\) update requires integrating over the entire action space
- Parameterize \(Q_{soft}^\theta(s_t,a_t)\) with a neural network
- Convert it into a stochastic optimization problem via importance sampling (see the sketch at the end of this subsection)
\[
\begin{aligned}
V_{soft}^\theta(s_t)&=\alpha\log\mathbb{E}_{q_{a'}}\left[\frac{\exp (\frac{1}{\alpha}Q_{soft}^\theta(s_t,a'))}{q_{a'}(a')}\right]\\
&\Downarrow\\
J_Q(\theta)&=\mathbb{E}_{s_t\sim \textcolor{red}{q_{s_t}}, a_t\sim \textcolor{blue}{q_{a_t}}}\left[\frac{1}{2}\left(\hat{Q}_{soft}^\theta(s_t,a_t)-Q_{soft}^\theta(s_t,a_t)\right)^2\right]
\end{aligned}
\]
- The optimal policy follows an energy-based distribution, which is hard to sample from directly (the Soft Q-Learning paper trains an amortized sampling network with SVGD for this)
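A minimal `numpy` sketch of the importance-sampled estimate of \(V_{soft}\) above, for a one-dimensional action with a uniform proposal \(q\) (the `q_soft` callable, the action bounds, and \(\alpha\) are assumptions):
```python
import numpy as np

def soft_value_estimate(q_soft, state, alpha=1.0, low=-1.0, high=1.0, n=256, rng=None):
    """V_soft(s) ~ alpha * log E_{a'~q}[exp(Q_soft(s, a') / alpha) / q(a')], with q = Uniform(low, high)."""
    rng = rng or np.random.default_rng()
    a = rng.uniform(low, high, size=n)                  # proposal samples a' ~ q
    q_density = 1.0 / (high - low)                      # uniform proposal density q(a')
    weights = np.exp(q_soft(state, a) / alpha) / q_density
    return alpha * np.log(weights.mean())
```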