1 Basic Concepts
1.1 MDP
Discounted return: \(G_t=R_t+\gamma\cdot R_{t+1}+\cdots+\gamma^{n-t}\cdot R_{n}\)
State value function: \(V^\pi(s_t)=\mathbb{E}_{A_t\sim \pi(\cdot|s_t)}[Q^\pi (s_t,A_t)]=\sum_{a\in A}\pi(a|s_t)\cdot Q^\pi(s_t,a)\)
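A minimal numerical sketch of the two definitions above (the rewards, Q-values, and policy probabilities are made-up illustrative numbers, not from any particular MDP):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Finite-horizon discounted return G_t = R_t + gamma*R_{t+1} + ... + gamma^{n-t}*R_n."""
    g = 0.0
    for r in reversed(rewards):      # accumulate backwards: G_t = R_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0, -1.0], gamma=0.9))   # 1.891

# V^pi(s) as the policy-weighted average of action values:
pi_s = np.array([0.2, 0.5, 0.3])     # pi(.|s) over 3 actions (illustrative)
q_s  = np.array([1.0, 0.5, -0.2])    # Q^pi(s, .)             (illustrative)
v_s  = np.dot(pi_s, q_s)             # V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
print(v_s)
```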
Bellman expectation equations:
\[
\begin{aligned}
Q^\pi(s_t,a_t)&=
\begin{cases}
\mathbb{E}_{S_{t+1},A_{t+1}}[R_t+\gamma\cdot Q^\pi(S_{t+1},A_{t+1})|S_t=s_t,A_t=a_t]&Q^\pi\rightarrow Q^\pi\\
\mathbb{E}_{S_{t+1}}[R_t+\gamma\cdot V^\pi(S_{t+1})|S_t=s_t,A_t=a_t]&Q^\pi\rightarrow V^\pi\\
\end{cases}
\\
\\
V^\pi(s_t)&=\mathbb{E}_{S_{t+1},A_t}[R_t+\gamma\cdot V^\pi(S_{t+1})|S_t=s_t]
\end{aligned}
\]
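A tabular sketch of the V-form expectation backup, i.e. iterative policy evaluation; the (S, A, S) transition tensor `P`, (S, A) reward table `R`, and (S, A) policy table `pi` are hypothetical placeholders:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate V(s) = sum_a pi(a|s) * ( r(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ).

    P  : (S, A, S) transition probabilities P(s'|s,a)
    R  : (S, A)    expected immediate rewards r(s,a)
    pi : (S, A)    policy probabilities pi(a|s)
    """
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v            # Q^pi(s,a) backup given the current V estimate
        v_new = (pi * q).sum(axis=1)     # V^pi(s) = E_{a~pi}[Q^pi(s,a)]
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```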
Bellman optimality equations:
\[
\begin{aligned}
Q^*(s_t,a_t)
&=\max_{\pi} Q^\pi(s_t,a_t)\\
&=r(s_t,a_t)+\gamma\cdot\sum_{s_{t+1}\in S}P(s_{t+1}|s_t,a_t)\cdot \max_{a_{t+1}\in A} Q^*(s_{t+1}, a_{t+1})
\\
\\
V^*(s_t)
&=\max_{\pi} V^\pi(s_t)\\
&=\max_{a_{t}\in A} \left (r(s_t,a_t)+\gamma\cdot\sum_{s_{t+1}\in S}P(s_{t+1}|s_t,a_t)\cdot V^*(s_{t+1})\right )
\end{aligned}
\]
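The V-form optimality equation gives value iteration directly; a sketch under the same hypothetical `P` / `R` layout as above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v            # Q*(s,a) backup given the current V estimate
        v_new = q.max(axis=1)            # greedy action value
        if np.max(np.abs(v_new - v)) < tol:
            # acting greedily w.r.t. the converged Q gives an optimal deterministic policy
            return v_new, q.argmax(axis=1)
        v = v_new
```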
State visitation distribution: \(\mu^\pi(s)=(1-\gamma)\sum_{t=0}^\infty \gamma^t\cdot P_t^\pi(s)\)
Occupancy measure: \(\rho^\pi(s,a)=\mu^\pi(s)\cdot \pi(a|s)=(1-\gamma)\sum_{t=0}^\infty \gamma^t\cdot P_t^\pi(s)\cdot \pi(a|s)\)
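Both quantities can be approximated by Monte Carlo rollouts, weighting the state–action pair visited at step \(t\) by \((1-\gamma)\gamma^t\); a rough sketch assuming a hypothetical `env` with `reset() -> s`, `step(a) -> (s, r, done)`, and a `policy(s)` that samples from \(\pi(\cdot|s)\):

```python
from collections import defaultdict

def estimate_occupancy(env, policy, gamma=0.9, n_episodes=1000, horizon=200):
    """Monte Carlo estimate of rho^pi(s,a) = (1-gamma) * sum_t gamma^t * P_t(s) * pi(a|s).

    env, policy, and the truncation horizon are hypothetical; truncating the
    infinite sum at `horizon` introduces a small bias of order gamma^horizon.
    """
    rho = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        discount = 1.0
        for _ in range(horizon):
            a = policy(s)
            rho[(s, a)] += (1 - gamma) * discount / n_episodes
            s, r, done = env.step(a)
            discount *= gamma
            if done:
                break
    # mu^pi(s) is recovered by summing rho^pi(s, a) over actions a
    return dict(rho)
```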
1.2 Categories of Reinforcement Learning
- Model-based
- Model-free
- Value-based
- Policy-based
- Actor-Critic
1.3 On / Off-Policy
- Behavior policy: the policy that controls how the agent interacts with the environment while collecting the experience used to learn the policy function
  - collects experience (i.e., the observed states, actions, and rewards)
- Target policy: the policy function obtained from training
  - controls the agent's actual actions
On-policy learning uses the same policy as both the behavior policy and the target policy.
Off-policy learning uses different policies for the behavior policy and the target policy.
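The difference is easiest to see in tabular TD control: SARSA (on-policy) bootstraps from the action the behavior policy actually takes next, while Q-learning (off-policy) bootstraps from the greedy target policy. A sketch assuming a hypothetical Q-table `q` of shape (S, A):

```python
import numpy as np

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: the target uses a_next, the action the behavior policy actually took.
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (td_target - q[s, a])

def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy: the target uses the greedy target policy max_a' Q(s', a'),
    # regardless of which action the behavior policy will take next.
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])
```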