
1 Basic Concepts

1.1 MDP

Discounted return: \(G_t=R_t+\gamma\cdot R_{t+1}+\cdots+\gamma^{n-t}\cdot R_{n}\)
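A minimal sketch of computing this quantity by the backward recursion \(G_t = R_t + \gamma\cdot G_{t+1}\); the `rewards` list and `gamma` value are illustrative assumptions, not taken from the text:

```python
def discounted_return(rewards, gamma):
    """Discounted return of a finite reward sequence R_t, R_{t+1}, ..., R_n."""
    g = 0.0
    # Work backwards so each step applies G_t = R_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards observed along one trajectory
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```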

State-value function: \(V^\pi(s_t)=\mathbb{E}_{A_t\sim \pi(\cdot|s_t)}[Q^\pi (s_t,A_t)]=\sum_{a\in A}\pi(a|s_t)\cdot Q^\pi(s_t,a)\)
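A small tabular sketch of this expectation, assuming hypothetical dictionaries `pi_s` (action probabilities under \(\pi(\cdot|s)\)) and `q_s` (action values \(Q^\pi(s,\cdot)\)):

```python
def state_value(pi_s, q_s):
    """V^pi(s) = sum_a pi(a|s) * Q^pi(s, a), tabular case.

    pi_s: dict mapping action -> probability under pi(.|s)
    q_s:  dict mapping action -> Q^pi(s, a)
    """
    return sum(prob * q_s[a] for a, prob in pi_s.items())

# Example with two actions
pi_s = {"left": 0.3, "right": 0.7}
q_s = {"left": 1.0, "right": 2.0}
print(state_value(pi_s, q_s))  # 0.3*1.0 + 0.7*2.0 = 1.7
```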

Bellman expectation equations

\[ \begin{aligned} Q^\pi(s_t,a_t)&= \begin{cases} \mathbb{E}_{S_{t+1},A_{t+1}}[R_t+\gamma\cdot Q^\pi(S_{t+1},A_{t+1})|S_t=s_t,A_t=a_t]&Q^\pi\rightarrow Q^\pi\\ \mathbb{E}_{S_{t+1}}[R_t+\gamma\cdot V^\pi(S_{t+1})|S_t=s_t,A_t=a_t]&Q^\pi\rightarrow V^\pi\\ \end{cases} \\ \\ V^\pi(s_t)&=\mathbb{E}_{S_{t+1},A_t}[R_t+\gamma\cdot V^\pi(S_{t+1})|S_t=s_t] \end{aligned} \]
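One standard use of the expectation backup is iterative policy evaluation on a known tabular MDP. The sketch below is illustrative only; the array names `P`, `R`, `pi` and their shapes are assumptions, not from the text:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation with the Bellman expectation backup:
    V(s) <- sum_a pi(a|s) * ( r(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ).

    P:  array [S, A, S], transition probabilities P(s'|s,a)
    R:  array [S, A],    expected immediate reward r(s,a)
    pi: array [S, A],    policy probabilities pi(a|s)
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]
        Q = R + gamma * P @ V          # shape [S, A]
        V_new = (pi * Q).sum(axis=1)   # V(s) = E_{a~pi}[Q(s,a)]
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```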

Bellman optimality equations

\[ \begin{aligned} Q^*(s_t,a_t) &=\max_{\pi} Q^\pi(s_t,a_t)\\ &=r(s_t,a_t)+\gamma\cdot\sum_{s_{t+1}\in S}P(s_{t+1}|s_t,a_t)\cdot \max_{a_{t+1}\in A} Q^*(s_{t+1}, a_{t+1}) \\ \\ V^*(s_t) &=\max_{\pi} V^\pi(s_t)\\ &=\max_{a_{t}\in A} \left (r(s_t,a_t)+\gamma\cdot\sum_{s_{t+1}\in S}P(s_{t+1}|s_t,a_t)\cdot V^*(s_{t+1})\right ) \end{aligned} \]
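Replacing the expectation over \(\pi\) with a max gives value iteration. The sketch below reuses the assumed tabular arrays from the previous example and is a rough illustration, not a definitive implementation:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration using the Bellman optimality backup:
    V(s) <- max_a ( r(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ).

    Shapes as in the policy-evaluation sketch: P is [S, A, S], R is [S, A].
    """
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V       # Q*(s,a) backup, shape [S, A]
        V_new = Q.max(axis=1)       # V*(s) = max_a Q*(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    greedy_policy = Q.argmax(axis=1)  # deterministic greedy policy w.r.t. Q*
    return V_new, greedy_policy
```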

State visitation distribution: \(\mu^\pi(s)=(1-\gamma)\sum_{t=0}^\infty \gamma^t\cdot P_t^\pi(s)\)

Occupancy measure: \(\rho^\pi(s,a)=\mu^\pi(s)\cdot \pi(a|s)=(1-\gamma)\sum_{t=0}^\infty \gamma^t\cdot P_t^\pi(s)\cdot \pi(a|s)\)
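Both quantities can be estimated by Monte Carlo from trajectories sampled under \(\pi\). The sketch below assumes a hypothetical `trajectories` list of state-action sequences and truncates the infinite sum at each trajectory's length:

```python
from collections import defaultdict

def estimate_occupancy(trajectories, gamma):
    """Monte Carlo estimate of the occupancy measure:
    rho^pi(s, a) ~ average over trajectories of (1 - gamma) * sum_t gamma^t * 1[(s_t, a_t) = (s, a)].
    Summing over a recovers an estimate of mu^pi(s).

    trajectories: list of [(s_0, a_0), (s_1, a_1), ...] sampled by running pi.
    """
    rho = defaultdict(float)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            rho[(s, a)] += (1 - gamma) * gamma ** t
    for key in rho:
        rho[key] /= len(trajectories)
    return rho  # mu^pi(s) = sum_a rho[(s, a)]
```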

1.2 Categories of Reinforcement Learning

  • Model-based
  • Model-free
  • Value-based
  • Policy-based
  • Actor-Critic

1.3 On / Off-Policy

  • Behavior policy: the policy that controls how the agent interacts with the environment while collecting the experience used to learn the policy function
    • collects experience (i.e., the observed states, actions, and rewards)
  • Target policy: the policy function obtained from training
    • controls the agent's actual actions

On-policy: reinforcement learning in which the behavior policy and the target policy are the same policy.

Off-policy: reinforcement learning in which the behavior policy and the target policy are different policies.
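The distinction is often illustrated with SARSA (on-policy) versus Q-learning (off-policy): both can act with the same ε-greedy behavior policy, but their TD targets come from different target policies. A rough sketch with assumed tabular `Q` dictionaries:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Behavior policy: explore with probability eps, otherwise act greedily."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# On-policy (SARSA): the TD target uses the action a' actually chosen by the
# same epsilon-greedy behavior policy, so behavior policy == target policy.
def sarsa_target(Q, r, s_next, a_next, gamma):
    return r + gamma * Q[(s_next, a_next)]

# Off-policy (Q-learning): the TD target uses max_a' Q(s', a'), i.e. a greedy
# target policy that differs from the epsilon-greedy behavior policy.
def q_learning_target(Q, r, s_next, actions, gamma):
    return r + gamma * max(Q[(s_next, a)] for a in actions)
```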