[Hung-yi Lee 2020 ML/DL] P115-117 Actor-Critic & Sparse Reward & Imitation Learning

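I have two years of ML experience. I use this course series mainly to fill gaps, so I note down details and things I did not know before.

I once spent half a year specifically studying and practicing reinforcement learning, so these notes only record Prof. Lee's outline. My reinforcement learning resource repository:
https://github.com/PiperLiu/Reinforcement-Learning-practice-zh (see also my collection of reinforcement learning posts on CSDN)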

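Overview of This Section

First comes Actor-Critic, with a focus on A3C.
We start by reviewing PG. In PG, training is unstable when there are too few samples, so the expected value is used in place of the actually sampled value. This finally leads to the parallel method A3C.
Next comes pathwise derivative policy gradient. The tricks that work well in Q-learning can all be used here.
The next part covers the sparse reward problem, in three parts.
The first is reward shaping, which includes the technique of curiosity.
The second is curriculum learning.
The third is hierarchical reinforcement learning.
Finally comes imitation learning, which discusses what to do when there is no reward at all. There are two approaches.
The first is Behavior Cloning, which, however, causes many problems.
The second is Inverse Reinforcement Learning (IRL).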


Details

Actor-Critic

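In PG, the sampled return is noisy when there are too few samples, so training is unstable. The idea is therefore to replace the actually sampled value with its expectation, that is, to use Q - V.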

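However, this may require training two networks (one for Q and one for V). We apply a transformation using the formula $Q_t = E[r_t + V(s_{t+1})]$. Dropping the expectation gives $Q_t \approx r_t + V(s_{t+1})$, so only V needs to be estimated. The price is that this introduces a random quantity, namely $r_t$.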

Advantage Actor-Critic

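Use $\pi$ to interact with the environment and collect data, estimate $V$ from the data with TD or MC, and then update $\pi$ with a policy gradient step in which each action is weighted by the advantage $r_t + V(s_{t+1}) - V(s_t)$.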
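To make the update signal concrete, here is a minimal sketch of the one-step TD advantage $r_t + V(s_{t+1}) - V(s_t)$. The function name `td_advantages`, the discount `gamma` (the formula in these notes corresponds to `gamma = 1.0`), and the toy numbers are my own assumptions, not something from the lecture.

```python
def td_advantages(rewards, values, gamma=1.0):
    """One-step TD advantage: r_t + gamma * V(s_{t+1}) - V(s_t).

    rewards: r_0 .. r_{T-1}
    values:  critic estimates V(s_0) .. V(s_T), one entry longer than rewards
    gamma:   discount factor (hypothetical addition; 1.0 matches the notes)
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    return [
        r + gamma * values[t + 1] - values[t]
        for t, r in enumerate(rewards)
    ]


# Toy episode with made-up numbers; the terminal state has value 0.
advantages = td_advantages(rewards=[1.0, 0.0, 2.0], values=[0.5, 1.0, 0.0, 0.0])
print(advantages)
```

A positive entry means the action did better than the critic expected, so its log-probability is pushed up; a negative entry pushes it down.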

Tips

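The actor $\pi(s)$ and the critic $V^\pi(s)$ can share parameters, for example the early layers of the network.

In addition, entropy regularization can be applied to the output of $\pi(s)$: a larger entropy is preferred, which encourages exploration.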
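The entropy term can be sketched in a few lines. This is only an illustration under my own assumptions: the helper names `entropy` and `regularized_objective` and the weight `beta` are hypothetical, and a real implementation would compute this on the network's output inside an autodiff framework.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(expected_advantage, probs, beta=0.01):
    """Quantity to maximize: policy-gradient term plus an entropy bonus.

    beta is a hypothetical weight chosen only for illustration.
    """
    return expected_advantage + beta * entropy(probs)


uniform = [0.5, 0.5]        # maximally uncertain over two actions
deterministic = [1.0, 0.0]  # always picks the first action
print(entropy(uniform), entropy(deterministic))
```

With the same advantage term, the uniform policy scores higher than the deterministic one, which is exactly the pressure toward exploration that the regularizer is meant to add.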

Asynchronous

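Several workers run at the same time (one worker per CPU). Each worker copies the global parameters, interacts with the environment in parallel, computes gradients, and sends them back to update the global parameters.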
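The copy / compute / push-back pattern can be imitated without real threads. The sketch below is a deterministic toy of my own, not A3C itself: `toy_gradient` stands in for a worker's policy gradient, a single number `global_theta` stands in for the network, and the rounds simulate workers whose gradients are computed on slightly stale copies.

```python
def toy_gradient(theta):
    """Gradient of the toy loss (theta - 3)^2; stands in for a worker's gradient."""
    return 2.0 * (theta - 3.0)


def run_async_rounds(num_workers=4, rounds=30, lr=0.05):
    global_theta = 0.0
    for _ in range(rounds):
        # Every worker copies the current global parameter.
        local_copies = [global_theta for _ in range(num_workers)]
        # Each worker computes a gradient on its own copy and pushes it back.
        # Later pushes in the same round hit a global value that has already
        # moved, which mimics the staleness of asynchronous updates.
        for local_theta in local_copies:
            global_theta -= lr * toy_gradient(local_theta)
    return global_theta


final_theta = run_async_rounds()
print(final_theta)
```

Even though every gradient except the first in each round is computed on an outdated copy, the global parameter still settles close to the minimizer 3.0 in this toy setting.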

Pathwise derivative policy gradient

In pathwise derivative policy gradient, the critic not only tells the actor whether an action is good or bad, but also tells it which action to take.

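This uses a GAN-like architecture. Q and the actor are trained separately. When training the actor, Q is held fixed, and the actor is updated by gradient ascent so that Q(s, π(s)) becomes larger.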
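Here is a minimal sketch of that actor step, with a fixed critic and gradient ascent through it. Everything concrete is an assumption made for illustration: the toy critic `q_value` (best action is `2 * state`), the one-parameter actor `action = w * state`, and the hand-written derivative in place of automatic differentiation.

```python
def q_value(state, action):
    """Fixed toy critic: highest when action == 2 * state."""
    return -(action - 2.0 * state) ** 2


def dq_daction(state, action):
    """Analytic derivative of q_value with respect to the action."""
    return -2.0 * (action - 2.0 * state)


def train_actor(states, steps=200, lr=0.1):
    w = 0.0  # actor parameter: action = w * state
    for _ in range(steps):
        # Chain rule: dQ/dw = dQ/da * da/dw, with da/dw = state.
        grad = sum(dq_daction(s, w * s) * s for s in states) / len(states)
        w += lr * grad  # gradient ascent on Q; the critic itself is not updated
    return w


w = train_actor([0.5, 1.0, 1.5])
print(w)
```

The actor never sees a reward here; it only follows the gradient that the frozen critic passes back through the action, which is what distinguishes this update from the usual policy gradient.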

The lecture slide gives the full algorithm.

In addition, the tricks that work well in Q-learning can all be used here.

When an action a is needed, simply take the output of $\pi(s)$; there is no need to solve an arg max over Q.

However, this method is very similar to GAN, and both are hard to train.

Sparse Reward

In most cases the agent receives no reward at all, which makes learning very difficult.

Reward Shaping

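To help the agent be far-sighted, humans need to intervene and steer it toward actions that pay off in the long run.

In short, instead of relying only on the reward from the environment, we design some additional rewards ourselves to guide the machine.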

The lecture slide shows some examples. Note that reward shaping can also mislead the machine.

Curiosity

In the curiosity technique, a new reward function is added: ICM (intrinsic curiosity module).

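In the most basic ICM, the machine predicts $\hat{s}_{t+1}$ itself and compares it with the real $s_{t+1}$; the larger the gap, the larger the ICM reward. Network 1, which makes this prediction, is trained separately.

However, some states are hard to predict without being important, so curiosity alone is not enough.

So we add a Network 2 and a feature extractor (Feature Ext) to filter out irrelevant changes such as grass moving in the wind. How is the Feature Ext trained? From the features of $(s_t, s_{t+1})$, Network 2 outputs a predicted action $\hat{a}_t$ (which action leads from $s_t$ to $s_{t+1}$), and $\hat{a}_t$ should be as close as possible to the real action $a_t$.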
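The curiosity bonus itself is easy to sketch once the features are given. In the toy below, the squared-error distance, the weight `beta`, and the function names are my own assumptions; the networks that would produce the predicted and actual feature vectors are left out entirely.

```python
def curiosity_reward(predicted_next, actual_next):
    """Squared prediction error between predicted and actual next-state features."""
    return sum((p - a) ** 2 for p, a in zip(predicted_next, actual_next))


def total_reward(extrinsic, predicted_next, actual_next, beta=0.1):
    """Environment reward plus a weighted curiosity bonus (beta is hypothetical)."""
    return extrinsic + beta * curiosity_reward(predicted_next, actual_next)


# A transition the predictor gets right earns no bonus ...
print(total_reward(0.0, [1.0, 2.0], [1.0, 2.0]))
# ... while a surprising transition earns a positive one.
print(total_reward(0.0, [1.0, 2.0], [1.0, 4.0], beta=0.5))
```

Computing the error in feature space rather than on raw states is what lets the feature extractor discard changes that no action could have caused.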

Curriculum Learning

Let the machine take lessons, progressing step by step from easy to hard.

Reverse Curriculum Generation

If you do not want to design the curriculum by hand, you can generate states and score their difficulty:

Given a goal state $s_g$.
Sample some states $s_1$ "close" to $s_g$.
Start from states $s_1$; each trajectory has reward $R(s_1)$.
Delete $s_1$ whose reward is too large (already learned) or too small (too difficult at this moment).
Sample $s_2$ from $s_1$, start from $s_2$.
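The filtering step in the list above can be sketched as follows. The thresholds `r_min` and `r_max`, the state labels, and the reward numbers are all made up for illustration; how states are sampled "close" to the goal is not covered.

```python
def keep_useful_starts(start_rewards, r_min, r_max):
    """Keep start states whose reward is neither too small nor too large.

    start_rewards: mapping from a start state s_1 to its reward R(s_1)
    r_min, r_max:  hypothetical thresholds for "too difficult" and "already learned"
    """
    return [s for s, r in start_rewards.items() if r_min <= r <= r_max]


# Made-up rewards for four candidate start states near the goal.
rewards_from_s1 = {"a": 0.95, "b": 0.50, "c": 0.02, "d": 0.30}
kept = keep_useful_starts(rewards_from_s1, r_min=0.1, r_max=0.9)
print(kept)
```

State "a" is dropped as already learned and "c" as too difficult for now; the next batch of start states $s_2$ would then be sampled around the states that remain.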

Hierarchical Reinforcement Learning

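The upper-level agent proposes a goal (a "vision"), and the lower-level agent tries to achieve it. The final objective is still to obtain reward. If the lower-level agent cannot achieve the goal set by the upper level, the upper level gets "disliked" and receives a penalty.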

Imitation Learning

Behavior Cloning

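Behavior Cloning is really the same as supervised learning: the states seen by the expert are the inputs, and the expert's actions are the labels.

But there are some problems:

The data are too limited, because the expert only visits a narrow set of states.
This motivates Dataset Aggregation.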

Dataset Aggregation

The goal is to collect more diverse data.

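Let $\pi_1$ drive on its own, and record what the expert would have done in each state it visits. The newly labeled data are then used to train the next policy $\pi_2$.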
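The loop (learner drives, expert labels, retrain on everything) can be sketched on a toy problem. All of the following is my own illustration: a 1-D line where the expert always steps toward state 0, a "policy" that simply memorizes expert labels, and a learner that stands still in states it has never seen.

```python
def expert_action(state):
    """Toy expert on a 1-D line: always step toward state 0."""
    if state > 0:
        return -1
    if state < 0:
        return 1
    return 0


def train(dataset):
    """Stand-in for supervised learning: memorize the expert label of each state."""
    return dict(dataset)


def rollout(policy, start, horizon=10):
    """Run the learned policy; in unseen states it takes action 0 and gets stuck."""
    states, state = [], start
    for _ in range(horizon):
        states.append(state)
        state += policy.get(state, 0)
    return states


def dagger(start, iterations):
    # The initial expert demonstration only covers states 2, 1, 0.
    dataset = [(s, expert_action(s)) for s in (2, 1, 0)]
    policy = train(dataset)
    for _ in range(iterations):
        visited = rollout(policy, start)                     # learner drives
        dataset += [(s, expert_action(s)) for s in visited]  # expert labels
        policy = train(dataset)                              # retrain on all data
    return policy


print(rollout(dagger(-3, 0), -3)[-1])  # pure behavior cloning, started at -3
print(rollout(dagger(-3, 3), -3)[-1])  # after three aggregation rounds
```

With pure behavior cloning the learner is stuck at the unseen start state -3; each aggregation round adds labels for exactly the states the learner itself reaches, and after three rounds it arrives at 0.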

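But this leads to another problem:

The machine copies every behavior of the human, including habits that are irrelevant to the task.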

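In addition, the machine has limited capacity and may not be able to learn everything; when the demonstrations are noisy or contradictory, it may end up learning the wrong behavior.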

Mismatch

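Moreover, in reinforcement learning the action a affects the states s that follow. The learned policy $\hat{\pi}$ may differ slightly from the expert policy $\pi^*$ being imitated, so the states it visits drift away from the training data: one wrong step leads to more wrong steps.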

Inverse Reinforcement Learning

The reward function is inferred backwards from the expert's demonstration data.

Framework of IRL

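This is like shooting the arrow first and drawing the target afterwards. The expert's behavior is assumed to be the best, and the reward function is trained toward that goal: the expert's trajectories should always score higher than the actor's.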

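This is in fact the GAN architecture: the actor plays the role of the generator, and the reward function plays the role of the discriminator.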
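The reward-function half of this loop can be sketched with a linear reward. This is a simplification of my own, not the lecture's method: each trajectory is summarized by a hypothetical feature vector, the reward is a dot product with `weights`, and one update step pushes the expert's score up and the actor's score down. The other half, where the actor is retrained by RL to maximize the learned reward, is omitted.

```python
def trajectory_reward(weights, features):
    """Linear reward of a trajectory summarized by a feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def reward_update(weights, expert_features, actor_features, lr=0.1):
    """One gradient-ascent step on R(expert) - R(actor) for a linear reward."""
    return [
        w + lr * (fe - fa)
        for w, fe, fa in zip(weights, expert_features, actor_features)
    ]


# Made-up features: the expert mostly shows feature 0, the actor feature 1.
expert, actor = [1.0, 0.0], [0.0, 1.0]
weights = reward_update([0.0, 0.0], expert, actor)
print(trajectory_reward(weights, expert), trajectory_reward(weights, actor))
```

After the step the expert's trajectory scores higher than the actor's, which is the "expert is always the best" condition; alternating this with actor training gives the generator / discriminator interplay described above.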

Example

IRL usually needs only a few demonstrations; in this example it learns a driving style from them.

Third Person Imitation Learning

In addition, imitation can also be learned from a third-person viewpoint.