1 Vanilla Policy Gradient的缺陷与新的优化目标

VPG优化策略的方法就是通过Policy Gradient进行梯度上升。梯度上升是对函数点的一阶近似，根据步长更新，一阶近似也就是超平面拟合局部，步长稍微大一点误差还是挺大的，容易优化出错。
强化学习对策略优化出错相当敏感。

对于监督学习的分类任务，网络的输入取自同一个训练集，也就是说输入信息具有相同的概率分布，无论网络参数怎么变，都不会影响输入的概率分布，也就是说，对优化出错不太敏感，损失函数曲线总是有波折，但总体还是稳定下降的状态。

对于强化学习任务，策略是直接决定了采样数据的，每一次更新策略都会影响后续的采样数据，也就是说网络参数改变会引起输入数据的概率分布发生变化。

优化出错得到差的策略，差的策略得到差的采样数据，差的采样数据难以进行质量好的策略更新。

因此对于强化学习任务，往往一次优化出错会导致学习进度崩溃，平均回报变得相当低，并且很难回升。总得来说，强化学习任务会更加依赖于更新的稳定性。

所以能够想到的是，牛顿法这种在局部找到确定的最优点进行更新的优化方法会更加适合于强化学习的策略更新。

在VPG中，优化目标是使累积回报最大。
$maximize_\pi J(\pi)=\mathbb{E}_{\tau\sim\pi}[\Sigma_{t=0}^\infty\gamma^tr(s_t)]$
在TRPO中，沿用牛顿法的思路，我希望新的策略一定能够使得策略性能提升，并且提升最大
$maximize_{\pi'}J(\pi')-J(\pi)$
其中 $\pi'$ 是待寻找的新策略， $\pi$ 是老策略。
可以证明 $J(\pi')-J(\pi)=\mathbb{E}_{\pi'}[\Sigma_{t=0}^{\infty}\gamma^tA^\pi(s_t,a_t)]$ ，注意 $A^\pi$ 符号表示使用 $\pi$ 进行采样得到的Advantage function。
$\begin{aligned} \mathbb{E}_{\pi'}[\Sigma_{t=0}^{\infty}\gamma^tA^\pi(s_t,a_t)]&=\mathbb{E}_{\pi'}[\Sigma_{t=0}^{\infty}\gamma^t(r_t+\gamma V_{t+1}-V_t)] \\ &= \mathbb{E}_{\pi'}[\Sigma_{t=0}^{\infty}\gamma^tr_t]+\mathbb{E}_{\pi'}[\Sigma_{t=0}^{\infty}\gamma^{t+1} V_{t+1}-\gamma^tV_t)] \\ &= J(\pi')+\mathbb{E}_{\pi'}[\Sigma_{t=1}^\infty \gamma^tV_t-\Sigma_{t=0}^\infty\gamma^tV_t] \\ &= J(\pi')+\mathbb{E}_{\pi'}[-V(s_0)] \\ &= J(\pi')-J(\pi) \end{aligned}$
其中 $\mathbb{E}_\pi$ 表示用 $\pi$ 生成的轨迹做期望。

继续对 $\mathbb{E}_{\pi'}[\Sigma_{t=0}^{\infty}\gamma^tA^\pi(s_t,a_t)]$ 进行一些改写。
记 $\eta(s_t=s|\pi)$ 为策略 $\pi$ 下 $t$ 时刻状态变为 $s$ 的转移概率，服从某个分布。
以此构造一个新的分布
$\rho_{\pi'}(s)=(1-\gamma)\Sigma_{t=0}^{\infty}\gamma^t\eta(s_t=s|\pi')$
显然对于任意 $s\in\mathbb{S}_{state\ space}$ ， $(1-\gamma)\Sigma_{t=0}^{\infty}\gamma^t\eta(s_t=s|\pi')\in[0,1]$
并且 $\Sigma_{s\in\mathbb{S}}(1-\gamma)\Sigma_{t=0}^{\infty}\gamma^t\eta(s_t=s|\pi')=(1-\gamma)(1+\gamma+\gamma^2+\cdots)=(1-\gamma)(\frac{1}{1-\gamma})=1$

构造这个分布是为了进一步化简之前的式子，只具有数学意义，不需要思考这个分布的现实意义。

于是有
$\begin{aligned} \mathbb{E}_{\pi'}[\Sigma_{t=0}^{\infty}\gamma^tA^\pi(s_t,a_t)]&=\Sigma_{t=0}^{\infty}\Sigma_s\eta(s|\pi')\Sigma_a\pi'(a|s)\cdot \gamma^tA^\pi(s,a) \\ &= \frac{1}{1-\gamma}\Sigma_s(1-\gamma)\Sigma_{t=0}^{\infty}\gamma^t\eta(s|\pi')\Sigma_a\pi'(a|s)\cdot A^\pi(s,a) \\ &= \frac{1}{1-\gamma}\mathbb{E}_{s\sim\rho_{\pi'},a\sim \pi'}[A^\pi(s,a)] \end{aligned}$
这个式子表明只要新策略能够使得Advantage function在轨迹上的期望大于零，就一定能使策略的性能提升。

但这个式子的实践意义不好，因为期望都是对新策略 $\pi'$ 进行的，这个策略是待求的，求都没有求出来怎么进行期望求取呢，所以要代换成关于 $\pi$ 的期望。

首先利用重要度采样(Importance Sampling)，可以将动作对 $\pi'$ 的依赖修改掉。
$\mathbb{E}_{s\sim\rho_{\pi'},a\sim \pi'}[A^\pi(s,a)]=\mathbb{E}_{s\sim\rho_{\pi'},a\sim \pi}[\frac{\pi'(a|s)}{\pi(a|s)}A^\pi(s,a)]$
其次只要保证每次更新比较小，只在 $\pi$ 的周围更新，那么 $\pi$ 也能当做对于 $\pi'$ 的近似。
$\mathbb{E}_{s\sim\rho_{\pi'},a\sim \pi}[\frac{\pi'(a|s)}{\pi(a|s)}A^\pi(s,a)]\approx\mathbb{E}_{s\sim\rho_{\pi},a\sim \pi}[\frac{\pi'(a|s)}{\pi(a|s)}A^\pi(s,a)]$
综上，得到在 $\pi$ 的附近，有 $J(\pi')$ 的近似
$L_\pi(\pi')=J(\pi)+\mathbb{E}_{s\sim\rho_{\pi},a\sim \pi}[\frac{\pi'(a|s)}{\pi(a|s)}A^\pi(s,a)]$

2 MM优化(Minorize-Maximization)与下界函数

MM优化的目的在于转移优化目标，实际优化一个代理函数，但是确保原函数也有提升。一般用于原函数很复杂不方便优化，但是代理函数很方便优化的场合。

暂时抛开强化学习，先看看MM优化做的事情。
为了增大 $f(\theta)$ ，寻找一个下界函数 $g(\theta|\theta^{(k)})$ ，使得
$g(\theta|\theta^{(k)})\le f(\theta),\forall\theta \in \Theta \\ g(\theta^{(k)}|\theta^{(k)}) = f(\theta^{(k)})$
此时只需要优化 $g(\theta|\theta^{(k)})$ ，令 $\theta^{(k+1)}=arg_\theta max \ g(\theta|\theta^{(k)})$
于是有
$f(\theta^{(k+1)})\ge g(\theta^{(k+1)}|\theta^{(k)})\ge g(\theta^{(k)}|\theta^{(k)})=f(\theta^{(k)})$
简介完成了对 $f(\theta)$ 的优化。

回到TRPO，原文中用了很多篇幅证明了下式
$J(\pi')\ge L_\pi(\pi')-CD^{max}_{KL}(\pi, \pi')，where\ C=\frac{4\epsilon\gamma}{(1-\gamma)^2}$
其中的max表示对所有s的分布的最大KL散度。这个式子的证明不是重点，假设已经得到了它。
可以发现 $L_\pi(\pi')-CD^{max}_{KL}(\pi, \pi')$ 就是 $J(\pi')$ 的下界函数，所以可以使用MM优化。
这一点是至关重要的，因为 $J(\pi')$ 与待求解的 $\pi'$ 有关，根本无法进行优化。
$J(\pi')\ge L_\pi(\pi')-CD^{max}_{KL}(\pi, \pi') \\ J(\pi)= L_\pi(\pi)-CD^{max}_{KL}(\pi, \pi)=J(\pi)-0$
由于C是不确定的参数，所以不妨把它看做对于KL散度的惩罚系数，对下界函数的优化问题可以转化成硬约束。
$maximize_{\theta'}L_\theta(\theta') \\ subject\ to\ D^{max}_{KL}(\theta, \theta') \le\delta$
从这里开始，把 $\pi$ 替换成了 $\theta$ ，因为策略完全取决于网络参数，但是并没有改变含义，这里的KL散度仍然指的是两个策略的KL散度。

$D^{max}_{KL}(\theta, \theta')$ 是无法获知的，毕竟所有的数据都是采样获得，所以原文提出用均值取代最大值，如果均值满足约束，那么最大值也是满足约束的。
$maximize_{\theta'}L_\theta(\theta') \\ subject\ to\ \bar D_{KL}(\theta, \theta') \le\delta$

3 求解优化

代入之前求解的 $L_\pi(\pi')$ ，得到
$maximize_{\theta'}\mathbb{E}_{s\sim\rho_{\pi},a\sim \pi}[\frac{\pi'(a|s)}{\pi(a|s)}A^\pi(s,a)]\\ subject\ to\ \bar D_{KL}(\theta, \theta') \le\delta$

求解这个优化问题。
对 $L_\pi(\pi')$ 在 $\pi$ 附近一阶展开
$L_\theta(\theta')\approx L_\theta(\theta)+\nabla_{\theta'}L_\theta(\theta')|_{\theta'=\theta}$
因此对策略进行梯度上升，且梯度为
$\begin{aligned} g&=\nabla_{\theta'}L_\theta(\theta')|_{\theta'=\theta} \\ &=\nabla_{\theta'}(J(\theta')-J(\theta))|_{\theta'=\theta} \\ &=\nabla_{\theta'}J(\theta')|_{\theta'=\theta} \end{aligned}$
就是策略梯度了。

对约束项做 $\theta$ 附近的二阶展开。
$\bar D_{KL}(\theta, \theta') =\bar D_{KL}(\theta, \theta)+\nabla_{\theta'}\bar D_{KL}(\theta, \theta')|_{\theta'=\theta}+\frac{1}{2}(\theta-\theta')^TH(\theta-\theta')$
由于取的是小领域内的极值，所以第一项和第二项都是零。约束近似为
$\frac{1}{2}(\theta'-\theta)^TH(\theta'-\theta)\le\delta$
其中H为KL散度的Hessian矩阵，其实就是Fisher Information矩阵。
优化问题的解为
$\theta'=\theta+\sqrt{\frac{2\delta}{x^THx}}x$
其中 $x=H^{-1}g$

因为要求解二阶微分的逆矩阵，所以运算量会特别大，对于参数越多的矩阵，运算量指数增加。
在TRPO中使用了Hessian vector product和conjugate gradient方法进行运算。
在ACKTR中使用了kronecker product拆分矩阵减小复杂度运算。
但这些方法都不是重点，所以可以单独去了解。

因此实际的学习过程与VPG不同

计算策略梯度
使用Hessian vector product计算Hx，使用conjugate gradient计算x
从j=0开始，计算新策略参数 $\theta'=\theta+\alpha^j\sqrt{\frac{2\delta}{x^THx}}x$ ，如果满足约束 $\frac{1}{2}(\theta'-\theta)^TH(\theta'-\theta)\le\delta$ 就对网络参数执行更新，否则j+=1。若j达到预设的最大值仍然不满足约束，就不进行更新
使用MSE损失更新价值函数

4 代码部分

4.1 处理网络参数

对于第一次看到TRPO，可能最困惑的就是计算过程中对于参数 $\theta$ 的各种运算，因为对于神经网络而言， $\theta$ 就是网络的各层参数，每一层就是一个张量的参数。
实际在算法中，会把网络参数提出来，展平成一个一维向量，也就是说具有2000000个参数的网络，会得到一个2000000长度的向量进行运算，而且Hessian矩阵的规模响应变成(2000000,2000000)，可以直观的感受到为什么网络一复杂，运算量就会显著增加。

因此相比与VPG中写的那些函数，TRPO需要有将网络参数提取，并且展平的函数，还有将展平后的参数打包回原本格式的函数。

# 将可迭代变量xs展平成一维向量
def flat_concat(xs):
    return torch.cat([torch.reshape(x, (-1,)) for x in xs])

# 根据给定的形状，将一维向量打包成指定形状
def concat_flat(x, shape_list):
    numel_list = [x.numel() for x in shape_list]
    pre = torch.split(x, numel_list)
    concat_x = [torch.reshape(x, s) for x, s in zip(pre, shape_list)]
    return concat_x
# pytorch的神经网络参数是以字典形式给出的，这里除了打包以外，多了一步封装成字典的操作
def concat_flat_for_nn(x, shape_list, key_list):
    concat_x = concat_flat(x, shape_list)
    return dict([(k,v) for k,v in zip(key_list, concat_x)])

4.2 实用算法

conjugate gradient，一般来说cg_iter=20时就能给出一个比较精确的估计了

def cg(Ax, b, cg_iter):
        '''
        Conjugate gradient algorithm\n
        for a given Ax=b, approximate x
        '''
        if use_cuda:
            b = b.detach().cpu().numpy()
        else:
            b = b.detach().numpy()
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        r_dot_old = np.dot(r,r)
        for _ in range(cg_iter):
            z = Ax(torch.as_tensor(p))
            alpha = r_dot_old / (np.dot(p, z) + EPS)
            x += alpha * p
            r -= alpha * z
            r_dot_new = np.dot(r, r)
            p = r + (r_dot_new / (r_dot_old+EPS)) * p
            r_dot_old = r_dot_new
        return x

Hessian vector product，避免了直接计算Hessian矩阵

def hessian_vector_product(f, v, model):
    '''
    H = grad**2 f, compute Hv
    '''
    g = torch.autograd.grad(f, model.parameters(), create_graph=True)
    # prod = sum([(g*v).sum() for g,v in zip(g, v)])
    g = flat_concat(g)
    prod = (g*v).sum()
    h = torch.autograd.grad(prod, model.parameters(), create_graph=True, allow_unused=True)
    return h

4.3 TRPO算法部分

流程与前一篇VPG几乎一致，只需要看update函数部分，看TRPO的更新如何实现就行。

def trpo(
    env_fn,
    actor_critic=core.MLPActorCritic,
    ac_kwargs=dict(),
    seed=0,
    steps_per_epoch=4000,
    epochs=50,
    gamma=0.99,
    vf_lr=1e-3,
    train_v_iters=80,
    lam=0.97,
    max_ep_len=1000,
    delta = 0.01,
    damping_coeff=0.1,
    cg_iter = 20,
    line_search_iter = 10,
    line_search_coeff = 0.8,
    save_path = "",
    save_freq = 10
):
	'''
	Trust Region Policy Optimization
	参数解释可以看第一篇博客VPG的代码部分，这里只解释新实用的参数
	delta	KL散度约束项
	damping_coeff	计算Hessian逆矩阵的时候有可能不可逆，所以在对角上增加一些很小的数，用来求逆
	cg_iter	conjugate gradient 估计迭代次数
    line_search_iter	计算是否符合约束的最大次数，一般都符合约束，不用迭代，这个参数不重要
    line_search_coeff	更新衰减系数，就是更新公式里的alpha
	'''
    torch.manual_seed(seed)
    np.random.seed(seed)
        
    env = env_fn()

    obs_dim = env.observation_space.shape
    act_dim = env.action_space.shape
    
    ac = actor_critic(env.observation_space, env.action_space, **ac_kwargs)
    # 为了估计新的策略性能有没有提升，设定一个测试用的ac
    test_ac = actor_critic(env.observation_space, env.action_space, **ac_kwargs)

	# 为了能够操作网络参数的形状，需要获取一些形状信息
    shape_list = [x.shape for x in ac.pi.state_dict().values()]
    key_list = ac.pi.state_dict().keys()
    
    history = {}
    history["reward"] = []
    
    buf = GAEBuffer(obs_dim, act_dim, steps_per_epoch, gamma, lam)
    
    def cg(Ax, b, cg_iter):
        '''
        Conjugate gradient algorithm\n
        for a given Ax=b, approximate x
        '''
        if use_cuda:
            b = b.detach().cpu().numpy()
        else:
            b = b.detach().numpy()
        x = np.zeros_like(b)
        r = b.copy()
        p = r.copy()
        r_dot_old = np.dot(r,r)
        for _ in range(cg_iter):
            z = Ax(torch.as_tensor(p))
            alpha = r_dot_old / (np.dot(p, z) + EPS)
            x += alpha * p
            r -= alpha * z
            r_dot_new = np.dot(r, r)
            p = r + (r_dot_new / (r_dot_old+EPS)) * p
            r_dot_old = r_dot_new
        return x

    def compute_loss_pi(data):
        obs, act, adv, logp_old = data['obs'], data['act'], data['adv'], data['logp']
        pi, logp = ac.pi(obs, act)
        loss_pi = (logp*adv).mean()
        kl_div = (logp_old-logp).mean()
        return loss_pi, kl_div, pi.entropy()
    
    def compute_loss_v(data):
        obs, ret = data['obs'], data['ret']
        return ((ac.v(obs) - ret)**2).mean()
    
    def compute_new_pi_loss(deltaTheta, data, flatten_old_theta):
        if use_cuda:
            new_theta = flatten_old_theta + torch.as_tensor(deltaTheta).to('cuda')
        else:
            new_theta = flatten_old_theta + torch.as_tensor(deltaTheta)
        new_theta = core.concat_flat_for_nn(new_theta, shape_list, key_list)
        test_ac.pi.load_state_dict(new_theta)
        
        obs, act, adv = data['obs'], data['act'], data['adv']
        _, logp = test_ac.pi(obs, act)
        loss_pi_new = (logp*adv).mean() 

        return loss_pi_new, new_theta
    
    vf_optimizer = torch.optim.Adam(ac.v.parameters(), lr=vf_lr)
    
    def update():
        data = buf.get()
        
        loss_pi, kl_div, ent = compute_loss_pi(data)
        
        # 计算策略梯度，并且展开成向量
        g_concat = torch.autograd.grad(loss_pi, ac.pi.parameters(), create_graph=True)
        g = core.flat_concat(g_concat)
        
        # 用Hessian vector product 计算 Hx 
        if use_cuda:
            Hx = lambda x : (core.flat_concat(core.hessian_vector_product(kl_div, x.to('cuda'), ac.pi))+damping_coeff*x.to('cuda')).detach().cpu().numpy()
        else:
            Hx = lambda x : (core.flat_concat(core.hessian_vector_product(kl_div, x, ac.pi))+damping_coeff*x).detach().numpy()
        # 用conjugate gradient计算 x = H-1 * g 
        x = cg(Hx, g, cg_iter)
        deltaTheta = np.sqrt(2*delta/np.abs(np.dot(x, Hx(torch.as_tensor(x)))+EPS))*x
        
        # 判断是否满足约束
        do_update = False
        flatten_old_theta = core.flat_concat(ac.pi.parameters())
        for j in range(line_search_iter):
            tmp_dTheta = deltaTheta*(line_search_coeff**j)
            loss_pi_new, new_state_dict = compute_new_pi_loss(tmp_dTheta, data, flatten_old_theta)
            Dkl_mean = 0.5*np.dot(tmp_dTheta, Hx(torch.as_tensor(tmp_dTheta)))
            # 如果策略性能有提升，并且KL散度满足约束，则更新策略参数
            if loss_pi_new > loss_pi and abs(Dkl_mean) <= delta and Dkl_mean is not None:
                do_update = True
                deltaTheta = tmp_dTheta
                print(f'Accepting new params at step {j+1} of line search.')
                break
        else:
            print('Line search failed! Keeping old params.')
            
        if do_update:
            ac.pi.load_state_dict(new_state_dict)
		# 优化价值函数
        for _ in range(train_v_iters):
            vf_optimizer.zero_grad()
            loss_v = compute_loss_v(data)
            loss_v.backward()
            vf_optimizer.step()
        return kl_div, loss_v, Dkl_mean, ent

    o, ep_ret, ep_len = env.reset(), 0, 0
    
    for epoch in range(epochs):
        mean_ep_ret = 0
        ep_ret_idx = 0
        mean_ep_len = 0
        ep_len_idx = 0
        max_ret = -99999
        min_ret = 999999
        for t in range(steps_per_epoch):
            a, v, logp = ac.step(torch.as_tensor(o, dtype=torch.float32))
            next_o, r, d, _ = env.step(a)
            ep_ret += r
            ep_len += 1
            
            buf.store(o,a,r,v,logp)
            
            o = next_o
            
            timeout = (ep_len == max_ep_len)
            terminal = (d or timeout)
            epoch_ended = (t == steps_per_epoch-1)
            
            if terminal or epoch_ended:
                if epoch_ended and not terminal:
                    pass
                if timeout or epoch_ended:
                    _, v, _ = ac.step(torch.as_tensor(o, dtype=torch.float32))
                else:
                    v = 0
                buf.finish_path(v)
                if terminal:
                    mean_ep_ret = (ep_ret_idx/(ep_ret_idx+1))*mean_ep_ret + (1/(ep_ret_idx+1))*ep_ret
                    mean_ep_len = (ep_len_idx/(ep_len_idx+1))*mean_ep_len + (1/(ep_len_idx+1))*ep_len
                    ep_ret_idx += 1
                    ep_len_idx += 1
                    if ep_ret > max_ret:
                        max_ret = ep_ret
                    if ep_ret < min_ret:
                        min_ret = ep_ret
                o, ep_ret, ep_len = env.reset(), 0, 0
                
        kl, loss_v, KL_theta, ent = update()
        
        print(f'Epoch: {epoch+1}')
        print('------------------------------------')
        print(f'EpRet: {mean_ep_ret}')
        print(f'EpLen: {mean_ep_len}')
        print(f'KL: {kl}')
        print(f'Entropy: {ent.mean()}')
        print(f'KLtheta: {KL_theta}')
        print(f'LossV: {loss_v}')
        print(f'MaxRet: {max_ret}')
        print(f'MinRet: {min_ret}')
        print('------------------------------------\n')
        print('当前进程的内存使用：%.4f GB' % (psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024 / 1024) )
        history['reward'].append(mean_ep_ret)
        
        if epoch%save_freq == 0 and save_path != "":
            torch.save(ac.state_dict(), save_path)
        
    return history

在这里插入图片描述
上一篇使用VPG对LunarLander环境进行300个epoch的训练才能得到120分，
这一次用TRPO只用了20个epoch就达到了。
性能大大增加
能很快地降落，接近地面时减速，调整位置，落到目标区间

但是TRPO毕竟是要计算二阶量还有逆矩阵，面对复杂的网络会有很高的计算量。
往后一篇会继续PPO，将优化改回一阶，提高了学习速度。