I am continuing to study the video lecture "ML Lecture 23-1: Deep Reinforcement Learning by Hung-yi Lee" (https://youtu.be/W8XF3ME8G2I?si=zEQ3qj_iXzZZ-n85), which presents the following:
“”"
Gradient Ascent
θ
new
←
θ
old
+
η
∇
R
ˉ
θ
old
=
∑
t
=
1
T
∇
log
p
(
a
t
∣
s
t
,
θ
)
∇
R
ˉ
θ
≈
1
N
∑
n
=
1
N
R
(
τ
n
)
∇
log
P
(
τ
n
∣
θ
)
=
1
N
∑
n
=
1
N
R
(
τ
n
)
∑
t
=
1
T
n
∇
log
p
(
a
t
n
∣
s
t
n
,
θ
)
=
1
N
∑
n
=
1
N
∑
t
=
1
T
n
R
(
τ
o
n
)
∇
log
‾
p
(
a
t
n
∣
s
t
n
,
θ
)
\begin{aligned} & \begin{array}{l} \text { Gradient Ascent } \\ \theta^{\text {new }} \leftarrow \theta^{\text {old }}+\eta \nabla \bar{R}_{\theta^{\text {old }}} \end{array} \quad=\sum_{t=1}^T \nabla \log p\left(a_t \mid s_t, \theta\right) \\ & \nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^N R\left(\tau^n\right) \nabla \log P\left(\tau^n \mid \theta\right)=\frac{1}{N} \sum_{n=1}^N R\left(\tau^n\right) \sum_{t=1}^{T_n} \nabla \log p\left(a_t^n \mid s_t^n, \theta\right) \\ & =\frac{1}{N} \sum_{n=1}^N \sum_{t=1}^{T_n} R\left(\tau_o^n\right) \nabla \underline{\log } p\left(a_t^n \mid s_t^n, \theta\right) \\ & \end{aligned}
Gradient Ascent θnew ←θold +η∇Rˉθold =t=1∑T∇logp(at∣st,θ)∇Rˉθ≈N1n=1∑NR(τn)∇logP(τn∣θ)=N1n=1∑NR(τn)t=1∑Tn∇logp(atn∣stn,θ)=N1n=1∑Nt=1∑TnR(τon)∇logp(atn∣stn,θ)
“”"
"The differentiation in this Gradient Ascent step matches human intuition very well: when $R\left(\tau^n\right)$ is positive, it raises the probability of every action taken during that winning trajectory; when $R\left(\tau^n\right)$ is negative, it lowers the probability of those actions."

Is this statement correct?
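To make my question concrete, here is a tiny numerical check of the sign intuition described above, assuming a single state and a softmax policy over two actions (the step size, logits, and numbers are hypothetical):

```python
# Tiny check: does the sign of R(tau) raise or lower p(a_taken)?
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.array([0.0, 0.0])        # logits for actions 0 and 1
a_taken, eta = 0, 0.5

for R in (+1.0, -1.0):
    p = softmax(theta)
    grad_log_p = -p.copy()
    grad_log_p[a_taken] += 1.0      # d log p(a_taken | theta) / d theta
    theta_new = theta + eta * R * grad_log_p
    p_new = softmax(theta_new)
    print(f"R = {R:+.0f}: p(a_taken) {p[a_taken]:.3f} -> {p_new[a_taken]:.3f}")
# Expectation: with R = +1 the taken action's probability goes up,
# with R = -1 it goes down.
```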