I am continuing to study the video lecture "ML Lecture 23-1: Deep Reinforcement Learning by Hung-yi Lee" (https://youtu.be/W8XF3ME8G2I?si=zEQ3qj_iXzZZ-n85), which presents the following:
“”"
Gradient Ascent
θ
new
←
θ
old
+
η
∇
R
ˉ
θ
old
=
∑
t
=
1
T
∇
log
p
(
a
t
∣
s
t
,
θ
)
∇
R
ˉ
θ
≈
1
N
∑
n
=
1
N
R
(
τ
n
)
∇
log
P
(
τ
n
∣
θ
)
=
1
N
∑
n
=
1
N
R
(
τ
n
)
∑
t
=
1
T
n
∇
log
p
(
a
t
n
∣
s
t
n
,
θ
)
=
1
N
∑
n
=
1
N
∑
t
=
1
T
n
R
(
τ
o
n
)
∇
log
‾
p
(
a
t
n
∣
s
t
n
,
θ
)
\begin{aligned} & \begin{array}{l} \text { Gradient Ascent } \\ \theta^{\text {new }} \leftarrow \theta^{\text {old }}+\eta \nabla \bar{R}_{\theta^{\text {old }}} \end{array} \quad=\sum_{t=1}^T \nabla \log p\left(a_t \mid s_t, \theta\right) \\ & \nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^N R\left(\tau^n\right) \nabla \log P\left(\tau^n \mid \theta\right)=\frac{1}{N} \sum_{n=1}^N R\left(\tau^n\right) \sum_{t=1}^{T_n} \nabla \log p\left(a_t^n \mid s_t^n, \theta\right) \\ & =\frac{1}{N} \sum_{n=1}^N \sum_{t=1}^{T_n} R\left(\tau_o^n\right) \nabla \underline{\log } p\left(a_t^n \mid s_t^n, \theta\right) \\ & \end{aligned}
Gradient Ascent θnew ←θold +η∇Rˉθold =t=1∑T∇logp(at∣st,θ)∇Rˉθ≈N1n=1∑NR(τn)∇logP(τn∣θ)=N1n=1∑NR(τn)t=1∑Tn∇logp(atn∣stn,θ)=N1n=1∑Nt=1∑TnR(τon)∇logp(atn∣stn,θ)
“”"
"The differentiation in this Gradient Ascent step matches human intuition very well: when $R\left(\tau^n\right)$ is positive, it raises the probability of every action taken during that winning trajectory; when $R\left(\tau^n\right)$ is negative, it lowers the probability of those actions."

Is this statement correct?
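To make my question concrete, here is a tiny numerical check of the sign intuition described above, assuming a single state and a softmax policy over two actions (the step size, logits, and numbers are hypothetical):

```python
# Tiny check: does the sign of R(tau) raise or lower p(a_taken)?
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = np.array([0.0, 0.0])        # logits for actions 0 and 1
a_taken, eta = 0, 0.5

for R in (+1.0, -1.0):
    p = softmax(theta)
    grad_log_p = -p.copy()
    grad_log_p[a_taken] += 1.0      # d log p(a_taken | theta) / d theta
    theta_new = theta + eta * R * grad_log_p
    p_new = softmax(theta_new)
    print(f"R = {R:+.0f}: p(a_taken) {p[a_taken]:.3f} -> {p_new[a_taken]:.3f}")
# Expectation: with R = +1 the taken action's probability goes up,
# with R = -1 it goes down.
```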