The EM Algorithm (Expectation-Maximization Algorithm): Maximum Likelihood Estimation of Parameters for Probabilistic Models with Hidden Variables

Sky飞羽 2022-02-14

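Introduction

A probabilistic model is fit by maximizing the conditional probability of the labels given the features, $P(y \mid X; \theta)$. Normally, given the labels $y$ and the features $X$ of the samples, we can estimate the model parameters $\theta$ directly with maximum likelihood estimation or Bayesian estimation.

However, when the label $y$ is an unobservable hidden variable, maximum likelihood estimation and Bayesian estimation can no longer be applied directly. In that case the expectation-maximization algorithm (EM) is used to obtain a maximum likelihood estimate of the model parameters.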

Derivation of the EM Algorithm

For a probabilistic model with hidden variables, the goal becomes maximizing the log-likelihood of the observed (incomplete) data $X$ with respect to the model parameters $\theta$:

$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$

$$L(\theta) = \log P(X \mid \theta) = \log \left[ \sum_{Y} P(X, Y \mid \theta) \right] = \log \left[ \sum_{Y} P(X \mid Y, \theta) P(Y \mid \theta) \right]$$
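As a concrete illustration of this marginalization, here is a minimal sketch. The model is an assumption made for the example and is not from the original post: a hypothetical two-coin binomial mixture, where each observation $x$ is the number of heads in `n` tosses of a coin chosen by a hidden variable $Y$, and $\theta = (\pi, p, q)$ holds the probability of choosing coin A and the head probabilities of coins A and B. The observations are assumed independent, so the log-likelihood is a sum over samples.

```python
import math


def binom_pmf(k, n, p):
    """Probability of k heads in n tosses of a coin with head probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def log_likelihood(xs, n, theta):
    """Observed-data log-likelihood L(theta) for the assumed two-coin mixture.

    theta = (pi, p, q): pi = P(Y = coin A), p and q = head probabilities
    of coin A and coin B (all hypothetical names for this sketch).
    """
    pi, p, q = theta
    total = 0.0
    for x in xs:
        # Sum over the hidden variable Y: P(x | Y, theta) * P(Y | theta)
        marginal = pi * binom_pmf(x, n, p) + (1 - pi) * binom_pmf(x, n, q)
        total += math.log(marginal)
    return total
```

The sum over $Y$ sits inside the logarithm, which is why setting the gradient of $L(\theta)$ to zero does not give a closed-form solution and an iterative scheme is used instead.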
The EM algorithm approaches the maximum of $L(\theta)$ step by step through iteration. Suppose the estimate of $\theta$ after the $i$-th iteration is $\theta^{(i)}$. The difference in log-likelihood between a candidate $\theta$ and the current estimate, $\Delta L(\theta)$, is then:

$$\Delta L(\theta) = L(\theta) - L(\theta^{(i)}) = \log \left[ \sum_{Y} P(X \mid Y, \theta) P(Y \mid \theta) \right] - \log P(X \mid \theta^{(i)})$$

Applying Jensen's inequality (see note 1) gives a lower bound on $\Delta L(\theta)$. First multiply and divide each term of the sum by $P(Y \mid X, \theta^{(i)})$:

$$L(\theta) - L(\theta^{(i)}) = \log \left[ \sum_{Y} P(Y \mid X, \theta^{(i)}) \frac{P(X \mid Y, \theta) P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)})} \right] - \log P(X \mid \theta^{(i)})$$

Then move the logarithm inside the sum:

$$L(\theta) - L(\theta^{(i)}) \ge \sum_{Y} \left[ P(Y \mid X, \theta^{(i)}) \log \left( \frac{P(X \mid Y, \theta) P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)})} \right) \right] - \log P(X \mid \theta^{(i)})$$
Because $\sum_{Y} P(Y \mid X, \theta^{(i)}) = 1$, the term $\log P(X \mid \theta^{(i)})$ can be moved inside the sum:

$$\sum_{Y} \left[ P(Y \mid X, \theta^{(i)}) \log \left( \frac{P(X \mid Y, \theta) P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)})} \right) \right] - \log P(X \mid \theta^{(i)}) = \sum_{Y} \left[ P(Y \mid X, \theta^{(i)}) \log \left( \frac{P(X \mid Y, \theta) P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)}) P(X \mid \theta^{(i)})} \right) \right]$$

Therefore

$$L(\theta) - L(\theta^{(i)}) \ge \sum_{Y} \left[ P(Y \mid X, \theta^{(i)}) \log \left( \frac{P(X \mid Y, \theta) P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)}) P(X \mid \theta^{(i)})} \right) \right]$$


Let

$$B(\theta, \theta^{(i)}) = L(\theta^{(i)}) + \sum_{Y} \left[ P(Y \mid X, \theta^{(i)}) \log \left( \frac{P(X \mid Y, \theta) P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)}) P(X \mid \theta^{(i)})} \right) \right]$$

Then

$$L(\theta) \ge B(\theta, \theta^{(i)})$$
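The bound $L(\theta) \ge B(\theta, \theta^{(i)})$ can be checked numerically. The sketch below is illustrative only and assumes a hypothetical two-coin binomial mixture that is not part of the original post: each observation is the number of heads in `n` tosses of a coin selected by a hidden $Y \in \{0, 1\}$, with $\theta = (\pi, p, q)$ and independent samples. It also uses the fact that $B(\theta^{(i)}, \theta^{(i)}) = L(\theta^{(i)})$, since every logarithm in the sum is $\log 1 = 0$ at $\theta = \theta^{(i)}$.

```python
import math


def joint(x, n, theta):
    """Return [P(x, Y=0 | theta), P(x, Y=1 | theta)] for one observation."""
    pi, p, q = theta
    c = math.comb(n, x)
    return [
        (1 - pi) * c * q**x * (1 - q) ** (n - x),  # Y = 0: coin B
        pi * c * p**x * (1 - p) ** (n - x),        # Y = 1: coin A
    ]


def log_lik(xs, n, theta):
    """Observed-data log-likelihood L(theta), summing the joint over Y."""
    return sum(math.log(sum(joint(x, n, theta))) for x in xs)


def lower_bound(xs, n, theta, theta_i):
    """B(theta, theta_i) for independent samples, term by term as in the text."""
    b = log_lik(xs, n, theta_i)
    for x in xs:
        j_new = joint(x, n, theta)
        j_old = joint(x, n, theta_i)
        px_old = sum(j_old)              # P(x | theta_i)
        for y in (0, 1):
            post = j_old[y] / px_old     # P(y | x, theta_i)
            b += post * math.log(j_new[y] / (post * px_old))
    return b


xs = [2, 9, 8, 3, 7]
theta_i = (0.5, 0.6, 0.4)
theta = (0.4, 0.8, 0.3)
print(lower_bound(xs, 10, theta, theta_i) <= log_lik(xs, 10, theta))
```

All parameters are kept strictly between 0 and 1 so that no logarithm of zero occurs.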

Since $B(\theta^{(i)}, \theta^{(i)}) = L(\theta^{(i)})$, any $\theta$ that increases $B(\theta, \theta^{(i)})$ above its value at $\theta^{(i)}$ also increases $L(\theta)$ above $L(\theta^{(i)})$. The parameter update can therefore be written as:

$$\hat{\theta} = \theta^{(i+1)} = \arg\max_{\theta} B(\theta, \theta^{(i)})$$

$$\hat{\theta} = \theta^{(i+1)} = \arg\max_{\theta} \left\{ L(\theta^{(i)}) + \sum_{Y} \left[ P(Y \mid X, \theta^{(i)}) \log \left( \frac{P(X \mid Y, \theta) P(Y \mid \theta)}{P(Y \mid X, \theta^{(i)}) P(X \mid \theta^{(i)})} \right) \right] \right\}$$

This expression is rather unwieldy, so we drop the terms that do not depend on $\theta$ and therefore do not affect the maximizer, which simplifies $B(\theta, \theta^{(i)})$. The maximum likelihood update of the model parameters then reduces to:

$$\hat{\theta} = \theta^{(i+1)} = \arg\max_{\theta} \sum_{Y} \left\{ P(Y \mid X, \theta^{(i)}) \log \left[ P(X \mid Y, \theta) P(Y \mid \theta) \right] \right\}$$

$$\hat{\theta} = \theta^{(i+1)} = \arg\max_{\theta} \sum_{Y} \left[ P(Y \mid X, \theta^{(i)}) \cdot \log P(X, Y \mid \theta) \right]$$
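To see exactly which terms were dropped, $B(\theta, \theta^{(i)})$ can be split into a part that depends on $\theta$ and a part that does not. This is a short worked step that uses only the definitions already given:

$$B(\theta, \theta^{(i)}) = \sum_{Y} P(Y \mid X, \theta^{(i)}) \log \left[ P(X \mid Y, \theta) P(Y \mid \theta) \right] + \left\{ L(\theta^{(i)}) - \sum_{Y} P(Y \mid X, \theta^{(i)}) \log \left[ P(Y \mid X, \theta^{(i)}) P(X \mid \theta^{(i)}) \right] \right\}$$

The term in braces contains only $\theta^{(i)}$. Using $L(\theta^{(i)}) = \log P(X \mid \theta^{(i)})$ and $\sum_{Y} P(Y \mid X, \theta^{(i)}) = 1$, it simplifies to

$$- \sum_{Y} P(Y \mid X, \theta^{(i)}) \log P(Y \mid X, \theta^{(i)})$$

which is constant with respect to $\theta$, so maximizing $B(\theta, \theta^{(i)})$ over $\theta$ is the same as maximizing the first sum alone.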
The quantity being maximized, $\sum_{Y} \left[ P(Y \mid X, \theta^{(i)}) \cdot \log P(X, Y \mid \theta) \right]$, is defined as the $Q$ function. It is the expectation of the complete-data log-likelihood $\log P(X, Y \mid \theta)$ with respect to the conditional distribution $P(Y \mid X, \theta^{(i)})$ of the unobserved data $Y$, given the observed data $X$ and the current parameters $\theta^{(i)}$, written $E_{Y}[\log P(X, Y \mid \theta) \mid X, \theta^{(i)}]$. So:

$$Q(\theta, \theta^{(i)}) = E_{Y}[\log P(X, Y \mid \theta) \mid X, \theta^{(i)}] = \sum_{Y} \left[ P(Y \mid X, \theta^{(i)}) \cdot \log P(X, Y \mid \theta) \right]$$
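The resulting iteration alternates between computing the posterior $P(Y \mid X, \theta^{(i)})$ (the E-step) and maximizing $Q(\theta, \theta^{(i)})$ over $\theta$ (the M-step). The following is a minimal sketch under assumptions that are not in the original post: a hypothetical two-coin binomial mixture, where each observation is the number of heads in `n` tosses of a coin chosen by a hidden variable, $\theta = (\pi, p, q)$, and the samples are independent. For this particular model the maximizer of $Q$ has a closed form (posterior-weighted averages), which is what the M-step uses.

```python
import math


def em_binomial_mixture(xs, n, theta, iters=20):
    """Run EM on the assumed two-coin mixture.

    Returns the final (pi, p, q) and the log-likelihood recorded
    at the start of each iteration.
    """
    pi, p, q = theta
    history = []
    for _ in range(iters):
        # E-step: posterior probability that each sample came from coin A,
        # P(Y = A | x, theta_i). The binomial coefficient cancels in the ratio.
        resp = []
        loglik = 0.0
        for x in xs:
            a = pi * p**x * (1 - p) ** (n - x)
            b = (1 - pi) * q**x * (1 - q) ** (n - x)
            resp.append(a / (a + b))
            loglik += math.log(math.comb(n, x) * (a + b))
        history.append(loglik)

        # M-step: closed-form argmax of Q(theta, theta_i) for this model
        r_sum = sum(resp)
        pi = r_sum / len(xs)
        p = sum(r * x for r, x in zip(resp, xs)) / (n * r_sum)
        q = sum((1 - r) * x for r, x in zip(resp, xs)) / (n * (len(xs) - r_sum))
    return (pi, p, q), history


theta_hat, history = em_binomial_mixture([2, 9, 8, 3, 7, 1, 8], 10, (0.5, 0.6, 0.4))
print(theta_hat)
```

The recorded log-likelihood values should never decrease from one iteration to the next, which follows from the lower-bound argument in the derivation. The result depends on the starting point: EM is only guaranteed to improve the likelihood, not to find its global maximum.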


  1. Jensen's inequality for the logarithm: $\log \left[ \sum_{j} \lambda_{j} y_{j} \right] \ge \sum_{j} \lambda_{j} \log(y_{j})$, where $y_{j} > 0$, $\lambda_{j} \ge 0$ and $\sum_{j} \lambda_{j} = 1$.
