19 Bayesian Linear Regression
19.1 Frequentist Linear Regression
Data and model:

- Samples:

$$
{\lbrace (x_i, y_i) \rbrace}_{i=1}^{N}, \quad x_i \in {\mathbb R}^p, \quad y_i \in {\mathbb R}
$$

$$
X = (x_1 \ x_2 \ \dots \ x_N)^T = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \dots & x_{Np} \end{pmatrix}, \quad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}
$$
- Regression model:

$$
f(x) = w^T x = x^T w, \quad y = f(x) + \underbrace{\varepsilon}_{noise}, \quad \varepsilon \sim N(0, \sigma^2)
$$

where $x, y, \varepsilon$ are random variables and $w$ denotes the parameters.
In frequentist linear regression, we assume $w$ is an unknown constant and turn estimation into an optimization problem. This approach is called point estimation, and we have already studied two such methods:
- $LSE \impliedby MLE \ (\text{noise is Gaussian})$, i.e., maximum likelihood estimation:

$$
w_{MLE} = \arg\max_{w} P(Data \mid w)
$$
- $Regularized \ LSE \impliedby MAP \ (\text{noise is Gaussian})$, i.e., maximum a posteriori estimation:

$$
w_{MAP} = \arg\max_{w} \underbrace{P(w \mid Data)}_{\propto P(Data \mid w) \cdot P(w)} = \arg\max_{w} P(Data \mid w) \cdot P(w)
$$
If the prior $P(w)$ is a Gaussian distribution, this yields ridge regression (Ridge); if $P(w)$ is a Laplace distribution, it yields Lasso.
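To make the two point estimates concrete, here is a minimal NumPy sketch on synthetic data. Everything in it (the data generator, the noise level `sigma`, the ridge strength `lam`) is an assumption made up for illustration, not part of the derivation above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N samples, p features, true weights chosen arbitrarily.
N, p, sigma = 100, 3, 0.5
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, p))
Y = X @ w_true + rng.normal(scale=sigma, size=N)

# MLE under Gaussian noise = ordinary least squares:
#   w_MLE = (X^T X)^{-1} X^T Y
w_mle = np.linalg.solve(X.T @ X, X.T @ Y)

# MAP with an isotropic Gaussian prior = ridge regression:
#   w_MAP = (X^T X + lam * I)^{-1} X^T Y, with lam = sigma^2 / prior variance
lam = 1.0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

print("w_true:", w_true)
print("w_MLE :", w_mle)
print("w_MAP :", w_map)
```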
In this chapter our goal is to solve linear regression with the Bayesian method:

- treat $w$ as a random variable
- compute the posterior $P(w \mid Data)$
19.2 Bayesian Method
Data and model:

- Sample data:

$$
{\lbrace (x_i, y_i) \rbrace}_{i=1}^{N}, \quad x_i \in {\mathbb R}^p, \quad y_i \in {\mathbb R}
$$

$$
X = (x_1 \ x_2 \ \dots \ x_N)^T = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \dots & x_{Np} \end{pmatrix}, \quad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}
$$
- Model:

$$
f(x) = w^T x = x^T w, \quad y = f(x) + \underbrace{\varepsilon}_{noise}, \quad \varepsilon \sim N(0, \sigma^2)
$$

where $x, y, \varepsilon, w$ are now all random variables; $w$ is the parameter being inferred.
- Problem formulation:

$$
\begin{cases} Inference: \ posterior(w) \\ Prediction: \ x^* \rightarrow y^* \end{cases}
$$
19.2.1 The Inference Problem
The inference problem is to compute the posterior $P(w \mid Data)$. We derive it step by step:

$$
P(w \mid Data) = P(w \mid X, Y) = \frac{P(w, Y \mid X)}{P(Y \mid X)} = \frac{\overbrace{P(Y \mid w, X)}^{likelihood} \cdot \overbrace{P(w \mid X)}^{prior}}{\int P(Y \mid w, X) \cdot P(w \mid X) \, {\rm d}w}
$$
With the posterior factored in this way, we only need to work out the likelihood and the prior separately (a short simulation sketch of this generative model follows the list):

- Likelihood:

$$
P(Y \mid w, X) = \prod_{i=1}^{N} P(y_i \mid w, x_i) = \prod_{i=1}^{N} N(y_i \mid w^T x_i, \sigma^2)
$$

- Assumed prior (independent of $X$, so $P(w \mid X) = P(w)$):

$$
P(w \mid X) = N(0, \Sigma_p)
$$
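For concreteness, here is a minimal sketch of sampling from this generative model, assuming an isotropic prior $\Sigma_p = I$; every concrete number below is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

N, p, sigma = 50, 2, 0.3
Sigma_p = np.eye(p)  # prior covariance (assumed isotropic here)

# Draw w from the prior N(0, Sigma_p), then data from the likelihood.
w = rng.multivariate_normal(np.zeros(p), Sigma_p)
X = rng.normal(size=(N, p))
Y = X @ w + rng.normal(scale=sigma, size=N)  # y_i = w^T x_i + eps_i
```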
The posterior can therefore be written as:

$$
\begin{aligned} P(w \mid Data) &\propto P(Y \mid w, X) \cdot P(w \mid X) \\ &\propto \prod_{i=1}^{N} N(y_i \mid w^T x_i, \sigma^2) \cdot N(0, \Sigma_p) \end{aligned}
$$
First, rewrite the likelihood in matrix form:

$$
\begin{aligned} P(Y \mid w, X) &= \prod_{i=1}^{N} N(y_i \mid w^T x_i, \sigma^2) \\ &= \prod_{i=1}^{N} \frac{1}{{(2 \pi)}^{\frac{1}{2}} \sigma} \exp{\lbrace -\frac{1}{2\sigma^2} {(y_i - w^T x_i)}^2 \rbrace} \\ &= \frac{1}{{(2 \pi)}^{\frac{N}{2}} \sigma^N} \exp{\lbrace -\frac{1}{2\sigma^2} \sum_{i=1}^N {(y_i - w^T x_i)}^2 \rbrace} \\ &= \frac{1}{{(2 \pi)}^{\frac{N}{2}} \underbrace{\sigma^N}_{{|\Sigma|}^{\frac{1}{2}}}} \exp{\lbrace -\frac{1}{2} {\underbrace{(Y - Xw)}_{x - \mu}}^T \underbrace{\sigma^{-2} I}_{\Sigma^{-1}} (Y - Xw) \rbrace} \\ &= N(Xw, \sigma^{2} I) \end{aligned}
$$

(The exponent identifies the precision $\Sigma^{-1} = \sigma^{-2} I$, hence the covariance of this multivariate Gaussian is $\sigma^2 I$.)
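A quick numerical sanity check of this vectorization, comparing the product of the $N$ univariate densities with the single multivariate density $N(Xw, \sigma^2 I)$; the test values below are arbitrary assumptions:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(2)
N, p, sigma = 5, 3, 0.7
X = rng.normal(size=(N, p))
w = rng.normal(size=p)
Y = X @ w + rng.normal(scale=sigma, size=N)

# Product of the N univariate Gaussian likelihood terms.
lhs = np.prod(norm.pdf(Y, loc=X @ w, scale=sigma))

# One multivariate Gaussian N(Xw, sigma^2 I).
rhs = multivariate_normal.pdf(Y, mean=X @ w, cov=sigma**2 * np.eye(N))

print(np.isclose(lhs, rhs))  # True
```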
Using the likelihood above, we can now evaluate the posterior:

$$
\begin{aligned} P(w \mid Data) &\propto P(Y \mid w, X) \cdot P(w \mid X) = N(Xw, \sigma^{2} I) \cdot N(0, \Sigma_p) \\ &\propto \exp{\lbrace -\frac{1}{2} {(Y - Xw)}^T \sigma^{-2} I (Y - Xw) \rbrace} \cdot \exp{\lbrace -\frac{1}{2} w^T \Sigma_p^{-1} w \rbrace} \\ &= \exp{\lbrace -\frac{1}{2} {(Y - Xw)}^T \sigma^{-2} I (Y - Xw) - \frac{1}{2} w^T \Sigma_p^{-1} w \rbrace} \\ &= \exp{\lbrace -\frac{1}{2\sigma^2} (Y^T Y - 2 Y^T X w + w^T X^T X w) - \frac{1}{2} w^T \Sigma_p^{-1} w \rbrace} \end{aligned}
$$
Now complete the square: collect the quadratic and linear terms in $w$ from $P(w \mid Data)$ and match them against the exponent of a Gaussian $N(\mu_w, \Sigma_w)$, whose quadratic term is $-\frac{1}{2} w^T \Sigma_w^{-1} w$ and whose linear term is $\mu_w^T \Sigma_w^{-1} w$:

$$
\begin{aligned} &\begin{cases} \text{quadratic term: } -\frac{1}{2 \sigma^2} w^T X^T X w - \frac{1}{2} w^T \Sigma_p^{-1} w = \underbrace{-\frac{1}{2} w^T (\sigma^{-2} X^T X + \Sigma_p^{-1}) w}_{-\frac{1}{2} w^T \Sigma_w^{-1} w} \\ \text{linear term: } \underbrace{\sigma^{-2} Y^T X w}_{\mu_w^T \Sigma_w^{-1} w} \end{cases} \\ \implies &\begin{cases} \Sigma_w^{-1} = \sigma^{-2} X^T X + \Sigma_p^{-1} \\ \mu_w^T \Sigma_w^{-1} = \sigma^{-2} Y^T X \end{cases} \end{aligned}
$$
From these two equations the covariance and mean follow directly (for the mean, right-multiply the second equation by $\Sigma_w$ and transpose, using the symmetry of $\Sigma_w$):

$$
\begin{cases} \Sigma_w = {(\sigma^{-2} X^T X + \Sigma_p^{-1})}^{-1} \\ \mu_w = \sigma^{-2} \Sigma_w X^T Y \end{cases}
$$
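In code, this closed-form posterior is a few lines. A sketch, reusing the synthetic-data conventions of the earlier snippets (the particular `X`, `Y`, `sigma`, `Sigma_p` values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, sigma = 100, 3, 0.5
Sigma_p = np.eye(p)                  # prior covariance (assumed)
w_true = np.array([1.0, -2.0, 0.5])  # ground truth for the synthetic data
X = rng.normal(size=(N, p))
Y = X @ w_true + rng.normal(scale=sigma, size=N)

# Posterior precision: Sigma_w^{-1} = sigma^{-2} X^T X + Sigma_p^{-1}
prec_w = X.T @ X / sigma**2 + np.linalg.inv(Sigma_p)
Sigma_w = np.linalg.inv(prec_w)

# Posterior mean: mu_w = sigma^{-2} Sigma_w X^T Y
mu_w = Sigma_w @ X.T @ Y / sigma**2

print("posterior mean:", mu_w)  # should approach w_true as N grows
```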
19.2.2 The Prediction Problem
The prediction problem: given a new input $x^*$, find the conditional distribution of $y^*$.
We have:

$$
\begin{cases} f(x) = x^T w \\ w \sim N(\mu_w, \Sigma_w) \end{cases}
$$
Since $f(x^*) = {x^*}^T w$ and we know the posterior distribution of $w$, the distribution of ${x^*}^T w$ follows (a linear function of a Gaussian is Gaussian):

$$
\begin{aligned} & w \sim N(\mu_w, \Sigma_w) \\ \implies & {x^*}^T w \sim N({x^*}^T \mu_w, \ {x^*}^T \Sigma_w x^*) \end{aligned}
$$
In practice we want $y^* = f(x^*) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$, i.e., the predictive distribution $P(y^* \mid Data, x^*)$. Since the noise $\varepsilon$ is independent of $w$, the two variances add:

$$
\begin{aligned} &\begin{cases} y^* = {x^*}^T w + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \\ {x^*}^T w \sim N({x^*}^T \mu_w, \ {x^*}^T \Sigma_w x^*) \end{cases} \\ \implies & P(y^* \mid Data, x^*) = N({x^*}^T \mu_w, \ {x^*}^T \Sigma_w x^* + \sigma^2) \end{aligned}
$$
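To close the loop, here is a self-contained sketch of the predictive distribution, recomputing the posterior from the inference section on assumed synthetic data (`x_star` is an arbitrary test point, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, sigma = 100, 3, 0.5
Sigma_p = np.eye(p)
X = rng.normal(size=(N, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=sigma, size=N)

# Posterior from section 19.2.1.
Sigma_w = np.linalg.inv(X.T @ X / sigma**2 + np.linalg.inv(Sigma_p))
mu_w = Sigma_w @ X.T @ Y / sigma**2

# Predictive distribution: P(y* | Data, x*) = N(x*^T mu_w, x*^T Sigma_w x* + sigma^2)
x_star = np.array([0.5, -1.0, 2.0])
pred_mean = x_star @ mu_w
pred_var = x_star @ Sigma_w @ x_star + sigma**2
print(f"y* ~ N({pred_mean:.3f}, {pred_var:.3f})")
```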