Logistic Regression Model

Sky飞羽 · 2022-02-05 · Machine Learning


Cost Function

If we reuse the squared-error cost from linear regression with the logistic (sigmoid) hypothesis, the resulting cost function is not convex: its output is wavy, with many local optima, so gradient descent is not guaranteed to reach the global minimum.

  • Cost function for logistic regression looks like:

$Cost(h_\theta(x),y)=-\log(h_\theta(x))$ if $y=1$

$Cost(h_\theta(x),y)=-\log(1-h_\theta(x))$ if $y=0$

When $y=1$, $J(\theta)$ vs. $h_\theta(x)$:

(figure: the curve of $-\log(h_\theta(x))$ for $h_\theta(x)\in(0,1]$)

(Here the cost is $-\log(h_\theta(x^{(i)}))$, i.e. the graph of $\log x$ on $(0,1]$ flipped upside down.)

When $y=0$, $J(\theta)$ vs. $h_\theta(x)$:

(figure: the curve of $-\log(1-h_\theta(x))$ for $h_\theta(x)\in[0,1)$; the cost is $0$ at $h_\theta(x)=0$ and grows without bound as $h_\theta(x)\to1$)
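As a quick numerical sanity check (not from the original notes): if $y=1$ and $h_\theta(x)=0.9$, the cost is $-\log(0.9)\approx0.105$; if instead $h_\theta(x)=0.1$, the cost is $-\log(0.1)\approx2.30$, and it blows up toward $\infty$ as $h_\theta(x)\to0$. The $y=0$ case behaves symmetrically, heavily penalizing predictions near $1$.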

Simplified Cost Function and Gradient Descent

For classification problems, $y$ is always equal to 0 or 1.

We can compress the cost function's two conditional cases into a single expression:

$Cost(h_\theta(x),y)=-y\,\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$ (this works because $y$ can only take the values 0 or 1)

When $y=1$, the second term $(1-y)\log(1-h_\theta(x))=0$ and would not affect the result.

If $y=0$, the first term $-y\,\log(h_\theta(x))=0$ and would not affect the result.

  • Entire cost function:

$J(\theta)=-\frac{1}{m}\sum^{m}_{i=1}\left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$

  • A vectorized implementation:

    $h=g(X\theta)$

    $J(\theta)=\frac{1}{m}\cdot\left(-y^T\log(h)-(1-y)^T\log(1-h)\right)$
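A minimal Octave sketch of this computation (assuming a design matrix X whose first column is all ones, a 0/1 label vector y, and a parameter vector theta are already in the workspace):

g = @(z) 1 ./ (1 + exp(-z));     % sigmoid function
m = length(y);                   % number of training examples
h = g(X * theta);                % h = g(X*theta), vector of predictions
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % scalar cost J(theta)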

Minimize the cost function

Gradient Descent

General form of gradient descent:

Repeat {

$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$

}

Working out the derivative part using calculus, we get:

Repeat {

$\theta_j:=\theta_j-\frac{\alpha}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$

}

A vectorized implementation is:

$\theta:=\theta-\frac{\alpha}{m}X^T(g(X\theta)-\vec{y})$
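A minimal sketch of this update loop in Octave (alpha and num_iters are assumed hyperparameters; g, X, y, m, and theta as in the sketch above):

for iter = 1:num_iters
  % Simultaneously update every theta_j using the vectorized gradient
  theta = theta - (alpha / m) * X' * (g(X * theta) - y);
end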

Advanced Optimization

What is gradient descent actually doing?

We provide a function that, for a given input value $\theta$, evaluates:

- $J(\theta)$

- $\frac{\partial}{\partial\theta_j}J(\theta)$

for (j=0,1,…,n)

Gradient descent then repeats {

$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$

}

using these values to update $\theta$ on each iteration.

So we can write a single function that computes both $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$, then use Octave's “fminunc()” along with the “optimset()” function.

Optimization algorithms

  1. “Gradient Descent”
  2. “Conjugate gradient”
  3. “BFGS”
  4. “L-BFGS”

Advantages of the last three algorithms:

  • No need to manually pick $\alpha$ (they have an inner loop, called a line search algorithm, that automatically chooses a good learning rate $\alpha$, and can even use a different learning rate on each iteration)
  • Often faster than gradient descent

Disadvantages:

  • More complex

Code

function [jVal, gradient] = costFunction(theta)
  % jVal: the value of the cost function J(theta)
  jVal = [code to compute J(θ)];
  % gradient(j+1): the partial derivative of J(theta) with respect to theta_j
  gradient(1) = [code to compute ∂J(θ)/∂θ_0];
  gradient(2) = [code to compute ∂J(θ)/∂θ_1];
  ...
  gradient(n+1) = [code to compute ∂J(θ)/∂θ_n];
end
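A sketch of how this plugs into the optimizer (the exact option values are an assumption; 'GradObj','on' tells fminunc that costFunction also returns the gradient, and n is the number of features):

options = optimset('GradObj', 'on', 'MaxIter', 400);   % use our gradient, cap the iterations
initialTheta = zeros(n + 1, 1);                        % starting point for theta
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);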

Multiclass Classification: one vs all

- Multiclass classification problems

Examples: email foldering/tagging, medical diagnosis…

Also called the one-vs-rest problem.

Idea: split the problem into multiple binary classification problems, producing new “pseudo” training sets in which one class is labeled positive and all the remaining classes are labeled negative.

one vs all

Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that the example belongs to class $i$ (i.e. $y=1$ in the relabeled binary problem).

On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$. (Choose the classifier with the highest confidence, i.e. the best output.)

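A minimal one-vs-all sketch in Octave (K classes; X, y, n, and a new example matrix X_new are assumed to exist; g is the sigmoid defined earlier; train_logistic is a hypothetical helper standing in for whatever training routine is used, e.g. the fminunc call above):

all_theta = zeros(K, n + 1);            % row i holds the parameters of classifier i
for i = 1:K
  % Relabel: examples of class i become 1, everything else becomes 0,
  % then train an ordinary binary logistic regression classifier (hypothetical helper).
  all_theta(i, :) = train_logistic(X, (y == i))';
end

probs = g(X_new * all_theta');          % probability of each class for each new example
[~, prediction] = max(probs, [], 2);    % pick the class i with the highest h_theta^(i)(x)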
