Logistic Regression Model
Cost Function
If we used the squared-error cost from linear regression with the logistic (sigmoid) hypothesis, $J(\theta)$ would not be a convex function: the output would be wavy, with many local optima, so gradient descent is not guaranteed to reach the global minimum.
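For reference (restating the squared-error cost from the linear-regression lectures, i.e. the cost being ruled out here): with the sigmoid hypothesis it would be
$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\big(h_\theta(x^{(i)})-y^{(i)}\big)^2,\qquad h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}},$$
and it is the nonlinear sigmoid inside the square that destroys convexity in $\theta$.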
- The cost function for logistic regression is defined piecewise:
  $\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if $y = 1$
  $\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$ if $y = 0$
When $y=1$, $J(\theta)$ vs. $h_\theta(x)$:
(Here the cost is $-\log(h_\theta(x^{(i)}))$, i.e. the graph of $\log x$ on $(0,1]$ flipped.)
When $y=0$, $J(\theta)$ vs. $h_\theta(x)$:
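To see the shapes of these two curves, a minimal Octave sketch (my own, not lecture code) that plots the per-example cost against $h_\theta(x)$:

```octave
% Per-example cost curves for y = 1 and y = 0
h = linspace(0.001, 0.999, 200);   % possible values of h_theta(x), avoiding log(0)
cost_y1 = -log(h);                 % y = 1: cost grows without bound as h -> 0, cost = 0 at h = 1
cost_y0 = -log(1 - h);             % y = 0: cost = 0 at h = 0, cost grows without bound as h -> 1
plot(h, cost_y1, 'b', h, cost_y0, 'r');
xlabel('h_\theta(x)'); ylabel('Cost');
legend('y = 1', 'y = 0');
```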
Simplified Cost Function and Gradient Descent
For binary classification problems, $y$ is always equal to 0 or 1.
We can compress the cost function's two conditional cases into a single expression:
$\mathrm{Cost}(h_\theta(x),y) = -y\,\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$ (this works because $y$ only takes the two values 0 and 1)
When $y=1$, the second term $(1-y)\log(1-h_\theta(x)) = 0$ and does not affect the result.
When $y=0$, the first term $-y\,\log(h_\theta(x)) = 0$ and does not affect the result.
- The entire cost function:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$
- A vectorized implementation:
$h = g(X\theta)$
$J(\theta) = \frac{1}{m}\cdot\left(-y^T\log(h) - (1-y)^T\log(1-h)\right)$
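A minimal Octave sketch of this vectorized cost (the function name `logisticCost` is mine, and it assumes `X` already contains the column of ones):

```octave
% Vectorized cost J(theta) for logistic regression
% X: m x (n+1) design matrix, y: m x 1 labels in {0,1}, theta: (n+1) x 1
function J = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                      % h = g(X*theta)
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));    % J(theta)
end
```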
Minimize the cost function
- Gradient Descent
General form of gradient descent:
Repeat {
$\quad \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$
}
Working out the derivative term using calculus gives:
Repeat {
$\quad \theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}$
}
A vectorized implementation is:
$\theta := \theta - \frac{\alpha}{m}X^T\big(g(X\theta) - \vec{y}\big)$
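As a sketch of the whole loop (hypothetical function name; `alpha` and `num_iters` are left to the user), applying this vectorized update in Octave:

```octave
% Batch gradient descent for logistic regression
function theta = gradientDescent(theta, X, y, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = 1 ./ (1 + exp(-(X * theta)));            % g(X*theta)
    theta = theta - (alpha/m) * X' * (h - y);    % theta := theta - (alpha/m) * X'*(g(X*theta) - y)
  end
end
```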
Advanced Optimization
What is gradient descent actually doing?
We provide a function that, for a given input value $\theta$, evaluates the two quantities:
- $J(\theta)$
- $\frac{\partial}{\partial\theta_j}J(\theta)$ (for $j = 0, 1, \dots, n$)
Gradient descent repeats {
$\quad \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$
}
and updates $\theta$ on each iteration.
So we can write a single function that computes both $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$, then use “fminunc()” along with the “optimset()” function.
Optimization algorithms
- “Gradient Descent”
- “Conjugate gradient”
- “BFGS”
- “L-BFGS”
Advantages of the last three algorithms:
- No need to manually pick $\alpha$ (they have an inner loop, called a line search algorithm, that automatically chooses a good learning rate $\alpha$, and may even use a different learning rate on each iteration)
- Often faster than gradient descent
Disadvantages:
- More complex
Code template (Octave):
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient(1) = [code to compute d J(theta) / d theta_0];
  gradient(2) = [code to compute d J(theta) / d theta_1];
  ...
  gradient(n+1) = [code to compute d J(theta) / d theta_n];
end
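For example, a hedged sketch of how this template might be filled in with the vectorized formulas above and passed to “fminunc()” via “optimset()” (my own version, not the exact lecture code):

```octave
% Concrete cost function for logistic regression
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                        % h = g(X*theta)
  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % J(theta)
  gradient = (1/m) * X' * (h - y);                         % all partial derivatives at once
end
```

Then, at the Octave prompt:

```octave
options = optimset('GradObj', 'on', 'MaxIter', 100);   % tell fminunc that we supply the gradient
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);
```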
Multiclass Classification: one vs all
- Multiclass classification problems
Examples: email foldering/tagging, medical diagnosis, …
Also called the one-vs-rest problem.
Idea: split the problem into multiple binary classification problems, producing new “pseudo” training sets in which one class is the positive class and all remaining classes form the negative class.
one vs all
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = 1$ (i.e. that the example belongs to class $i$).
On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$ (i.e. choose the classifier that is most confident).
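A minimal Octave sketch of the one-vs-all idea (hypothetical names `oneVsAll` / `predictOneVsAll`; training reuses the `gradientDescent` sketch from the previous section):

```octave
% Train K binary classifiers, one per class, on "pseudo" training sets
% X: m x (n+1) design matrix, y: m x 1 labels in 1..K
function all_theta = oneVsAll(X, y, K, alpha, num_iters)
  all_theta = zeros(K, size(X, 2));
  for i = 1:K
    yi = double(y == i);                         % class i positive, every other class negative
    all_theta(i, :) = gradientDescent(zeros(size(X, 2), 1), X, yi, alpha, num_iters)';
  end
end

% Prediction: evaluate every classifier h_theta^(i)(x) and pick the most confident class
function p = predictOneVsAll(all_theta, X)
  probs = 1 ./ (1 + exp(-(X * all_theta')));     % m x K matrix of h_theta^(i)(x)
  [max_prob, p] = max(probs, [], 2);             % p = class i with the largest probability
end
```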