Logistic Regression Model
Cost Function
If we used the squared-error cost from linear regression with the logistic (sigmoid) hypothesis, $J(\theta)$ would not be a convex function: the output would be wavy, with many local optima, so gradient descent is not guaranteed to reach the global minimum.
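For reference (restating the squared-error cost from the linear-regression lectures, i.e. the cost being ruled out here): with the sigmoid hypothesis it would be
$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\big(h_\theta(x^{(i)})-y^{(i)}\big)^2,\qquad h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}},$$
and it is the nonlinear sigmoid inside the square that destroys convexity in $\theta$.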
- The cost function for logistic regression is defined piecewise:
  $\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if $y = 1$
  $\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$ if $y = 0$
When $y=1$, $J(\theta)$ vs. $h_\theta(x)$:
(Here the cost is $-\log(h_\theta(x^{(i)}))$, i.e. the graph of $\log x$ on $(0,1]$ flipped.)
When $y=0$, $J(\theta)$ vs. $h_\theta(x)$:
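To see the shapes of these two curves, a minimal Octave sketch (my own, not lecture code) that plots the per-example cost against $h_\theta(x)$:

```octave
% Per-example cost curves for y = 1 and y = 0
h = linspace(0.001, 0.999, 200);   % possible values of h_theta(x), avoiding log(0)
cost_y1 = -log(h);                 % y = 1: cost grows without bound as h -> 0, cost = 0 at h = 1
cost_y0 = -log(1 - h);             % y = 0: cost = 0 at h = 0, cost grows without bound as h -> 1
plot(h, cost_y1, 'b', h, cost_y0, 'r');
xlabel('h_\theta(x)'); ylabel('Cost');
legend('y = 1', 'y = 0');
```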
Simplified Cost Function and Gradient Descent
For binary classification problems, $y$ is always equal to 0 or 1.
We can compress the cost function's two conditional cases into a single expression:
$\mathrm{Cost}(h_\theta(x),y) = -y\,\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$ (this works because $y$ only takes the two values 0 and 1)
When $y=1$, the second term $(1-y)\log(1-h_\theta(x)) = 0$ and does not affect the result.
When $y=0$, the first term $-y\,\log(h_\theta(x)) = 0$ and does not affect the result.
- The entire cost function:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$
- A vectorized implementation:
$h = g(X\theta)$
$J(\theta) = \frac{1}{m}\cdot\left(-y^T\log(h) - (1-y)^T\log(1-h)\right)$
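A minimal Octave sketch of this vectorized cost (the function name `logisticCost` is mine, and it assumes `X` already contains the column of ones):

```octave
% Vectorized cost J(theta) for logistic regression
% X: m x (n+1) design matrix, y: m x 1 labels in {0,1}, theta: (n+1) x 1
function J = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                      % h = g(X*theta)
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));    % J(theta)
end
```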
Minimize the cost function
- Gradient Descent
General form of gradient descent:
Repeat {
$\quad \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$
}
Working out the derivative term using calculus gives:
Repeat {
$\quad \theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}$
}
A vectorized implementation is:
$\theta := \theta - \frac{\alpha}{m}X^T\big(g(X\theta) - \vec{y}\big)$
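As a sketch of the whole loop (hypothetical function name; `alpha` and `num_iters` are left to the user), applying this vectorized update in Octave:

```octave
% Batch gradient descent for logistic regression
function theta = gradientDescent(theta, X, y, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = 1 ./ (1 + exp(-(X * theta)));            % g(X*theta)
    theta = theta - (alpha/m) * X' * (h - y);    % theta := theta - (alpha/m) * X'*(g(X*theta) - y)
  end
end
```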
Advanced Optimization
What is gradient descent actually doing?
We provide a function that, for a given input value $\theta$, evaluates the two quantities:
- $J(\theta)$
- $\frac{\partial}{\partial\theta_j}J(\theta)$ (for $j = 0, 1, \dots, n$)
Gradient descent repeats {
$\quad \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$
}
and updates $\theta$ on each iteration.
So we can write a single function that computes both $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$, then use “fminunc()” along with the “optimset()” function.
Optimization algorithms
- “Gradient Descent”
- “Conjugate gradient”
- “BFGS”
- “L-BFGS”
Advantages of the last three algorithms:
- No need to manually pick $\alpha$ (they have an inner loop, called a line search algorithm, that automatically chooses a good learning rate $\alpha$, and may even use a different learning rate on each iteration)
- Often faster than gradient descent
Disadvantages:
- More complex
Code template (Octave):
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient(1) = [code to compute d J(theta) / d theta_0];
  gradient(2) = [code to compute d J(theta) / d theta_1];
  ...
  gradient(n+1) = [code to compute d J(theta) / d theta_n];
end
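For example, a hedged sketch of how this template might be filled in with the vectorized formulas above and passed to “fminunc()” via “optimset()” (my own version, not the exact lecture code):

```octave
% Concrete cost function for logistic regression
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                        % h = g(X*theta)
  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % J(theta)
  gradient = (1/m) * X' * (h - y);                         % all partial derivatives at once
end
```

Then, at the Octave prompt:

```octave
options = optimset('GradObj', 'on', 'MaxIter', 100);   % tell fminunc that we supply the gradient
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);
```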
Multiclass Classification: one vs all
- Multiclass classification problems
Examples: email foldering/tagging, medical diagnosis, …
Also called the one-vs-rest problem.
Idea: split the problem into multiple binary classification problems, producing new “pseudo” training sets in which one class is the positive class and all remaining classes form the negative class.
one vs all
Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = 1$ (i.e. that the example belongs to class $i$).
On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$ (i.e. choose the classifier that is most confident).
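A minimal Octave sketch of the one-vs-all idea (hypothetical names `oneVsAll` / `predictOneVsAll`; training reuses the `gradientDescent` sketch from the previous section):

```octave
% Train K binary classifiers, one per class, on "pseudo" training sets
% X: m x (n+1) design matrix, y: m x 1 labels in 1..K
function all_theta = oneVsAll(X, y, K, alpha, num_iters)
  all_theta = zeros(K, size(X, 2));
  for i = 1:K
    yi = double(y == i);                         % class i positive, every other class negative
    all_theta(i, :) = gradientDescent(zeros(size(X, 2), 1), X, yi, alpha, num_iters)';
  end
end

% Prediction: evaluate every classifier h_theta^(i)(x) and pick the most confident class
function p = predictOneVsAll(all_theta, X)
  probs = 1 ./ (1 + exp(-(X * all_theta')));     % m x K matrix of h_theta^(i)(x)
  [max_prob, p] = max(probs, [], 2);             % p = class i with the largest probability
end
```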