Terminology
asymptotically: approaching a limiting behavior ever more closely
resurgence: a revival, a renewed rise
auditory cortex: the region of the brain that processes sound
somatosensory cortex: the region of the brain that processes touch
jam-packed: extremely full, crowded
dendrites and axon: a neuron's input fibers and its output fiber
bias unit: the constant input node fixed at 1
forward propagation: computing activations layer by layer from input to output
XOR: exclusive OR
XNOR: exclusive NOR (the negation of XOR)
negation: logical NOT; the opposite of a statement
Neural Networks: Representation
Motivations
Non-linear Hypotheses
The dimension of the feature space becomes too large.
With 10,000 pixels, taking all pairwise products gives on the order of $C_{10000}^{2} \approx 10000^2/2 = 5\times10^7$ quadratic features.
Performing linear regression with a complex set of data with many features is very unwieldy. Say you wanted to create a hypothesis from three (3) features that included all the quadratic terms:
$$g(\theta_0 + \theta_1 x_1^2 + \theta_2 x_1 x_2 + \theta_3 x_1 x_3 + \theta_4 x_2^2 + \theta_5 x_2 x_3 + \theta_6 x_3^2)$$
That gives us 6 features. The exact way to calculate how many features for all polynomial terms is the combination function with repetition: http://www.mathsisfun.com/combinatorics/combinations-permutations.html
$$\frac{(n+r-1)!}{r!\,(n-1)!}$$

In this case we are taking all two-element combinations of three features:
$$\frac{(3+2-1)!}{2!\,(3-1)!} = 6.$$
For 100 features, if we wanted to make them quadratic we would get $\frac{(100+2-1)!}{2!\,(100-1)!} = 5050$ resulting new features.
We can approximate the growth of the number of new features we get with all quadratic terms as $\mathcal{O}(n^2/2)$. If we also wanted to include all cubic terms in the hypothesis, the number of features would grow asymptotically as $\mathcal{O}(n^3)$. This growth is very steep: as the number of features increases, the number of quadratic or cubic terms increases very rapidly and quickly becomes impractical.
Example: let our training set be a collection of 50 × 50 pixel black-and-white photographs, and our goal is to classify which ones are photos of cars. Our feature set size is then n = 2500, one intensity value per pixel.
Now let's say we need to make a quadratic hypothesis function. With quadratic features, our growth is $\mathcal{O}(n^2/2)$, so our total number of features will be about $2500^2/2 = 3{,}125{,}000$, which is very impractical.
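As a sanity check on this counting argument, here is a minimal Python sketch (the helper name `num_poly_terms` is my own, not from the course) that evaluates the combinations-with-repetition formula for a few feature-set sizes:

```python
from math import comb

def num_poly_terms(n, r):
    """Number of degree-r monomials over n features:
    combinations with repetition, C(n + r - 1, r)."""
    return comb(n + r - 1, r)

# Quadratic terms (r = 2) for 3, 100, and 2500 features.
for n in (3, 100, 2500):
    print(n, num_poly_terms(n, 2))
# -> 3 6
# -> 100 5050
# -> 2500 3126250   (close to the n^2/2 ≈ 3,125,000 estimate above)
```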
Neural networks offer an alternative way to perform machine learning when we have complex hypotheses with many features.
Neurons and the Brain
Neural Networks
Model Representation
Neuron in the brain
Neuron model: Logistic unit
bias unit: the $x_0$ term; whether it is drawn depends on the context
activation function: the non-linearity $g$ applied to each unit's weighted input
weights: the network's parameters, equivalent to the $\theta$ discussed earlier
At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical signals (called “spikes”) that are channeled to outputs (axons). In our model, the dendrites are like the input features $x_1, \dots, x_n$, and the output is the result of our hypothesis function. In this model our $x_0$ input node is sometimes called the “bias unit.” It is always equal to 1. In neural networks, we use the same logistic function as in classification, $\frac{1}{1 + e^{-\theta^T x}}$, yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our “theta” parameters are sometimes called “weights”.
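A minimal sketch of this single logistic unit, assuming NumPy (the input and weight values below are arbitrary illustrations, not course values):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) activation g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# One logistic unit: three input features plus the bias unit x0 = 1.
x = np.array([1.0, 2.0, 0.5, -1.0])      # [x0, x1, x2, x3]
theta = np.array([-1.0, 0.8, 0.3, 1.2])  # weights, including the bias weight

h = sigmoid(theta @ x)                   # hypothesis h_theta(x)
print(h)
```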
Neural Network
input layer
hidden layer
output layer
computations
The values of the “activation” nodes are obtained as follows:
$$
\begin{aligned}
a_1^{(2)} &= g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3) \\
a_2^{(2)} &= g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3) \\
a_3^{(2)} &= g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3) \\
h_\Theta(x) &= a_1^{(3)} = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})
\end{aligned}
$$
$\Theta^{(1)}$ is the weight matrix for layer 1; here it is a 3×4 matrix (3 rows, 4 columns). Multiplying each row of weights by the input values yields the value of one activation node.
Here, $h_\Theta(x)$ is the logistic function applied to the weighted sum of the activation-node values, where the activations are multiplied by another parameter matrix $\Theta^{(2)}$ containing the weights for the second layer of nodes.
Each layer has its own weight matrix, and the dimensions of these weight matrices are defined as follows:
If a network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.
For example, in the figure below, layer 1 has 2 input nodes and layer 2 has 4 activation nodes, so $\Theta^{(1)}$ has size 4×3: the bias term $x_0$ must be included as well, which is why the mapping needs 4 rows and 3 columns.
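A quick NumPy sketch of this dimension rule for the 2-input / 4-hidden-unit example (the array contents are random placeholders, only the shapes matter):

```python
import numpy as np

s1, s2 = 2, 4                          # units in layer 1 and layer 2, excluding bias
Theta1 = np.random.randn(s2, s1 + 1)   # dimension s2 x (s1 + 1) = 4 x 3

a1 = np.array([1.0, 0.5, -0.2])        # layer-1 values with bias x0 = 1 prepended
z2 = Theta1 @ a1                       # shape (4,): one value per layer-2 unit
print(Theta1.shape, z2.shape)          # (4, 3) (4,)
```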
Forward propagation: Vectorized implementation
Element-wise
An element-wise operation acts between two tensors, operating on the corresponding elements of the two tensors.
For example, $v \odot w = s$ multiplies each input vector $v$ by a given “weight” vector $w$. In other words, each column of the data set is scaled by a multiplier. This transformation can be written as:
$$\{v_1, v_2, v_3\}^{T} \odot \{w_1, w_2, w_3\}^{T} = \{v_1 w_1,\ v_2 w_2,\ v_3 w_3\}^{T}$$
In short, elements in the same position are multiplied together: the corresponding components of the two vectors are multiplied pairwise.
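In NumPy this is simply the `*` operator between arrays of the same shape (a small illustration, not course code):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([10.0, 20.0, 30.0])

print(v * w)   # element-wise (Hadamard) product: [10. 40. 90.]
```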
Vectorized implementation
Write the expression inside the parentheses of $g(\Theta^{(1)}_{10}x_0+\Theta^{(1)}_{11}x_1+\Theta^{(1)}_{12}x_2+\Theta^{(1)}_{13}x_3)$ as $z_1^{(2)}$, where the superscript 2 refers to layer 2.
$z^{(2)}$ can be written as the weight matrix $\Theta^{(1)}$ times $x$; for a uniform notation, $x$ can also be written as the layer-1 activation vector $a^{(1)}$. The corresponding $a^{(2)}$ is then $g(z^{(2)})$, where $g$ is applied element-wise.
Add a bias unit $a_0^{(2)} = 1$ to layer 2; $z^{(3)}$ can then be written as the weight matrix $\Theta^{(2)}$ times the vector $a^{(2)}$. The final hypothesis is $h_\Theta(x) = a^{(3)} = g(z^{(3)})$.
In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable $z_k^{(j)}$ that encompasses the parameters inside our $g$ function. In our previous example, if we replaced all the parameters by the variable $z$, we would get:
$$
\begin{aligned}
a_1^{(2)} &= g(z_1^{(2)}) \\
a_2^{(2)} &= g(z_2^{(2)}) \\
a_3^{(2)} &= g(z_3^{(2)})
\end{aligned}
$$
In other words, for layer j=2 and node k, the variable z will be:
$$z_k^{(2)} = \Theta_{k,0}^{(1)} x_0 + \Theta_{k,1}^{(1)} x_1 + \cdots + \Theta_{k,n}^{(1)} x_n$$
The vector representations of $x$ and $z^{(j)}$ are:
$$
x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}
\qquad
z^{(j)} = \begin{bmatrix} z_1^{(j)} \\ z_2^{(j)} \\ \vdots \\ z_n^{(j)} \end{bmatrix}
$$
Setting $x = a^{(1)}$, we can rewrite the equation as:
$$z^{(j)} = \Theta^{(j-1)} a^{(j-1)}$$
We are multiplying our matrix $\Theta^{(j-1)}$, with dimensions $s_j \times (n+1)$ (where $s_j$ is the number of activation nodes), by our vector $a^{(j-1)}$, with height $(n+1)$. This gives us our vector $z^{(j)}$ with height $s_j$. Now we can get a vector of the activation nodes for layer $j$ as follows:
$$a^{(j)} = g(z^{(j)})$$
where our function $g$ is applied element-wise to the vector $z^{(j)}$.
We can then add a bias unit (equal to 1) to layer $j$ after we have computed $a^{(j)}$. This will be element $a_0^{(j)}$ and will be equal to 1. To compute our final hypothesis, let's first compute another $z$ vector:
$$z^{(j+1)} = \Theta^{(j)} a^{(j)}$$
We get this final $z$ vector by multiplying the next theta matrix after $\Theta^{(j-1)}$ with the values of all the activation nodes we just got. This last theta matrix $\Theta^{(j)}$ will have only one row, which is multiplied by the one column $a^{(j)}$, so that our result is a single number. We then get our final result with:
$$h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})$$
Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression. Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
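Putting the whole vectorized forward pass together, here is a minimal NumPy sketch for the 3-input / 3-hidden-unit / 1-output network above. The weight matrices are random placeholders with the correct shapes, not trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Vectorized forward propagation for one training example x."""
    a1 = np.concatenate(([1.0], x))            # add bias unit x0 = 1 -> a^(1)
    z2 = Theta1 @ a1                           # z^(2) = Theta^(1) a^(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a^(2) with bias a0^(2) = 1
    z3 = Theta2 @ a2                           # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                         # h_Theta(x) = a^(3)

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))   # 3 hidden units x (3 inputs + bias)
Theta2 = rng.normal(size=(1, 4))   # 1 output unit x (3 hidden units + bias)

print(forward(np.array([1.0, 0.0, 1.0]), Theta1, Theta2))
```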
Neural Network learning its own features
Note the part of the formula highlighted in blue in the lecture slide: it looks just like logistic regression, except that $\theta$ has become $\Theta$. What is distinctive about the neural network is that the inputs $a^{(2)}$ here are not fixed; they are themselves learned from the fixed inputs $x_1, x_2, x_3$.
Other network architectures
Neural networks can also be laid out in other ways; the way in which the units of a neural network are connected is called its architecture.
Applications
Examples and Intuitions I
Non-linear classification example: XOR/XNOR
XOR: exclusive OR; y = 1 when x1 and x2 differ, and y = 0 otherwise
XNOR: exclusive NOR; y = 1 when x1 and x2 are the same (the red crosses in the figure above), and y = 0 otherwise (the red circles)
XNOR is equivalent to NOT XOR
Simple example: AND
Example: OR function
Examples and Intuitions II
Negation (NOT)
Putting it together: x1 XNOR x2
By combining hidden-layer units in these more elaborate ways, a neural network can compute complex non-linear functions, as the sketch below illustrates.
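As a concrete check, the weights from the lecture for AND ($-30, 20, 20$), for (NOT x1) AND (NOT x2) ($10, -20, -20$), and for OR ($-10, 20, 20$) can be wired into a tiny two-layer network that computes x1 XNOR x2. A NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer 1: row 0 computes x1 AND x2, row 1 computes (NOT x1) AND (NOT x2).
Theta1 = np.array([[-30.0,  20.0,  20.0],
                   [ 10.0, -20.0, -20.0]])
# Layer 2: OR of the two hidden units -> x1 XNOR x2.
Theta2 = np.array([[-10.0, 20.0, 20.0]])

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = np.array([1.0, x1, x2])                        # input with bias
        a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))  # hidden with bias
        h = sigmoid(Theta2 @ a2)[0]
        print(x1, x2, round(h))   # prints 1 when x1 == x2, else 0
```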
Multiclass Classification
Multiple output units: One-vs-all
This is just like the “one-vs-all” method we discussed for logistic regression: here we essentially have four logistic regression classifiers, each of which tries to recognize one of the four classes that we want to distinguish among.
Previously the label y was written as a single number such as 1, 2, 3, or 4; now it is written as a vector such as [1, 0, 0, 0]. $x^{(i)}$ denotes the input image, and $y^{(i)}$ is the vector encoding the corresponding class: pedestrian / car / motorcycle / truck.
We want to find a way to make the neural network output values of this form.
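A small sketch of this label encoding, assuming NumPy (the class order pedestrian/car/motorcycle/truck is just for illustration):

```python
import numpy as np

classes = ["pedestrian", "car", "motorcycle", "truck"]

def one_hot(label_index, num_classes=4):
    """Encode class k as a vector with 1 in position k and 0 elsewhere."""
    y = np.zeros(num_classes)
    y[label_index] = 1.0
    return y

print(one_hot(1))   # "car" -> [0. 1. 0. 0.]
```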