Terminology
asymptotically: approaching a limiting behavior ever more closely
resurgence: a revival, a renewed rise
auditory cortex: the region of the brain that processes sound
somatosensory cortex: the region of the brain that processes touch
jam-packed: extremely full, crowded
dendrites and axon: a neuron's input fibers and its output fiber
bias unit: the constant input node fixed at 1
forward propagation: computing activations layer by layer from input to output
XOR: exclusive OR
XNOR: exclusive NOR (the negation of XOR)
negation: logical NOT; the opposite of a statement
Neural Networks: Representation
Motivations
Non-linear Hypotheses
The dimension of the feature space becomes too large.
With 10,000 pixels, taking all pairwise products gives on the order of $C_{10000}^{2} \approx 10000^2/2 = 5\times10^7$ quadratic features.
Performing linear regression with a complex set of data with many features is very unwieldy. Say you wanted to create a hypothesis from three (3) features that included all the quadratic terms:
$$g(\theta_0 + \theta_1 x_1^2 + \theta_2 x_1 x_2 + \theta_3 x_1 x_3 + \theta_4 x_2^2 + \theta_5 x_2 x_3 + \theta_6 x_3^2)$$
That gives us 6 features. The exact way to calculate how many features for all polynomial terms is the combination function with repetition: http://www.mathsisfun.com/combinatorics/combinations-permutations.html
$$\frac{(n+r-1)!}{r!\,(n-1)!}$$

In this case we are taking all two-element combinations of three features:
$$\frac{(3+2-1)!}{2!\,(3-1)!} = 6.$$
For 100 features, if we wanted to make them quadratic we would get $\frac{(100+2-1)!}{2!\,(100-1)!} = 5050$ resulting new features.
We can approximate the growth of the number of new features we get with all quadratic terms as $\mathcal{O}(n^2/2)$. If we also wanted to include all cubic terms in the hypothesis, the number of features would grow asymptotically as $\mathcal{O}(n^3)$. This growth is very steep: as the number of features increases, the number of quadratic or cubic terms increases very rapidly and quickly becomes impractical.
Example: let our training set be a collection of 50 × 50 pixel black-and-white photographs, and our goal is to classify which ones are photos of cars. Our feature set size is then n = 2500, one intensity value per pixel.
Now let's say we need to make a quadratic hypothesis function. With quadratic features, our growth is $\mathcal{O}(n^2/2)$, so our total number of features will be about $2500^2/2 = 3{,}125{,}000$, which is very impractical.
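As a sanity check on this counting argument, here is a minimal Python sketch (the helper name `num_poly_terms` is my own, not from the course) that evaluates the combinations-with-repetition formula for a few feature-set sizes:

```python
from math import comb

def num_poly_terms(n, r):
    """Number of degree-r monomials over n features:
    combinations with repetition, C(n + r - 1, r)."""
    return comb(n + r - 1, r)

# Quadratic terms (r = 2) for 3, 100, and 2500 features.
for n in (3, 100, 2500):
    print(n, num_poly_terms(n, 2))
# -> 3 6
# -> 100 5050
# -> 2500 3126250   (close to the n^2/2 ≈ 3,125,000 estimate above)
```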
Neural networks offer an alternative way to perform machine learning when we have complex hypotheses with many features.
Neurons and the Brain
Neural Networks
Model Representation
Neuron in the brain
Neuron model: Logistic unit
bias unit: the $x_0$ term; whether it is drawn depends on the context
activation function: the non-linearity $g$ applied to each unit's weighted input
weights: the network's parameters, equivalent to the $\theta$ discussed earlier
At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical signals (called “spikes”) that are channeled to outputs (axons). In our model, the dendrites are like the input features $x_1, \dots, x_n$, and the output is the result of our hypothesis function. In this model our $x_0$ input node is sometimes called the “bias unit.” It is always equal to 1. In neural networks, we use the same logistic function as in classification, $\frac{1}{1 + e^{-\theta^T x}}$, yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our “theta” parameters are sometimes called “weights”.
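A minimal sketch of this single logistic unit, assuming NumPy (the input and weight values below are arbitrary illustrations, not course values):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) activation g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# One logistic unit: three input features plus the bias unit x0 = 1.
x = np.array([1.0, 2.0, 0.5, -1.0])      # [x0, x1, x2, x3]
theta = np.array([-1.0, 0.8, 0.3, 1.2])  # weights, including the bias weight

h = sigmoid(theta @ x)                   # hypothesis h_theta(x)
print(h)
```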
Neural Network
input layer
hidden layer
output layer
computations
The values of the “activation” nodes are obtained as follows:
$$
\begin{aligned}
a_1^{(2)} &= g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3) \\
a_2^{(2)} &= g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3) \\
a_3^{(2)} &= g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3) \\
h_\Theta(x) &= a_1^{(3)} = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)} + \Theta_{13}^{(2)} a_3^{(2)})
\end{aligned}
$$
$\Theta^{(1)}$ is the weight matrix for layer 1; here it is a 3×4 matrix (3 rows, 4 columns). Multiplying each row of weights by the input values yields the value of one activation node.
Here, $h_\Theta(x)$ is the logistic function applied to the weighted sum of the activation-node values, where the activations are multiplied by another parameter matrix $\Theta^{(2)}$ containing the weights for the second layer of nodes.
Each layer has its own weight matrix, and the dimensions of these weight matrices are defined as follows:
If a network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.
For example, in the figure below, layer 1 has 2 input nodes and layer 2 has 4 activation nodes, so $\Theta^{(1)}$ has size 4×3: the bias term $x_0$ must be included as well, which is why the mapping needs 4 rows and 3 columns.
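A quick NumPy sketch of this dimension rule for the 2-input / 4-hidden-unit example (the array contents are random placeholders, only the shapes matter):

```python
import numpy as np

s1, s2 = 2, 4                          # units in layer 1 and layer 2, excluding bias
Theta1 = np.random.randn(s2, s1 + 1)   # dimension s2 x (s1 + 1) = 4 x 3

a1 = np.array([1.0, 0.5, -0.2])        # layer-1 values with bias x0 = 1 prepended
z2 = Theta1 @ a1                       # shape (4,): one value per layer-2 unit
print(Theta1.shape, z2.shape)          # (4, 3) (4,)
```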
Forward propagation: Vectorized implementation
Element-wise
An element-wise operation acts between two tensors, operating on the corresponding elements of the two tensors.
For example, $v \odot w = s$ multiplies each input vector $v$ by a given “weight” vector $w$. In other words, each column of the data set is scaled by a multiplier. This transformation can be written as:
$$\{v_1, v_2, v_3\}^{T} \odot \{w_1, w_2, w_3\}^{T} = \{v_1 w_1,\ v_2 w_2,\ v_3 w_3\}^{T}$$
In short, elements in the same position are multiplied together: the corresponding components of the two vectors are multiplied pairwise.
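In NumPy this is simply the `*` operator between arrays of the same shape (a small illustration, not course code):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([10.0, 20.0, 30.0])

print(v * w)   # element-wise (Hadamard) product: [10. 40. 90.]
```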
Vectorized implementation
Write the expression inside the parentheses of $g(\Theta^{(1)}_{10}x_0+\Theta^{(1)}_{11}x_1+\Theta^{(1)}_{12}x_2+\Theta^{(1)}_{13}x_3)$ as $z_1^{(2)}$, where the superscript 2 refers to layer 2.
$z^{(2)}$ can be written as the weight matrix $\Theta^{(1)}$ times $x$; for a uniform notation, $x$ can also be written as the layer-1 activation vector $a^{(1)}$. The corresponding $a^{(2)}$ is then $g(z^{(2)})$, where $g$ is applied element-wise.
Add a bias unit $a_0^{(2)} = 1$ to layer 2; $z^{(3)}$ can then be written as the weight matrix $\Theta^{(2)}$ times the vector $a^{(2)}$. The final hypothesis is $h_\Theta(x) = a^{(3)} = g(z^{(3)})$.
In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable $z_k^{(j)}$ that encompasses the parameters inside our $g$ function. In our previous example, if we replaced all the parameters by the variable $z$, we would get:
$$
\begin{aligned}
a_1^{(2)} &= g(z_1^{(2)}) \\
a_2^{(2)} &= g(z_2^{(2)}) \\
a_3^{(2)} &= g(z_3^{(2)})
\end{aligned}
$$
In other words, for layer j=2 and node k, the variable z will be:
$$z_k^{(2)} = \Theta_{k,0}^{(1)} x_0 + \Theta_{k,1}^{(1)} x_1 + \cdots + \Theta_{k,n}^{(1)} x_n$$
The vector representations of $x$ and $z^{(j)}$ are:
$$
x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}
\qquad
z^{(j)} = \begin{bmatrix} z_1^{(j)} \\ z_2^{(j)} \\ \vdots \\ z_n^{(j)} \end{bmatrix}
$$
Setting $x = a^{(1)}$, we can rewrite the equation as:
$$z^{(j)} = \Theta^{(j-1)} a^{(j-1)}$$
We are multiplying our matrix $\Theta^{(j-1)}$, with dimensions $s_j \times (n+1)$ (where $s_j$ is the number of activation nodes), by our vector $a^{(j-1)}$, with height $(n+1)$. This gives us our vector $z^{(j)}$ with height $s_j$. Now we can get a vector of the activation nodes for layer $j$ as follows:
$$a^{(j)} = g(z^{(j)})$$
where our function $g$ is applied element-wise to the vector $z^{(j)}$.
We can then add a bias unit (equal to 1) to layer $j$ after we have computed $a^{(j)}$. This will be element $a_0^{(j)}$ and will be equal to 1. To compute our final hypothesis, let's first compute another $z$ vector:
$$z^{(j+1)} = \Theta^{(j)} a^{(j)}$$
We get this final $z$ vector by multiplying the next theta matrix after $\Theta^{(j-1)}$ with the values of all the activation nodes we just got. This last theta matrix $\Theta^{(j)}$ will have only one row, which is multiplied by the one column $a^{(j)}$, so that our result is a single number. We then get our final result with:
$$h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})$$
Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression. Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
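Putting the whole vectorized forward pass together, here is a minimal NumPy sketch for the 3-input / 3-hidden-unit / 1-output network above. The weight matrices are random placeholders with the correct shapes, not trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    """Vectorized forward propagation for one training example x."""
    a1 = np.concatenate(([1.0], x))            # add bias unit x0 = 1 -> a^(1)
    z2 = Theta1 @ a1                           # z^(2) = Theta^(1) a^(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a^(2) with bias a0^(2) = 1
    z3 = Theta2 @ a2                           # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                         # h_Theta(x) = a^(3)

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))   # 3 hidden units x (3 inputs + bias)
Theta2 = rng.normal(size=(1, 4))   # 1 output unit x (3 hidden units + bias)

print(forward(np.array([1.0, 0.0, 1.0]), Theta1, Theta2))
```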
Neural Network learning its own features
Note the part of the formula highlighted in blue in the lecture slide: it looks just like logistic regression, except that $\theta$ has become $\Theta$. What is distinctive about the neural network is that the inputs $a^{(2)}$ here are not fixed; they are themselves learned from the fixed inputs $x_1, x_2, x_3$.
Other network architectures
Neural networks can also be laid out in other ways; the way in which the units of a neural network are connected is called its architecture.
Applications
Examples and Intuitions I
Non-linear classification example: XOR/XNOR
XOR: exclusive OR; y = 1 when x1 and x2 differ, and y = 0 otherwise
XNOR: exclusive NOR; y = 1 when x1 and x2 are the same (the red crosses in the figure above), and y = 0 otherwise (the red circles)
XNOR is equivalent to NOT XOR
Simple example: AND
Example: OR function
Examples and Intuitions II
Negation (NOT)
Putting it together: x1 XNOR x2
By combining hidden-layer units in these more elaborate ways, a neural network can compute complex non-linear functions, as the sketch below illustrates.
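As a concrete check, the weights from the lecture for AND ($-30, 20, 20$), for (NOT x1) AND (NOT x2) ($10, -20, -20$), and for OR ($-10, 20, 20$) can be wired into a tiny two-layer network that computes x1 XNOR x2. A NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer 1: row 0 computes x1 AND x2, row 1 computes (NOT x1) AND (NOT x2).
Theta1 = np.array([[-30.0,  20.0,  20.0],
                   [ 10.0, -20.0, -20.0]])
# Layer 2: OR of the two hidden units -> x1 XNOR x2.
Theta2 = np.array([[-10.0, 20.0, 20.0]])

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = np.array([1.0, x1, x2])                        # input with bias
        a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))  # hidden with bias
        h = sigmoid(Theta2 @ a2)[0]
        print(x1, x2, round(h))   # prints 1 when x1 == x2, else 0
```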
Multiclass Classification
Multiple output units: One-vs-all
This is just like the “one-vs-all” method we discussed for logistic regression: here we essentially have four logistic regression classifiers, each of which tries to recognize one of the four classes that we want to distinguish among.
Previously the label y was written as a single number such as 1, 2, 3, or 4; now it is written as a vector such as [1, 0, 0, 0]. $x^{(i)}$ denotes the input image, and $y^{(i)}$ is the vector encoding the corresponding class: pedestrian / car / motorcycle / truck.
We want to find a way to make the neural network output values of this form.
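A small sketch of this label encoding, assuming NumPy (the class order pedestrian/car/motorcycle/truck is just for illustration):

```python
import numpy as np

classes = ["pedestrian", "car", "motorcycle", "truck"]

def one_hot(label_index, num_classes=4):
    """Encode class k as a vector with 1 in position k and 0 elsewhere."""
    y = np.zeros(num_classes)
    y[label_index] = 1.0
    return y

print(one_hot(1))   # "car" -> [0. 1. 0. 0.]
```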