Optimization: Deep Neural Network Tricks [Notes]


Slide: http://lamda.nju.edu.cn/weixs/slide/CNNTricks_slide.pdf

Article: http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html

1) data augmentation;
2) pre-processing on images;
3) initializations of networks;
4) some tips during training;
5) selections of activation functions;
6) diverse regularizations;
7) some insights found from figures; and finally
8) methods of ensembling multiple deep networks.

Sec. 1: Data Augmentation

During training the training set is limited, so data augmentation can be used to enlarge the dataset:

  • (1) Simple transformations: horizontal flipping, random crops and color jittering.
  • (2) Combinations of the simple transformations in (1), e.g., flipping plus random cropping (a minimal NumPy sketch follows this list).
  • (3) Fancy PCA, proposed by Krizhevsky et al. [1]: alter the intensities of the RGB channels in training images along their principal components.
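A minimal NumPy sketch of (1)/(2) above, combining a random horizontal flip with a random crop; the (H, W, C) image layout and the 224 crop size are assumptions for illustration:


import numpy as np

def augment(img, crop_size=224, rng=np.random):
    # Horizontal flip with probability 0.5.
    if rng.rand() < 0.5:
        img = img[:, ::-1, :]
    # Random crop of size crop_size x crop_size from an (H, W, C) image.
    h, w, _ = img.shape
    top = rng.randint(0, h - crop_size + 1)
    left = rng.randint(0, w - crop_size + 1)
    return img[top:top + crop_size, left:left + crop_size, :]
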

Sec. 2: Pre-Processing

(1) Zero-center + normalize:

Python implementation:


>>> import numpy as np
>>> X -= np.mean(X, axis=0)  # zero-center
>>> X /= np.std(X, axis=0)   # normalize


(2) PCA whitening: zero-center --> compute the covariance matrix (the correlation structure of the data) --> decorrelate the data --> whiten.

Python implementation:


>>> X -= np.mean(X, axis = 0) # zero-center
>>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix


Decorrelate the data: project the (already zero-centered) data onto the eigenbasis:


>>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
>>> Xrot = np.dot(X, U) # decorrelate the data


Whitening: divide every dimension in the eigenbasis by the corresponding eigenvalue to normalize the scale:


>>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide by the eigenvalues (which are square roots of the singular values)



Sec. 3: Initializations

(1) All Zero Initialization

Idea: with proper data normalization it is reasonable to expect that roughly half of the weights will end up positive and half negative, so all-zero weights may look like a sensible guess.

Drawback: there is no source of asymmetry between neurons; every neuron computes the same output and receives the same gradient update.

(2) Initialization with Small Random Numbers

Advantage: symmetry breaking.

Idea: the neurons are all random and unique at the beginning, so they compute distinct updates.

eg1: weights ~ 0.001 × N(0, 1), where N(0, 1) is a zero-mean, unit-standard-deviation Gaussian.

eg2: small numbers drawn from a uniform distribution.
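For example (a minimal sketch in the spirit of the note; the 0.001 scale and the D×H weight shape are illustrative):


>>> W = 0.001 * np.random.randn(D, H)  # small random numbers from a zero-mean, unit-std Gaussian
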

(3) Calibrating the Variances

Idea: normalize the variance of each neuron's output to 1 by scaling its weights by the square root of its fan-in; note that this derivation does not take ReLUs into account.

Python implementation:


>>> w = np.random.randn(n) / np.sqrt(n)  # calibrate the variance with 1/sqrt(n), n = number of inputs


(4) Current Recommendation

He et al. [4] specifically address ReLU nonlinearities and recommend initializing the weights with variance 2.0/n, where n is the number of inputs to the neuron.

Python implementation:


>>> w = np.random.randn(n) * np.sqrt(2.0/n)  # current recommendation (He et al.)


Sec. 4: During Training

  • Filters and pooling size. Input images are preferably of power-of-2 size; use small filters (e.g., 3×3) with small strides (e.g., 1) and zero-padding; a typical pooling size is 2×2.
  • Learning rate. Tune it with a validation set; in addition, as Ilya Sutskever [2] suggests, divide the gradients by the mini-batch size, so the learning rate need not be changed whenever the mini-batch size changes.
  • Fine-tune on pre-trained models. Consider two factors: the size of your new dataset and its similarity to the dataset the model was pre-trained on (a minimal fine-tuning sketch in PyTorch follows this list).
  • (1) If your data are similar to the pre-training data, simply train a linear classifier on features extracted from the top layers of the pre-trained model.
  • (2) If you also have plenty of data, you can fine-tune the top layers of the pre-trained model with a small learning rate.
  • (3) If your dataset differs a lot from the pre-training dataset but you have many training images, most of the layers should be fine-tuned on your own data with a small learning rate.
  • (4) If your dataset is small and very different from the pre-training dataset, just train a linear classifier.
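A hedged PyTorch sketch of cases (1)/(2): freeze an ImageNet pre-trained backbone and train only a new linear classifier on top. torchvision's ResNet-18 and the number of classes are assumptions, not part of the original notes:


import torch
import torch.nn as nn
from torchvision import models

num_classes = 10                                   # assumption: number of classes in your dataset
model = models.resnet18(pretrained=True)           # ImageNet pre-trained backbone
for param in model.parameters():
    param.requires_grad = False                    # freeze all pre-trained layers
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new, trainable linear classifier
# Optimize only the classifier; for cases (2)/(3), unfreeze more layers and use a small learning rate.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
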

Sec. 5: Activation Functions (non-linearities)

                                    



Sigmoid

The sigmoid non-linearity σ(x) = 1 / (1 + e^(−x)) squashes a real-valued number into the range [0, 1]: large negative numbers become 0 and large positive numbers become 1.


  1. (Cons) Sigmoids saturate and kill gradients.
  2. (Cons) Sigmoid outputs are not zero-centered.


tanh(x)

The tanh non-linearity squashes a real-valued number into the range [-1, 1].

  1. (Cons) Like the sigmoid, its activations saturate.
  2. (Pros) Its output is zero-centered.

Rectified Linear Unit

The ReLU computes the function f(x) = max(0, x), i.e., the activation is simply thresholded at zero.

  1. (Pros) It does not involve expensive operations (exponentials, etc.).
  2. (Pros) ReLUs do not suffer from saturation for positive inputs.
  3. (Pros) It greatly accelerates the convergence of stochastic gradient descent (e.g., by a factor of 6 in [1]) thanks to its linear, non-saturating form.
  4. (Cons) ReLU units can be fragile during training and can "die": a large gradient can update the weights such that the unit never activates again.



Leaky ReLU

The Leaky ReLU is one attempt to fix the "dying ReLU" problem:

f(x) = x if x ≥ 0, and f(x) = αx if x < 0, where α is a small constant (e.g., 0.01).

(Cons) The results reported for Leaky ReLUs are not always consistent.

Parametric ReLU

In PReLU, the slope α of the negative part is learned from the data rather than pre-defined [4], whereas in Leaky ReLU it is a fixed constant.


Randomized ReLU

In RReLU, α is a random variable sampled from a given range during training and then fixed at test time [5]. Its randomized nature is reported to help reduce overfitting.
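A minimal NumPy sketch of the activations discussed in this section (the 0.01 leak slope is just an illustrative constant; in PReLU it would be a learned parameter):


import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes to (0, 1); saturates for large |x|

def tanh(x):
    return np.tanh(x)                       # zero-centered, range (-1, 1), but still saturates

def relu(x):
    return np.maximum(0.0, x)               # thresholds at zero; cheap and non-saturating for x > 0

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)   # small slope alpha for x < 0 to avoid "dying" units
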

 

 


Sec. 6: Regularizations

  • L2 regularization: add (1/2)λw² to the objective for every weight w, where λ is the regularization strength. It heavily penalizes peaky weight vectors and prefers diffuse weight vectors.
  • L1 regularization: add λ|w| to the objective. It can be combined with L2 as λ1|w| + λ2w² (Elastic net regularization).
  • Max norm constraints: enforce an absolute upper bound on the magnitude of the weight vector of every neuron (‖w‖₂ < c, with c typically around 3 or 4) and use projected gradient descent to enforce the constraint. Because updates are always bounded, the network cannot "explode".
  • Dropout [6]: for each training input, sample a sub-network and only update the parameters of that sampled network.


During training, dropout [6] keeps a neuron active with some probability p (a hyper-parameter) and sets it to zero otherwise; at test time no dropout is applied. A dropout ratio of p = 0.5 is a reasonable default.
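A minimal NumPy sketch of (inverted) dropout following the train/test behaviour described above; the layer activations H and their shapes are illustrative:


import numpy as np

p = 0.5  # probability of keeping a neuron active (the hyper-parameter above)

def dropout_train(H):
    # Randomly zero out neurons; dividing by p ("inverted dropout") keeps the expected
    # activation unchanged, so nothing needs to be rescaled at test time.
    mask = (np.random.rand(*H.shape) < p) / p
    return H * mask

def dropout_test(H):
    return H  # no dropout at test time
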

Sec. 7: Insights from Figures


  • Learning rate: the shape of the loss curve reveals whether the learning rate is too low (a roughly linear decrease) or too high (a fast initial drop that then gets stuck at a poor value).
  • Loss curve: the "width" (noise) of the curve is related to the batch size; a curve that looks too wide suggests the variance between mini-batches is too large.
  • Accuracy curve: the gap between training and validation accuracy indicates the degree of overfitting.

Sec. 8: Ensemble [8]


  • Same model, different initialization. Use cross-validation to determine the best hyperparameters, then train multiple models with those hyperparameters but different random initializations (a minimal prediction-averaging sketch follows this list).
  • Top models discovered during cross-validation. Use cross-validation to determine the best hyperparameters, then pick the top few models to form the ensemble. (The risk is that the ensemble may include sub-standard models.)
  • Different checkpoints of a single model. When training is very expensive, ensemble different checkpoints of a single network taken at different points in time. (This lacks diversity, but is cheap.)
  • Some practical examples. If your task involves high-level image semantics, you can use multiple deep models trained on different datasets to extract different, complementary deep representations.
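A minimal sketch of the prediction-averaging step shared by these strategies; the `models` list and their `predict_proba` method are hypothetical placeholders, not an API from the original post:


import numpy as np

def ensemble_predict(models, x):
    # Average the per-class probabilities of all member models, then take the argmax.
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return np.argmax(probs, axis=-1)
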

Miscellaneous

Problem: class-imbalanced data. Some classes have a large number of images/training instances, while others have very few.

Method 1: balance the training data by directly up-sampling and down-sampling the imbalanced classes [10] (a minimal up-sampling sketch follows).

Method 2: use crops of the original images, a special kind of data augmentation, as in [7].

Method 3: adjust the fine-tuning strategy.
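A minimal NumPy sketch of method 1's up-sampling side: resample every class (with replacement) until it has as many instances as the largest class. The 1-D array of class ids `labels` is an assumption for illustration:


import numpy as np

def upsample_indices(labels, rng=np.random):
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()                    # match the size of the largest class
    idx = []
    for c in classes:
        members = np.where(labels == c)[0]
        # sample with replacement so every class reaches the target count
        idx.append(rng.choice(members, size=target, replace=True))
    return np.concatenate(idx)
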
