MLPClassifier 隐藏层不包括输入和输出-CFANZ编程社区

多层感知机（MLP）原理简介

多层感知机（MLP，Multilayer Perceptron）也叫人工神经网络（ANN，Artificial Neural Network），除了输入输出层，它中间可以有多个隐层，最简单的MLP只含一个隐层，即三层的结构，如下图：

MLPClassifier 隐藏层不包括输入和输出_神经网络

从上图可以看到，多层感知机层与层之间是全连接的（全连接的意思就是：上一层的任何一个神经元与下一层的所有神经元都有连接）。多层感知机最底层是输入层，中间是隐藏层，最后是输出层。

输入层没什么好说，你输入什么就是什么，比如输入是一个n维向量，就有n个神经元。

隐藏层的神经元怎么得来？首先它与输入层是全连接的，假设输入层用向量X表示，则隐藏层的输出就是

f(W1X+b1)，W1是权重（也叫连接系数），b1是偏置，函数f 可以是常用的sigmoid函数或者tanh函数：

MLPClassifier 隐藏层不包括输入和输出_机器学习_02

MLPClassifier 隐藏层不包括输入和输出_激活函数_03

最后就是输出层，输出层与隐藏层是什么关系？其实隐藏层到输出层可以看成是一个多类别的逻辑回归，也即softmax回归，所以输出层的输出就是softmax(W2X1+b2)，X1表示隐藏层的输出f(W1X+b1)。

sklearn.neural_network.MLPClassifier

class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100, ), activation=’relu’, solver=’adam’, alpha=0.0001, batch_size=’auto’, learning_rate=’constant’, learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10)[source]

Multi-layer Perceptron classifier.

This model optimizes the log-loss function using LBFGS or stochastic gradient descent.

New in version 0.18.

Parameters:	hidden_layer_sizes : tuple, length = n_layers - 2, default (100,) The ith element represents the number of neurons in the ith hidden layer. activation : {‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default ‘relu’ Activation function for the hidden layer. • ‘identity’, no-op activation, useful to implement linear bottleneck, returns f(x) = x • ‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)). • ‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x). • ‘relu’, the rectified linear unit function, returns f(x) = max(0, x) solver : {‘lbfgs’, ‘sgd’, ‘adam’}, default ‘adam’ The solver for weight optimization. • ‘lbfgs’ is an optimizer in the family of quasi-Newton methods. • ‘sgd’ refers to stochastic gradient descent. • ‘adam’ refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba Note: The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better. alpha : float, optional, default 0.0001 L2 penalty (regularization term) parameter. batch_size : int, optional, default ‘auto’ Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatch. When set to “auto”, batch_size=min(200, n_samples) learning_rate : {‘constant’, ‘invscaling’, ‘adaptive’}, default ‘constant’ Learning rate schedule for weight updates. • ‘constant’ is a constant learning rate given by ‘learning_rate_init’. • ‘invscaling’ gradually decreases the learning rate learning_rate_ at each time step ‘t’ using an inverse scaling exponent of ‘power_t’. effective_learning_rate = learning_rate_init / pow(t, power_t) • ‘adaptive’ keeps the learning rate constant to ‘learning_rate_init’ as long as training loss keeps decreasing. Each time two consecutive epochs fail to decrease training loss by at least tol, or fail to increase validation score by at least tol if ‘early_stopping’ is on, the current learning rate is divided by 5. Only used when solver='sgd'. learning_rate_init : double, optional, default 0.001 The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’. power_t : double, optional, default 0.5 The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’. Only used when solver=’sgd’.

Parameters:

hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)
The ith element represents the number of neurons in the ith hidden layer.
activation : {‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default ‘relu’
Activation function for the hidden layer.
• ‘identity’, no-op activation, useful to implement linear bottleneck, returns f(x) = x
• ‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).
• ‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x).
• ‘relu’, the rectified linear unit function, returns f(x) = max(0, x)
solver : {‘lbfgs’, ‘sgd’, ‘adam’}, default ‘adam’
The solver for weight optimization.
• ‘lbfgs’ is an optimizer in the family of quasi-Newton methods.
• ‘sgd’ refers to stochastic gradient descent.
• ‘adam’ refers to a stochastic gradient-based optimizer proposed by Kingma, Diederik, and Jimmy Ba
Note: The default solver ‘adam’ works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, ‘lbfgs’ can converge faster and perform better.
alpha : float, optional, default 0.0001
L2 penalty (regularization term) parameter.
batch_size : int, optional, default ‘auto’
Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatch. When set to “auto”, batch_size=min(200, n_samples)
learning_rate : {‘constant’, ‘invscaling’, ‘adaptive’}, default ‘constant’
Learning rate schedule for weight updates.
• ‘constant’ is a constant learning rate given by ‘learning_rate_init’.
• ‘invscaling’ gradually decreases the learning rate learning_rate_ at each time step ‘t’ using an inverse scaling exponent of ‘power_t’. effective_learning_rate = learning_rate_init / pow(t, power_t)
• ‘adaptive’ keeps the learning rate constant to ‘learning_rate_init’ as long as training loss keeps decreasing. Each time two consecutive epochs fail to decrease training loss by at least tol, or fail to increase validation score by at least tol if ‘early_stopping’ is on, the current learning rate is divided by 5.
Only used when solver='sgd'.
learning_rate_init : double, optional, default 0.001
The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.
power_t : double, optional, default 0.5
The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to ‘invscaling’. Only used when solver=’sgd’.