Data Splitting in Machine Learning - CFANZ Programming Community

Before getting to the code, we need to understand why a single dataset is split into two subsets: training data and test data.

Let us start with the training data, which is what we use to train the model. Training data is a complete set of feature variables (the independent variables) together with the target variable (the dependent variable), so that the model can learn the value of the target variable for a particular set of feature values. When working with a large dataset, we use the major portion of the data as the training set.
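
To make the two terms concrete, here is a minimal sketch of separating a feature variable from a target variable. The column names `head_size` and `brain_weight` and the three rows are made up for illustration; they are assumptions and are not taken from headbrain1.csv.

```python
import csv
import io

# Hypothetical sample rows; column names and values are illustrative only.
raw = """head_size,brain_weight
4000,1300
3500,1200
4500,1400
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Feature variable (independent): what the model is given.
x = [float(r["head_size"]) for r in rows]
# Target variable (dependent): what the model should learn to predict.
y = [float(r["brain_weight"]) for r in rows]

print(x)
print(y)
```

Each position in `x` lines up with the same position in `y`, so every feature value has exactly one known target value to learn from.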

After supplying the training data, it is time to test how much the model has learned from it, much as college students take an exam after studying a subject. We test the model by supplying only the feature variables and comparing the target values it predicts with the known ones. The test set is generally a minor portion of the whole data, typically 25% or 33% of the complete dataset.
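
The idea can be sketched without any library: fit a straight line on the training rows only, then measure how far its predictions are from the known targets of the held-out test rows. The data points and the helper names `fit_line` and `mean_abs_error` are hypothetical, and a least-squares line is just one possible model chosen for illustration.

```python
# Made-up (feature, target) pairs, already divided into two parts.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1), (6.0, 11.9)]
test = [(7.0, 14.2), (8.0, 15.8)]  # 2 of 8 rows = 25% held out


def fit_line(pairs):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(pairs, slope, intercept):
    """Average absolute gap between predicted and actual targets."""
    return sum(abs(slope * x + intercept - y) for x, y in pairs) / len(pairs)


# Learn only from the training rows.
slope, intercept = fit_line(train)

# Supply only the test features and compare predictions with the known targets.
test_error = mean_abs_error(test, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} test MAE={test_error:.3f}")
```

Because the test rows played no part in fitting, a small test error suggests the model has learned the relationship rather than memorised the training rows.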

The figure below shows the splitting of data into test and training sets:

[Figure: splitting of data into test and training sets]

Image source: http://scott.fortmann-roe.com/docs/docs/MeasuringError/holdout.png

To perform the data splitting, I will use this dataset: headbrain1.CSV

Python code (explanations are given in the comments):

# -*- coding: utf-8 -*-
"""
Created on Sun Jul 29 22:21:12 2018
@author: RaunakGoswami
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#reading the data
#headbrain1.csv must be stored in the same folder (directory)
#as this script, otherwise read_csv will not find it
data=pd.read_csv('headbrain1.csv')
#this will show the first five records of the whole data
#(print is needed when the file is run as a script)
print(data.head())
#this will create a variable x which has the feature values
#i.e. head size
x=data.iloc[:,2:3].values
#this will create a variable y which has the target value
#i.e brain weight
y=data.iloc[:,3:4].values
#splitting the data into training and test
#the statement below splits x and y into 2 parts:
#1. training variables named x_train and y_train
#2. test variables named x_test and y_test
#test_size=1/4 puts 1/4 of the rows in the test set and the
#remaining 3/4 in the training set (a 1:3 test-to-train ratio)
#note: train_test_split lives in sklearn.model_selection; the old
#sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=1/4,random_state=0)
#this will plot the scattered graph of the training set
plt.scatter(x_train,y_train,c='red')
plt.xlabel('headsize(train)')
plt.ylabel('brain weight(train)')
plt.show()
#this will plot the scattered graph of test set
plt.scatter(x_test,y_test,c='red')
plt.xlabel('headsize(test)')
plt.ylabel('brain weight(test)')
plt.show()
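
For intuition, the following is a rough sketch of what a shuffled hold-out split does and why fixing a seed (the role played by `random_state` above) makes the result repeatable. The helper `holdout_split` is hypothetical and written only for illustration; it is not scikit-learn's implementation and will not select the same rows as `train_test_split`.

```python
import math
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then cut off the test portion.

    Hypothetical helper for illustration only; not scikit-learn's code.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle every run
    n_test = math.ceil(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(20))  # stand-in for 20 records
train_rows, test_rows = holdout_split(rows, test_fraction=0.25, seed=0)
print(len(train_rows), len(test_rows))
```

Every record ends up in exactly one of the two parts, and calling the function again with the same seed returns the same split, which is what makes an experiment reproducible.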

After running this code, look at the variable explorer and you will see something like this:

[Figure: variable explorer]

As is clearly visible, out of 237 rows, 177 rows are allotted to the training variables and the remaining 60 rows to the test variables, which is roughly ¼ of the total dataset.
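
The 177/60 figures can be checked with a little arithmetic. This sketch assumes the test count is rounded up to a whole row and everything left over goes to training, which is consistent with the numbers reported above.

```python
import math

n_rows = 237       # total rows reported in the text
test_size = 1 / 4  # the test_size passed to train_test_split

# Assumption: the test count is rounded up; the rest is used for training.
n_test = math.ceil(n_rows * test_size)  # 59.25 rounds up to 60
n_train = n_rows - n_test

print(n_train, n_test)            # 177 60
print(round(n_test / n_rows, 3))  # 0.253
```

Since 237 is not divisible by 4, the test share comes out slightly above 25%, which is why the text says "roughly" ¼.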

The graph below is a scatter plot of the training set variables: