Before writing any code, we need to understand why a single data set is split into two subsets: training data and test data.
Let us start with the training data. We use the training data to train our model. It contains both the feature variables (the independent variables) and the target variable (the dependent variable), so that the model can learn how the value of the target variable depends on a particular set of feature values. When working with a large data set, we use the major portion of the data as the training set.
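To make the feature/target distinction concrete, here is a minimal sketch of separating the two from a small table. The column names `head_size` and `brain_weight` and the four rows are made up for illustration; they are not taken from the data set used later in this article.

```python
import pandas as pd

# Made-up illustrative rows; these are NOT values from headbrain1.csv,
# and the column names are hypothetical.
df = pd.DataFrame({
    "head_size": [3500, 3700, 4000, 4200],
    "brain_weight": [1200, 1280, 1350, 1400],
})

# Feature variables (independent): a 2-D array, one row per sample
features = df[["head_size"]].values

# Target variable (dependent): one value per sample
target = df["brain_weight"].values

print(features.shape, target.shape)  # (4, 1) (4,)
```

The model is shown both arrays during training; later it is given only the features and asked to produce the target.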
After training, it is time to test how much our model has learned from that data, much as students in college take an exam after studying a subject. We test the model by supplying only the feature variables and comparing the target values it predicts with the known ones. Because the model has not seen these rows during training, the result shows how well it handles new data. The test set is generally the smaller portion of the data, commonly 25% or 33% of the complete data set.
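The train-then-test idea can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (generated from a known line plus noise), and a straight-line fit with `np.polyfit` stands in for whatever model you actually train.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption, not the article's data set):
# y is roughly 2*x + 1 plus a little random noise.
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)

# Hold out the last 25 points (25%) as the test set; train on the first 75.
# The points were generated in random order, so no extra shuffle is needed here.
x_train, x_test = x[:75], x[75:]
y_train, y_test = y[:75], y[75:]

# "Training": fit a straight line using only the training data.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# "Testing": predict the target for unseen feature values and compare
# the predictions with the known targets.
y_pred = slope * x_test + intercept
mean_abs_error = np.mean(np.abs(y_pred - y_test))
print(f"mean absolute error on the test set: {mean_abs_error:.3f}")
```

A small error on the held-out points suggests the model has learned the relationship rather than memorized the training rows.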
The figure below shows the data being split into training and test sets:
Image source: http://scott.fortmann-roe.com/docs/docs/MeasuringError/holdout.png
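The holdout split shown in the figure can also be written by hand. The helper below, `holdout_split`, is a hypothetical name introduced for this sketch and not a library function: it shuffles the row indices and cuts them into two non-overlapping groups.

```python
import numpy as np

def holdout_split(n_samples, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx): shuffled, non-overlapping row indices."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # Round the test count up so a small data set still gets test rows
    n_test = int(np.ceil(test_fraction * n_samples))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = holdout_split(20, test_fraction=0.25)
print(len(train_idx), len(test_idx))  # 15 5
```

Shuffling before cutting matters when the rows are stored in some meaningful order, because otherwise the test set would not be representative of the whole.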
To demonstrate the split, I will use this data set: headbrain1.CSV
Python code (the explanation is given in the comments):
# -*- coding: utf-8 -*-
"""
Created on Sun Jul 29 22:21:12 2018

@author: RaunakGoswami
"""

import pandas as pd
import matplotlib.pyplot as plt
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split

# Read the data. headbrain1.csv must be stored in the same folder
# (directory) as this script.
data = pd.read_csv('headbrain1.csv')

# Show the first five records of the data
print(data.head())

# x holds the feature values, i.e. head size (column index 2)
x = data.iloc[:, 2:3].values

# y holds the target values, i.e. brain weight (column index 3)
y = data.iloc[:, 3:4].values

# Split x and y into two parts:
#   1. training variables named x_train and y_train
#   2. test variables named x_test and y_test
# test_size=0.25 puts 1/4 of the rows in the test set and the
# remaining 3/4 in the training set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# Scatter plot of the training set
plt.scatter(x_train, y_train, c='red')
plt.xlabel('headsize(train)')
plt.ylabel('brain weight(train)')
plt.show()

# Scatter plot of the test set
plt.scatter(x_test, y_test, c='red')
plt.xlabel('headsize(test)')
plt.ylabel('brain weight(test)')
plt.show()
After running this code, look at the variable explorer and you will see something like this:
As you can see, out of 237 rows, 177 are allotted to the training variables and the remaining 60 to the test variables, which is roughly ¼ of the total data set.
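The 177/60 split can be checked without the CSV file. The sketch below uses placeholder arrays with 237 rows, which is an assumption standing in for the real columns, and relies on `train_test_split` rounding the test count up (0.25 × 237 = 59.25, so 60 rows go to the test set).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the data set (237);
# the values themselves are arbitrary.
x = np.arange(237).reshape(-1, 1)
y = np.arange(237).reshape(-1, 1)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

print(x_train.shape, x_test.shape)  # (177, 1) (60, 1)
```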
The graph below is a scatter plot of the training set:
The graph below is a scatter plot of the test set. Notice that it contains fewer red dots than the training set plot:
That is it, guys. I hope you enjoyed today's article.
Article source: "Machine Learning Image Segmentation_Data Splitting | Machine Learning", cumubi7453's blog, CSDN