Classification Example

This example uses the Prostate dataset and H2O’s GLM algorithm to predict the likelihood of a patient being diagnosed with prostate cancer. The response column is CAPSULE, which makes this a binary classification problem.

The dataset includes the following columns:

ID: A row identifier. This can be dropped from the list of predictors.
CAPSULE: Whether the tumor penetrated the prostatic capsule
AGE: The patient’s age
RACE: The patient’s race
DPROS: The result of the digital rectal exam, where 1=no nodule; 2=unilobar nodule on the left; 3=unilobar nodule on the right; and 4=bilobar nodule
DCAPS: Whether there was capsular involvement on the rectal exam
PSA: The Prostate Specific Antigen value (mg/ml)
VOL: The tumor volume (cm3)
GLEASON: The patient’s Gleason score in the range 0 to 10

This example uses only the AGE, RACE, VOL, and GLEASON columns to make the prediction.
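
For orientation, a binomial GLM is logistic regression: it combines the predictors linearly and passes the result through the logistic function to get a probability. The sketch below is plain Python rather than H2O code. The coefficient values, the RACE level names, and the `predict_probability` helper are all made up for illustration; they are not fitted values from this dataset.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not fitted values).
INTERCEPT = -7.0
NUMERIC_COEFS = {"AGE": -0.02, "VOL": -0.01, "GLEASON": 1.2}
# A factor column such as RACE gets one coefficient per level.
RACE_COEFS = {"1": 0.0, "2": -0.4}


def predict_probability(row):
    """Return P(CAPSULE = 1) for one row under the logistic model."""
    linear = INTERCEPT + RACE_COEFS.get(str(row["RACE"]), 0.0)
    for name, coef in NUMERIC_COEFS.items():
        linear += coef * row[name]
    # The logistic (inverse logit) function maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-linear))


patient = {"AGE": 65, "RACE": 1, "VOL": 0.0, "GLEASON": 7}
p = predict_probability(patient)
print(f"P(CAPSULE = 1) = {p:.3f}")
```

Training the GLM amounts to finding the coefficients that best fit the training rows; the hand-written numbers above only stand in for that result.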


import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# start a local H2O cluster, or connect to one that is already running
h2o.init()

# import the prostate dataset
prostate = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")

# convert columns to factors
# asfactor() turns a column into the enum (categorical) type, which classification requires
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
prostate['DCAPS'] = prostate['DCAPS'].asfactor()
prostate['DPROS'] = prostate['DPROS'].asfactor()

# set the predictor columns (the inputs) and the response column (the value to predict)
predictors = ["AGE", "RACE", "VOL", "GLEASON"]
response_col = "CAPSULE"

# split into train and testing sets
train, test = prostate.split_frame(ratios = [0.8], seed = 1234)

# set GLM modeling parameters and initialize model training
# train on the training split only, so the test rows stay unseen until prediction
glm_model = H2OGeneralizedLinearEstimator(family="binomial", lambda_=0, compute_p_values=True)
glm_model.train(x=predictors, y=response_col, training_frame=train)

# save the model
model_path = h2o.save_model(model=glm_model, path="/tmp/mymodel", force=True)
# print the path where the model was saved
print("Model saved to:", model_path)

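# load the model; the saved file is named after the model ID, which changes between sessions,
# so reuse the path returned by h2o.save_model instead of hard-coding it
# if you do type the path by hand, use forward slashes (/) rather than backslashes, even on Windows
saved_model = h2o.load_model(model_path)
glm_model = saved_model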



# predict using the model and the testing dataset
predict = glm_model.predict(test)

# View a summary of the prediction
print(predict.head())
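
The prediction frame for a binomial model holds a predicted class column plus one probability column per class (p0 and p1 here). As a rough sketch of how such probabilities become labels, and how those labels can be scored against the true CAPSULE values, the plain-Python snippet below applies a cutoff and tallies a confusion matrix. The probabilities, labels, 0.5 cutoff, and the `confusion_counts` helper are all invented for illustration; H2O chooses its own threshold, so its predicted labels may differ.

```python
def confusion_counts(probabilities, actual, threshold=0.5):
    """Count true/false positives and negatives at the given cutoff."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p1, truth in zip(probabilities, actual):
        predicted = 1 if p1 >= threshold else 0
        if predicted == 1 and truth == 1:
            counts["tp"] += 1
        elif predicted == 1 and truth == 0:
            counts["fp"] += 1
        elif predicted == 0 and truth == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up p1 values and true CAPSULE labels, for illustration only.
p1_values = [0.91, 0.15, 0.62, 0.40, 0.08, 0.77]
capsule = [1, 0, 0, 1, 0, 1]

counts = confusion_counts(p1_values, capsule)
accuracy = (counts["tp"] + counts["tn"]) / len(capsule)
print(counts, f"accuracy = {accuracy:.2f}")
```

With the real model, the same comparison would use the p1 column of predict and the CAPSULE column of test; glm_model.model_performance(test) reports H2O's own metrics on the test set.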