假设检验代码篇
假设检验常见的有单样本T-检验、双样本T-检验、成对T-检验、方差分析等。详细见如下代码部分。
from scipy import stats
import pandas as pd
# 1 One-Sample T-Test
#原假设为住院女医生的血压与一般人群的血压无显著差异,即和一般人群的血压(120)差异不大,以下为血压数据:
female_doctor_bps = [128, 127, 118, 115, 144, 142, 133, 140, 132, 131,
111, 132, 149, 122, 139, 119, 136, 129, 126, 128]
d = pd.DataFrame(female_doctor_bps);
d.columns=["amt"]
d_ref=120
d_std=d.std()[0]
d_n =d.shape[0]
##d_free=d.shape[0]-1
d_se=d_std/(d_n**0.5)
d_tvalue=(d.mean()[0]-d_ref)/(d_se)
print("one-sampe T-test:\tT values is:"+str(d_tvalue))
print(stats.ttest_1samp(female_doctor_bps, 120))
## 本例p值为0.0002,远低于0.05或0.01的标准阈值,因此我们拒绝原假设,可以认为住院女医生的静息收缩压与一般人群有差异。
# 2 Two-sample T-test
female_doctor_bps = [128, 127, 118, 115, 144, 142, 133, 140, 132, 131,
111, 132, 149, 122, 139, 119, 136, 129, 126, 128]
male_consultant_bps = [118, 115, 112, 120, 124, 130, 123, 110, 120, 121,
123, 125, 129, 130, 112, 117, 119, 120, 123, 128]
d_femal=pd.DataFrame(female_doctor_bps)
d_male=pd.DataFrame(male_consultant_bps)
d_femal_mean=d_femal.mean()[0]
d_male_mean=d_male.mean()[0]
d_femal_var = d_femal.var()[0]
d_male_var = d_male.var()[0]
d_femal_n = d_femal.shape[0]
d_male_n = d_male.shape[0]
d_sp=((d_femal_n-1)*d_femal_var + (d_male_n-1)*d_male_var)/(d_femal_n+d_male_n-2)
d_t = (d_femal_mean - d_male_mean)/((d_sp*(1/d_femal_n+1/d_male_n))**0.5)
print("Two-sample T-test:\tT values is:"+str(d_t))
print(stats.ttest_ind(female_doctor_bps, male_consultant_bps))
#p值是0.0012,这比标准阈值低于0.05或0.01,所以我们拒绝零假设,我们可以说女医生和男医生的舒张压有显著差异。
#3 Paired T-Test
control = [8.0, 7.1, 6.5, 6.7, 7.2, 5.4, 4.7, 8.1, 6.3, 4.8]
treatment = [9.9, 7.9, 7.6, 6.8, 7.1, 9.9, 10.5, 9.7, 10.9, 8.2]
d_control=pd.DataFrame(control)
d_treatment=pd.DataFrame(treatment)
d_diff = d_treatment - d_control
d_mean = d_diff.mean()[0]
d_treatment_std = d_diff.std()[0]
d_treatment_n = d_treatment.shape[0]
d_t = (d_mean)/(d_treatment_std/(d_treatment_n**0.5))
print("Paired T-Test:\tT values is:"+"\t"+str(d_t))
print(stats.ttest_rel(control, treatment))
#p值为0.0055,低于0.05或0.01的标准阈值,因此我们拒绝原假设,我们可以说,由安眠药引起的睡眠时间有差异。
# 4 Analysis of Variance (ANOVA)
#ctrl = [4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14]
#trt1 = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]
#trt2 = [6.31, 5.12, 5.54, 5.5, 5.37, 5.29, 4.92, 6.15, 5.8, 5.26]
## 这里样本量是一样的,每组的样本量可以不一样。
ctrl = [4.17, 5.58, 5.18]
trt1 = [4.81, 4.17, 4.41]
trt2 = [6.31, 5.12, 5.54]
d_group = 3
d_ctr1 = pd.DataFrame(ctrl)
d_trt1 = pd.DataFrame(trt1)
d_trt2 = pd.DataFrame(trt2)
## 样本相加除以总样本数,总体均值(总共9个样本)
d_total_mean=(d_ctr1.sum()[0]+d_trt1.sum()[0]+d_trt2.sum()[0])/d_ctr1.shape[0]/d_group
##print(d_total_mean)
d_ctr1_mean=d_ctr1.mean()[0]
d_trt1_mean=d_trt1.mean()[0]
d_trt2_mean=d_trt2.mean()[0]
d_ctr1_n=d_ctr1.shape[0]
d_trt1_n=d_trt1.shape[0]
d_trt2_n=d_trt2.shape[0]
#
## 组间平方和(SSA)
d_ssa=(d_ctr1_mean-d_total_mean)**2*d_ctr1_n+ \
(d_trt1_mean-d_total_mean)**2*d_trt1_n+ \
(d_trt2_mean-d_total_mean)**2*d_trt2_n
##print("组间平方和(SSA):\t"+str(d_ssa))
## 组内平方和(SSE):
d_sse=(4.17-d_ctr1_mean)**2+(5.58-d_ctr1_mean)**2+(5.18-d_ctr1_mean)**2+\
(4.81-d_trt1_mean)**2+(4.17-d_trt1_mean)**2+(4.41-d_trt1_mean)**2+\
(6.31-d_trt2_mean)**2+(5.12-d_trt2_mean)**2+(5.54-d_trt2_mean)**2
##print("组内平方和(SSE):\t" + str(d_sse))
#总体平方和(SST):
d_sst = (4.17-d_total_mean)**2+(5.58-d_total_mean)**2+(5.18-d_total_mean)**2 +\
(4.81-d_total_mean)**2+(4.17-d_total_mean)**2+(4.41-d_total_mean)**2 +\
(6.31-d_total_mean)**2+(5.12-d_total_mean)**2+(5.54-d_total_mean)**2
##print("总体平方和(SST):\t"+str(d_sst))
#组间均方(MSA) = SSA/自由度
d_msa = d_ssa/(d_group-1)
#组内均方(MSE) = SSE/自由度
d_mse = d_sse/(d_ctr1_n+d_ctr1_n+d_ctr1_n-d_group)
#MSA又称为组间方差,MSE称为组内方差
d_f = d_msa/d_mse
print("Analysis of Variance (ANOVA) f values:\t"+str(d_f))
print(stats.f_oneway(ctrl, trt1, trt2))
# 5 chi-squared test w
from scipy.stats import chi2_contingency
from scipy.stats import chi2
table = [ [10, 20, 30],
[6, 9, 17]]
stat, p, dof, expected = chi2_contingency(table)
print('dof=%d' % dof) #degrees of freedom: (rows - 1) * (cols - 1)
##print(expected) 打印每列的期望值
# 以第一列第一行为例,算期望值
print("第一行第一列期望值:\t"+str('%.8f'%((10+6)/(10+6+20+9+30+17)*(10+20+30) )))
#[10.43478261 18.91304348 30.65217391]
#[5.56521739 10.08695652 16.34782609]
print('卡方值:\t'+str('%.10f'%(
(10-10.43478261)**2/(10.43478261)+(20-18.91304348)**2/(18.91304348)+(30-30.65217391)**2/(30.65217391)+
(6-5.56521739)**2/(5.56521739)+(9-10.08695652)**2/(10.08695652)+(17-16.34782609)**2/(16.34782609)
)))
prob = 0.95
critical = chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.8f' % (prob, critical, stat))
#这里p值大于0.05,所以接受原假设,即两样本之间没有显著差异,样本均值无差异
if abs(stat) >= critical:
print('Dependent (reject H0)')
else:
print('Independent (fail to reject H0)')
# interpret p-value
alpha = 1.0 - prob
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
print('Dependent (reject H0)')
else:
print('Independent (fail to reject H0)')
执行结果:
"F:\Python37\python.exe" E:/hypothesistest.py
one-sampe T-test: T values is:4.512403659336718
Ttest_1sampResult(statistic=4.512403659336718, pvalue=0.00023838063630967753)
Two-sample T-test: T values is:3.5143256412718564
Ttest_indResult(statistic=3.5143256412718564, pvalue=0.0011571376404026158)
Paired T-Test: T values is: 3.624485995178213
Ttest_relResult(statistic=-3.6244859951782136, pvalue=0.0055329408161001415)
Analysis of Variance (ANOVA) f values: 3.23528624933119
F_onewayResult(statistic=3.2352862493311934, pvalue=0.11137675915188745)
dof=2
第一行第一列期望值: 10.43478261
卡方值: 0.2715746509
probability=0.950, critical=5.991, stat=0.27157465
Independent (fail to reject H0)
significance=0.050, p=0.873
Independent (fail to reject H0)
Process finished with exit code 0
相关配图:
1 One-Sample T-Test
注:SE即standard error 即样本的标准误
2 Two-sample T-test
注: 适用于判断两个样本是否独立或者相关
3 Paired T-Test
注: 1 其中
是两个成对样本均值的差,s是两个样本对应相减算出的标准差。
2 适用于比较两个相关的样本,比如测试前后的变化。
4 Analysis of Variance (ANOVA)
计算过程详见代码部分。
5 chi-squared test
每个A对应的期望值T: (所在的纵向和/总和)*所在的横向和
卡方值0.27157465在0.21和0.45之间,所以p值在0.80和0.90之间。通过计算的0.873
6 refer
Comparative Statistics in Python using SciPy – Ben Alex Keen
A Gentle Introduction to the Chi-Squared Test for Machine Learning