1. Multi-factor data analysis (mnemonic: 甲方皮实献佩奇 — the first syllables of the topics below)
2. Hypothesis testing (mnemonic: 渐渐献祭)
Establish the null hypothesis H0 and its alternative H1, choose a test statistic,
fix the rejection region at a significance level (commonly 0.05), then compute the p-value and make the decision.
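A minimal sketch of those steps with scipy.stats; the equal-mean samples and the 0.05 significance level are illustrative choices, not fixed requirements:

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)
# H0: the two populations have equal means; H1: the means differ
a = rng.normal(loc=0.0, size=30)
b = rng.normal(loc=0.0, size=30)

stat, p = ss.ttest_ind(a, b)   # test statistic and its p-value
alpha = 0.05                   # significance level
reject_h0 = p < alpha          # reject H0 when p falls below alpha
```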
3. Analysis of variance (F-test) (mnemonic: sm比se了 * mm-1逼n-mm)
SST = Σ(observation − grand mean)²; SSM = Σ nᵢ(group mean − grand mean)², summed over groups and weighted by group size nᵢ;
SSE = Σ(observation − its own group mean)², so SST = SSM + SSE;
F = (SSM/(m−1)) / (SSE/(n−m)) for m groups and n observations in total.
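A worked check of the F formula on three small sample groups; the manual decomposition should agree with scipy's one-way ANOVA:

```python
import numpy as np
import scipy.stats as ss

groups = [np.array([49, 50, 39, 40, 43]),
          np.array([28, 32, 30, 26, 34]),
          np.array([38, 40, 45, 42, 48])]
all_vals = np.concatenate(groups)
grand_mean = all_vals.mean()
m, n = len(groups), all_vals.size

# SSM: between-group sum of squares, weighted by group size
ssm = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# SSE: within-group sum of squares
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
f = (ssm / (m - 1)) / (sse / (n - m))

f_scipy, p = ss.f_oneway(*groups)   # should agree with the manual F
```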
5. Pearson coefficient (mnemonic: 鞋笔表,贼逼样)
Measures the linear correlation between two data series.
Subtract each series' mean, multiply, and take the expectation (the covariance), then divide by the product of the standard deviations;
equivalently, sum the products of the two series' Z-scores and divide by the sample size.
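Both phrasings of the definition give the same number; a small sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# covariance divided by the product of the standard deviations
r_cov = ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())

# sum of Z-score products divided by the sample size
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r_z = (zx * zy).sum() / len(x)

r_np = np.corrcoef(x, y)[0, 1]   # NumPy's Pearson coefficient
```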
6. Spearman coefficient (mnemonic: 一检六查房,呵,必暗访)
Measures correlation through rank differences: ρ = 1 − 6Σd² / (n(n² − 1)), where d is the rank difference for each pair.
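The rank-difference formula, checked against scipy.stats.spearmanr on tie-free toy data:

```python
import numpy as np
import scipy.stats as ss

x = np.array([10, 20, 30, 40, 50])
y = np.array([5.0, 1.0, 4.0, 3.0, 2.0])

# d = difference between the ranks of each pair
d = ss.rankdata(x) - ss.rankdata(y)
n = len(x)
rho = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))   # no-ties formula

rho_scipy = ss.spearmanr(x, y).correlation
```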
7. Linear regression (mnemonic: 功盖西餐)
Formula: y = kx + b. Concept: two or more variables have a dependency relationship. Key metrics: the coefficient of determination (R²) and the residuals.
8. PCA (principal component analysis) (mnemonic: shuit,靠偷)
Compute the covariance matrix of the features,
compute that covariance matrix's eigenvalues and eigenvectors,
sort them and select the top K,
then project the sample points onto those eigenvectors.
Its main purpose is dimensionality reduction.
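The steps above written out with NumPy; a sketch in which the random 50×3 data is just a stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(size=(50, 3))        # 50 samples, 3 features

k = 2                                     # keep the top-2 components
centered = samples - samples.mean(axis=0)
cov = np.cov(centered, rowvar=False)      # feature covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
order = np.argsort(eig_vals)[::-1][:k]    # indices of the k largest eigenvalues
components = eig_vecs[:, order]
projected = centered @ components         # samples projected to k dimensions
```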
9. Singular value decomposition (mnemonic: 哎呦喂)
The feature matrix A (m×n) is decomposed into an m×m unitary matrix U, an m×n rectangular diagonal matrix Σ of non-negative singular values, and the transpose of an n×n unitary matrix V:
A = UΣVᵀ
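A quick check with numpy.linalg.svd; note Σ must be padded out to m×n before multiplying back:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])           # m = 2, n = 3

U, s, Vt = np.linalg.svd(A)               # s holds the singular values
Sigma = np.zeros(A.shape)                 # m x n rectangular diagonal
Sigma[:len(s), :len(s)] = np.diag(s)
A_rebuilt = U @ Sigma @ Vt                # A = U Sigma V^T
```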
12. Steps of a linear regression analysis (mnemonic: 抱恋欲系)
1. Import the LinearRegression class and instantiate it
2. Train with fit
3. Predict with predict
4. Read the coefficients from coef_ and the intercept from intercept_
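The four steps on noise-free data, where the fit should recover the slope 3 and intercept 4 exactly; the variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10, dtype=float).reshape(10, 1)
y = 3 * x + 4                     # exact line: slope 3, intercept 4

lr = LinearRegression()           # step 1: instantiate
lr.fit(x, y)                      # step 2: train
y_pred = lr.predict(x)            # step 3: predict
k = lr.coef_[0][0]                # step 4: slope from coef_
b = lr.intercept_[0]              # ...and intercept from intercept_
```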
13. PCA dimensionality reduction in Python (mnemonic: 保卫球子)
1. Import the PCA class and instantiate it
2. Set the target dimensionality: PCA(n_components=1)
3. Component importance: .explained_variance_ratio_
4. Transformed values: .fit_transform(data)
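The same four steps in sklearn, using a small 10×2 point set; the exact variance ratio depends on the data:

```python
import numpy as np
from sklearn.decomposition import PCA

points = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                   [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
                   [1.5, 1.6], [1.1, 0.9]])

pca = PCA(n_components=1)              # step 2: target dimensionality
reduced = pca.fit_transform(points)    # step 4: transformed values
ratio = pca.explained_variance_ratio_  # step 3: importance per component
```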
14. Composite analysis (mnemonic: 差分印象)
1. Cross analysis
2. Factor analysis
3. Grouping and drill-down
4. Correlation analysis
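Cross analysis, for instance, slices one metric across two dimensions at once. A minimal pivot-table sketch on a hypothetical attrition table (the column names and values are made up):

```python
import pandas as pd

# hypothetical attrition data: salary level, accident flag, left the company?
toy = pd.DataFrame({
    "salary":   ["low", "low", "high", "high", "low", "high"],
    "accident": [0, 1, 0, 1, 0, 0],
    "left":     [1, 0, 0, 0, 1, 1],
})

# mean attrition rate for every salary x accident combination
piv = pd.pivot_table(toy, values="left", index="salary",
                     columns="accident", aggfunc="mean")
```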
Feature engineering
# Multi-factor analysis
import numpy as np
import scipy.stats as ss

norm_list = ss.norm.rvs(size=20)
ss_test = ss.normaltest(norm_list)      # normality test
# print(ss_test)

# Chi-squared test of independence
ss_chi = ss.chi2_contingency([[15, 95], [85, 5]])
# print(ss_chi)

# Independent two-sample t-tests
ss_t = ss.ttest_ind(ss.norm.rvs(size=10), ss.norm.rvs(size=20))
# print(ss_t)
ss_tt2 = ss.ttest_ind(ss.norm.rvs(size=100), ss.norm.rvs(size=200))
# print(ss_tt2)

# One-way ANOVA
ss_one = ss.f_oneway([49, 50, 39, 40, 43], [28, 32, 30, 26, 34], [38, 40, 45, 42, 48])
# print(ss_one)

# QQ plot: for normal data the scatter hugs the diagonal
from statsmodels.graphics.api import qqplot
from matplotlib import pyplot as plt
# qqplot(ss.norm.rvs(size=100))
# plt.show()

import pandas as pd
s1 = pd.Series([0.1, 0.2, 1.1, 2.4, 1.3, 0.3, 0.5])
s2 = pd.Series([0.5, 0.4, 1.2, 2.5, 1.1, 0.7, 0.1])
key1 = s1.corr(s2, method="spearman")
# print(key1)
df = pd.DataFrame(np.array([s1, s2]).T)
df_key = df.corr(method="spearman")
# print(df_key)

# Regression example
# x = np.arange(10, dtype=float).reshape((10, 1))
# y = x * 3 + 4 + np.random.random((10, 1))
# from sklearn.linear_model import LinearRegression as LR
# lr = LR()
# data = lr.fit(x, y)
# predict_y = lr.predict(x)
# print(predict_y)
# print(data.intercept_, data.coef_)

# PCA transform
data = np.array([[2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1],
                 [2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9]]).T
# print(data)
from sklearn.decomposition import PCA
lower_dim = PCA(n_components=1)
fit_pca = lower_dim.fit(data)
# print(lower_dim.explained_variance_ratio_)
# print(lower_dim.fit_transform(data))

# Hand-rolled PCA via eigendecomposition of the covariance matrix
# def myPCA(data, n_components=1):
#     mean_vals = np.mean(data, axis=0)
#     mid = data - mean_vals                                   # center the data
#     cov_mat = np.cov(mid, rowvar=False)
#     from scipy import linalg
#     eig_vals, eig_vects = linalg.eig(cov_mat)
#     eig_val_index = np.argsort(eig_vals)
#     eig_val_index = eig_val_index[:-(n_components + 1):-1]   # largest n_components
#     eig_vects = eig_vects[:, eig_val_index]
#     return mid.dot(eig_vects), eig_vals                      # projected data, eigenvalues

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context(font_scale=3)

#######################################################
# 1. Attrition rate per department via independent t-tests
df = pd.read_csv("./data/HR.csv")
# arange = df["salary"].value_counts()
# print(arange)
# Independent t-test approach
# dp_indices = df.groupby(by="department").indices   # row indices per department
# print(dp_indices)
# sales_values = df["left"].iloc[dp_indices["sales"]].values
# technical_values = df["left"].iloc[dp_indices["technical"]].values
# print(ss.ttest_ind(sales_values, technical_values)[1])
# dp_keys = list(dp_indices.keys())
# dp_t_mat = np.zeros([len(dp_keys), len(dp_keys)])
# for i in range(len(dp_keys)):
#     for j in range(len(dp_keys)):
#         p_value = ss.ttest_ind(df["left"].iloc[dp_indices[dp_keys[i]]].values,
#                                df["left"].iloc[dp_indices[dp_keys[j]]].values)[1]
#         if p_value < 0.05:
#             dp_t_mat[i][j] = -1       # mark significantly different pairs
#         else:
#             dp_t_mat[i][j] = p_value
# sns.heatmap(dp_t_mat, xticklabels=dp_keys, yticklabels=dp_keys)
# plt.show()

# Pivot-table cross analysis
# piv_tb = pd.pivot_table(df, values="left", index=["promotion_last_5years", "salary"],
#                         columns=["Work_accident"], aggfunc=np.mean)
# print(piv_tb)
# sns.heatmap(piv_tb, vmax=1, vmin=0, cmap=sns.color_palette("Reds", n_colors=256))
# plt.show()

# Grouping and drill-down
# sns.barplot(x="salary", y="left", hue="department", data=df)
# plt.show()
# s1_s = df["satisfaction_level"]
# sns.barplot(x=list(range(len(s1_s))), y=s1_s.sort_values())
# plt.show()

# Correlation analysis measures how strongly two columns co-vary
# sns.heatmap(df.corr(), vmin=-1, vmax=1, cmap=sns.color_palette("RdBu", n_colors=128))
# plt.show()