因果推断dowhy之-401(k)资格对净金融资产的影响-CFANZ编程社区

0x01. 问题背景

在本案例研究中，我们将使用来自401(k)分析的真实数据来解释如何使用因果库来估计平均治疗效果(ATE)和条件ATE (CATE)。

本案例数据来自真实数据。在20世纪80年代初，美国政府为雇员推出了几种税收递延储蓄选择，以增加个人退休储蓄。一个受欢迎的选择是401(k)计划，该计划允许员工将工资的一部分存入个人账户。考虑到由于个人特征(特别是收入)造成的差异性，这里的目标是了解401(k)资格对净金融资产(即401(k)余额和非401(k)资产的总和)的影响。

由于401(k)计划是由雇主提供的，因此只有提供这些计划的公司的员工才有资格参加。因此，我们正在处理一项非随机研究。有几个因素(如教育程度、储蓄偏好)会影响401(k)计划的资格以及净金融资产。

0x02. 数据

我们考虑的样本来自1991年的收入和项目参与调查。样本由家庭组成，其中参考个人年龄在25-64岁之间，至少有一人受雇，但没有人是自雇的。样本中有9915户的记录。对于每个家庭，记录了44个变量，包括家庭参考人参加401(k)计划的资格(治疗)、净金融资产(结果)和其他协变量，如年龄、收入、家庭规模、教育程度、婚姻状况等。我们特别考虑了16个协变量。部分变量总结如下：

Variable Name	Type	Details
e401	Treatment	eligibility for the 401(k) plan
net_tfa	Outcome	net financial assets (in USD)
age	Covariate	Age
inc	Covariate	income (in USD)
fsize	Covariate	family size
educ	Covariate	education (in years)
male	Covariate	is a male?
db	Covariate	defined benefit pension
marr	Covariate	married?
twoearn	Covariate	two earners
pira	Covariate	participation in IRA
hown	Covariate	home owner?
hval	Covariate	home value (in USD)
hequity	Covariate	home equity (in USD)
hmort	Covariate	home mortgage (in USD)
nohs	Covariate	no high-school? (one-hot encoded)
hs	Covariate	high-school? (one-hot encoded)
smcol	Covariate	some-college? (one-hot encoded)

该数据集可从’ hdm https://rdrr.io/cran/hdm/man/pension.html ’ __ R包中在线公开获得。为了更方便的做实验，经过一系列的实验，将数据下载至本地进行实验。具体如何搞到的数据，参考整理的博客：hdm数据R语言获取教程

0x03. 实验

0x03_1. 读取数据

import pandas as pd
df = pd.read_csv("data/pension.csv")
df.head()

数据结果如下：

	ira	a401	hval	hmort	hequity	nifa	net_nifa	tfa	net_tfa	tfa_he	tw	age	inc	fsize	educ	db	marr	male	twoearn	dum91	e401	p401	pira	nohs	hs	smcol	col	icat	ecat	zhat	net_n401	hown	i1	i2	i3	i4	i5	i6	i7	a1	a2	a3	a4	a5
0	0	0	69000	60150	8850	100	-3300	100	-3300	5550	53550	31	28146	5	12	0	1	0	0	1	0	0	0	0	1	0	0	3	2	0.273178	-3300	1	0	0	1	0	0	0	0	0	1	0	0	0
1	0	0	78000	20000	58000	61010	61010	61010	61010	119010	124635	52	32634	5	16	0	0	0	0	1	0	0	0	0	0	0	1	4	4	0.386641	61010	1	0	0	0	1	0	0	0	0	0	0	1	0
2	1800	0	200000	15900	184100	7549	7049	9349	8849	192949	192949	50	52206	3	11	0	1	1	1	1	0	0	1	1	0	0	0	6	1	0.533650	8849	1	0	0	0	0	0	1	0	0	0	0	1	0
3	0	0	0	0	0	2487	-6013	2487	-6013	-6013	-513	28	45252	4	15	0	1	0	1	1	0	0	0	0	0	1	0	5	3	0.324319	-6013	0	0	0	0	0	1	0	0	1	0	0	0	0
4	0	0	300000	90000	210000	10625	-2375	10625	-2375	207625	212087	42	33126	3	12	1	0	0	0	1	0	0	0	0	1	0	0	4	2	0.602807	-2375	1	0	0	0	1	0	0	0	0	0	1	0	0

0x03_2. 构建因果图

使用e401作为Treatment，net_tfa作为out_come，其他协变量作为confounder，构建因果图代码如下：

import networkx as nx
import dowhy.gcm as gcm

treatment_var = "e401"
outcome_var = "net_tfa"
covariates = ["age","inc","fsize","educ","male","db",
              "marr","twoearn","pira","hown","hval",
              "hequity","hmort","nohs","hs","smcol"]

edges = [(treatment_var, outcome_var)]
edges.extend([(covariate, treatment_var) for covariate in covariates])
edges.extend([(covariate, outcome_var) for covariate in covariates])

causal_graph = nx.DiGraph(edges)
gcm.util.plot(causal_graph, figure_size=[20, 20])

效果图如下：

因果推断dowhy之-401(k)资格对净金融资产的影响_开发语言

0x03_3. 数据分析

在我们为变量分配因果模型之前，让我们绘制它们的直方图，以了解变量的分布。

import matplotlib.pyplot as plt

cols = [treatment_var, outcome_var]
cols.extend(covariates)
plt.figure(figsize=(10,5))
for i, col in enumerate(cols):
    plt.subplot(3,6,i+1)
    plt.grid(False)
    plt.hist(df[col])
    plt.xlabel(col)
plt.tight_layout()
plt.show()

绘制结果如下

因果推断dowhy之-401(k)资格对净金融资产的影响_拟合_02

我们观察到，实值变量不遵循众所周知的参数分布，如高斯分布。因此，当这些变量没有父变量时，我们拟合经验分布，这也适用于分类变量。

0x03_4. 数据增加噪声增强鲁棒性

让我们将因果模型分配给变量。对于treatment变量，我们分配了一个带有随机森林分类器的分类器功能因果模型(FCM)。对于outcome变量，我们分配了一个加性噪声模型，其中随机森林回归作为函数和噪声的经验分布。我们将经验分布分配给其他变量，因为它们在因果图中没有父节点。

causal_model = gcm.StructuralCausalModel(causal_graph)
causal_model.set_causal_mechanism(treatment_var, gcm.ClassifierFCM(gcm.ml.create_random_forest_classifier()))
causal_model.set_causal_mechanism(outcome_var, gcm.AdditiveNoiseModel(gcm.ml.create_random_forest_regressor()))
for covariate in covariates:
    causal_model.set_causal_mechanism(covariate, gcm.EmpiricalDistribution())

为了适合分类器FCM，我们将处理列转换为字符串类型。

df = df.astype({treatment_var: str})

0x03_5. 从数据中拟合因果模型

gcm.fit(causal_model, df)

输出如下：

Fitting causal mechanism of node smcol: 100%|██████████| 18/18 [00:06<00:00,  2.68it/s]

在计算CATE之前，我们首先将家庭划分为收入百分位数的等宽箱（equi-width bins）。这使我们能够研究对不同收入群体的影响。

import numpy as np

percentages = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
bin_edges = [0]
bin_edges.extend(np.quantile(df.inc, percentages[1:]).tolist())
bin_edges[-1] += 1 # adding 1 to the last edge as last edge is excluded by np.digitize

groups = [f'{percentages[i]*100:.0f}%-{percentages[i+1]*100:.0f}%' for i in range(len(percentages)-1)]
group_index_to_group_label = dict(zip(range(1, len(bin_edges)+1), groups))

现在我们可以计算CATE了。为此，我们对拟合因果图中的治疗变量进行随机干预，从介入分布中抽取样本，按收入组分组观察，然后计算每个组的治疗效果。

np.random.seed(47)

def estimate_cate():
    samples = gcm.interventional_samples(causal_model,
                                         {treatment_var: lambda x: np.random.choice(['0', '1'])},
                                         observed_data=df)
    eligible = samples[treatment_var] == '1'
    ate = samples[eligible][outcome_var].mean() - samples[~eligible][outcome_var].mean()
    result = dict(ate = ate)

    group_indices = np.digitize(samples['inc'], bin_edges)
    samples['group_index'] = group_indices

    for group_index in group_index_to_group_label:
        group_samples = samples[samples['group_index'] == group_index]
        eligible_in_group = group_samples[treatment_var] == '1'
        cate = group_samples[eligible_in_group][outcome_var].mean() - group_samples[~eligible_in_group][outcome_var].mean()
        result[group_index_to_group_label[group_index]] = cate

    return result

group_to_median, group_to_ci = gcm.confidence_intervals(estimate_cate, num_bootstrap_resamples=100)
print(group_to_median)
print(group_to_ci)

输出结果如下：

{'ate': 6519.046476486404, '0%-20%': 3985.972442541254, '20%-40%': 3109.9999288096888, '40%-60%': 5731.625707624532, '60%-80%': 7605.467796966453, '80%-100%': 11995.55917989574}
{'ate': array([4982.99412698, 8339.97497725]), '0%-20%': array([2630.16909916, 5676.94495668]), '20%-40%': array([1252.7312225 , 5215.15452742]), '40%-60%': array([3533.43542901, 8243.86661569]), '60%-80%': array([ 4726.56666574, 10603.23313684]), '80%-100%': array([ 4981.36999637, 19280.14639468])}

如置信区间所示[4982.99, 8339.97]，401(k)资格对净金融资产的平均处理效果为正。现在，让我们画出不同收入群体的CATEs，以便清楚地了解情况。

fig = plt.figure(figsize=(8,4))
for x, group in enumerate(groups):
    ci = group_to_ci[group]
    plt.plot((x, x), (ci[0], ci[1]), 'ro-', color='orange')
ax = fig.axes[0]
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xticks(range(len(groups)), groups)
plt.xlabel('Income group')
plt.ylabel('ATE of 401(k) eligibility on net financial assets')
plt.show()

效果如下：

因果推断dowhy之-401(k)资格对净金融资产的影响_r语言_03