0
点赞
收藏
分享

微信扫一扫

因果推断dowhy之-401(k)资格对净金融资产的影响


0x01. 问题背景

在本案例研究中,我们将使用来自401(k)分析的真实数据来解释如何使用因果库来估计平均治疗效果(ATE)和条件ATE (CATE)。

本案例数据来自真实数据。在20世纪80年代初,美国政府为雇员推出了几种税收递延储蓄选择,以增加个人退休储蓄。一个受欢迎的选择是401(k)计划,该计划允许员工将工资的一部分存入个人账户。考虑到由于个人特征(特别是收入)造成的差异性,这里的目标是了解401(k)资格对净金融资产(即401(k)余额和非401(k)资产的总和)的影响。

由于401(k)计划是由雇主提供的,因此只有提供这些计划的公司的员工才有资格参加。因此,我们正在处理一项非随机研究。有几个因素(如教育程度、储蓄偏好)会影响401(k)计划的资格以及净金融资产。

0x02. 数据

我们考虑的样本来自1991年的收入和项目参与调查。样本由家庭组成,其中参考个人年龄在25-64岁之间,至少有一人受雇,但没有人是自雇的。样本中有9915户的记录。对于每个家庭,记录了44个变量,包括家庭参考人参加401(k)计划的资格(治疗)、净金融资产(结果)和其他协变量,如年龄、收入、家庭规模、教育程度、婚姻状况等。我们特别考虑了16个协变量。部分变量总结如下:

Variable Name

Type

Details

e401

Treatment

eligibility for the 401(k) plan

net_tfa

Outcome

net financial assets (in USD)

age

Covariate

Age

inc

Covariate

income (in USD)

fsize

Covariate

family size

educ

Covariate

education (in years)

male

Covariate

is a male?

db

Covariate

defined benefit pension

marr

Covariate

married?

twoearn

Covariate

two earners

pira

Covariate

participation in IRA

hown

Covariate

home owner?

hval

Covariate

home value (in USD)

hequity

Covariate

home equity (in USD)

hmort

Covariate

home mortgage (in USD)

nohs

Covariate

no high-school? (one-hot encoded)

hs

Covariate

high-school? (one-hot encoded)

smcol

Covariate

some-college? (one-hot encoded)

该数据集可从’ hdm https://rdrr.io/cran/hdm/man/pension.html ’ __ R包中在线公开获得。为了更方便的做实验,经过一系列的实验,将数据下载至本地进行实验。具体如何搞到的数据,参考整理的博客:hdm数据R语言获取教程

0x03. 实验

0x03_1. 读取数据

import pandas as pd
df = pd.read_csv("data/pension.csv")
df.head()

数据结果如下:

ira

a401

hval

hmort

hequity

nifa

net_nifa

tfa

net_tfa

tfa_he

tw

age

inc

fsize

educ

db

marr

male

twoearn

dum91

e401

p401

pira

nohs

hs

smcol

col

icat

ecat

zhat

net_n401

hown

i1

i2

i3

i4

i5

i6

i7

a1

a2

a3

a4

a5

0

0

0

69000

60150

8850

100

-3300

100

-3300

5550

53550

31

28146

5

12

0

1

0

0

1

0

0

0

0

1

0

0

3

2

0.273178

-3300

1

0

0

1

0

0

0

0

0

1

0

0

0

1

0

0

78000

20000

58000

61010

61010

61010

61010

119010

124635

52

32634

5

16

0

0

0

0

1

0

0

0

0

0

0

1

4

4

0.386641

61010

1

0

0

0

1

0

0

0

0

0

0

1

0

2

1800

0

200000

15900

184100

7549

7049

9349

8849

192949

192949

50

52206

3

11

0

1

1

1

1

0

0

1

1

0

0

0

6

1

0.533650

8849

1

0

0

0

0

0

1

0

0

0

0

1

0

3

0

0

0

0

0

2487

-6013

2487

-6013

-6013

-513

28

45252

4

15

0

1

0

1

1

0

0

0

0

0

1

0

5

3

0.324319

-6013

0

0

0

0

0

1

0

0

1

0

0

0

0

4

0

0

300000

90000

210000

10625

-2375

10625

-2375

207625

212087

42

33126

3

12

1

0

0

0

1

0

0

0

0

1

0

0

4

2

0.602807

-2375

1

0

0

0

1

0

0

0

0

0

1

0

0

0x03_2. 构建因果图

使用e401作为Treatment,net_tfa作为out_come,其他协变量作为confounder,构建因果图代码如下:

import networkx as nx
import dowhy.gcm as gcm

treatment_var = "e401"
outcome_var = "net_tfa"
covariates = ["age","inc","fsize","educ","male","db",
              "marr","twoearn","pira","hown","hval",
              "hequity","hmort","nohs","hs","smcol"]

edges = [(treatment_var, outcome_var)]
edges.extend([(covariate, treatment_var) for covariate in covariates])
edges.extend([(covariate, outcome_var) for covariate in covariates])

causal_graph = nx.DiGraph(edges)
gcm.util.plot(causal_graph, figure_size=[20, 20])

效果图如下:

因果推断dowhy之-401(k)资格对净金融资产的影响_开发语言

0x03_3. 数据分析

在我们为变量分配因果模型之前,让我们绘制它们的直方图,以了解变量的分布。

import matplotlib.pyplot as plt

cols = [treatment_var, outcome_var]
cols.extend(covariates)
plt.figure(figsize=(10,5))
for i, col in enumerate(cols):
    plt.subplot(3,6,i+1)
    plt.grid(False)
    plt.hist(df[col])
    plt.xlabel(col)
plt.tight_layout()
plt.show()

绘制结果如下

因果推断dowhy之-401(k)资格对净金融资产的影响_拟合_02


我们观察到,实值变量不遵循众所周知的参数分布,如高斯分布。因此,当这些变量没有父变量时,我们拟合经验分布,这也适用于分类变量。

0x03_4. 数据增加噪声增强鲁棒性

让我们将因果模型分配给变量。对于treatment变量,我们分配了一个带有随机森林分类器的分类器功能因果模型(FCM)。对于outcome变量,我们分配了一个加性噪声模型,其中随机森林回归作为函数和噪声的经验分布。我们将经验分布分配给其他变量,因为它们在因果图中没有父节点。

causal_model = gcm.StructuralCausalModel(causal_graph)
causal_model.set_causal_mechanism(treatment_var, gcm.ClassifierFCM(gcm.ml.create_random_forest_classifier()))
causal_model.set_causal_mechanism(outcome_var, gcm.AdditiveNoiseModel(gcm.ml.create_random_forest_regressor()))
for covariate in covariates:
    causal_model.set_causal_mechanism(covariate, gcm.EmpiricalDistribution())

为了适合分类器FCM,我们将处理列转换为字符串类型。

df = df.astype({treatment_var: str})

0x03_5. 从数据中拟合因果模型

gcm.fit(causal_model, df)

输出如下:

Fitting causal mechanism of node smcol: 100%|██████████| 18/18 [00:06<00:00,  2.68it/s]

在计算CATE之前,我们首先将家庭划分为收入百分位数的等宽箱(equi-width bins)。这使我们能够研究对不同收入群体的影响。

import numpy as np

percentages = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
bin_edges = [0]
bin_edges.extend(np.quantile(df.inc, percentages[1:]).tolist())
bin_edges[-1] += 1 # adding 1 to the last edge as last edge is excluded by np.digitize

groups = [f'{percentages[i]*100:.0f}%-{percentages[i+1]*100:.0f}%' for i in range(len(percentages)-1)]
group_index_to_group_label = dict(zip(range(1, len(bin_edges)+1), groups))

现在我们可以计算CATE了。为此,我们对拟合因果图中的治疗变量进行随机干预,从介入分布中抽取样本,按收入组分组观察,然后计算每个组的治疗效果。

np.random.seed(47)

def estimate_cate():
    samples = gcm.interventional_samples(causal_model,
                                         {treatment_var: lambda x: np.random.choice(['0', '1'])},
                                         observed_data=df)
    eligible = samples[treatment_var] == '1'
    ate = samples[eligible][outcome_var].mean() - samples[~eligible][outcome_var].mean()
    result = dict(ate = ate)

    group_indices = np.digitize(samples['inc'], bin_edges)
    samples['group_index'] = group_indices

    for group_index in group_index_to_group_label:
        group_samples = samples[samples['group_index'] == group_index]
        eligible_in_group = group_samples[treatment_var] == '1'
        cate = group_samples[eligible_in_group][outcome_var].mean() - group_samples[~eligible_in_group][outcome_var].mean()
        result[group_index_to_group_label[group_index]] = cate

    return result

group_to_median, group_to_ci = gcm.confidence_intervals(estimate_cate, num_bootstrap_resamples=100)
print(group_to_median)
print(group_to_ci)

输出结果如下:

{'ate': 6519.046476486404, '0%-20%': 3985.972442541254, '20%-40%': 3109.9999288096888, '40%-60%': 5731.625707624532, '60%-80%': 7605.467796966453, '80%-100%': 11995.55917989574}
{'ate': array([4982.99412698, 8339.97497725]), '0%-20%': array([2630.16909916, 5676.94495668]), '20%-40%': array([1252.7312225 , 5215.15452742]), '40%-60%': array([3533.43542901, 8243.86661569]), '60%-80%': array([ 4726.56666574, 10603.23313684]), '80%-100%': array([ 4981.36999637, 19280.14639468])}

如置信区间所示[4982.99, 8339.97],401(k)资格对净金融资产的平均处理效果为正。 现在,让我们画出不同收入群体的CATEs,以便清楚地了解情况。

fig = plt.figure(figsize=(8,4))
for x, group in enumerate(groups):
    ci = group_to_ci[group]
    plt.plot((x, x), (ci[0], ci[1]), 'ro-', color='orange')
ax = fig.axes[0]
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xticks(range(len(groups)), groups)
plt.xlabel('Income group')
plt.ylabel('ATE of 401(k) eligibility on net financial assets')
plt.show()

效果如下:

因果推断dowhy之-401(k)资格对净金融资产的影响_r语言_03


当一个人从低收入群体向高收入群体移动时,这种影响就会增加。这一结果似乎与不同收入群体的资源约束相一致。


举报

相关推荐

0 条评论