Double Machine Learning (DML): Principles and Applications
- 1. Why do we need DML?
- 2. How DML works
- 2.1 Notation
- 2.2 The DML training procedure
- 2.3 Why does residual orthogonalization yield an unbiased causal effect?
- 2.4 Estimating the ATE with DML
- 2.5 Estimating the CATE with DML
- 2.6 Directly predicting the counterfactual Y
- 3. EconML DML in Practice
2.6 Directly predicting the counterfactual Y
When the CATE is nonlinear, an alternative is to stop estimating a linear approximation of the CATE and instead predict the counterfactual outcomes directly: model Y as a function of the treatment and covariates, then compare predictions under different treatment values. (This approach is widely used in practice, but it has no rigorous theoretical guarantee!)
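A minimal sketch of this idea (an S-learner-style counterfactual predictor; the learner choice and the name `te_naive` are illustrative assumptions, with `Y`, `T`, `X`, `W` as in the data-generation code of Section 3 below):

# Counterfactual prediction with a single outcome model (S-learner style):
# treat T as just another feature, then difference the predictions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()       # illustrative learner choice
model.fit(np.column_stack([T, X, W]), Y)  # Y, T, X, W as in Section 3

# Predict Y under two hypothetical treatment levels and take the difference
y_t0 = model.predict(np.column_stack([np.zeros_like(T), X, W]))  # T = 0
y_t1 = model.predict(np.column_stack([np.ones_like(T), X, W]))   # T = 1
te_naive = y_t1 - y_t0

Note that this is exactly the approach the caveat above is about: the prediction differences are not orthogonalized, so confounding bias in the fitted model flows straight into `te_naive`.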
3. EconML DML in Practice
- Official EconML usage example: the Double Machine Learning Notebook
- The notebook contains many small examples:
- Example Usage with Single Continuous Treatment Synthetic Data and Model Selection
- Example Usage with Single Binary Treatment Synthetic Data and Confidence Intervals
- Example Usage with Multiple Continuous Treatment Synthetic Data
- Example Usage with Single Continuous Treatment Observational Data
- Example Usage with Multiple Continuous Treatment, Multiple Outcome Observational Data
- Together these cover: continuous-valued T, binary 0/1 T, multiple treatments T, and multiple outcomes Y (a binary-treatment configuration is sketched after the method list below)
# Continuous-valued T
array([ 0.41083324,  0.57893171,  1.08130798, ..., -0.43799495,
        1.61941775,  1.64209826])
# Binary 0/1 T
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0])
# Multi-T
array([[1.35325451, 1.15373159, 0.46373402],
       [1.35325451, 1.15373159, 0.98954119],
       [1.35325451, 0.87129337, 0.73716407],
       ...,
       [1.13462273, 1.03673688, 0.46373402],
       [1.0260416 , 0.95165788, 0.39877612],
       [1.05779029, 0.78390154, 0.55961579]])
# Multi-Y
array([[ 9.01869549,  8.40737833,  9.26482856],
       [ 8.72323127,  8.44934252,  8.98719682],
       [ 8.25322765,  9.91145572,  8.83171192],
       ...,
       [ 9.93653541,  9.51546936,  9.50599061],
       [10.85591733,  9.31793838, 10.9273763 ],
       [10.23652533, 11.20553036,  8.85936345]])
- Methods include:
- LinearDML: the default
- SparseLinearDML: Polynomial Features for Heterogeneity
- DML: Polynomial Features with Regularization
- CausalForestDML: Non-Parametric Heterogeneity with a Causal Forest
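For the binary 0/1 treatment case mentioned above, the same estimators apply, but `model_t` becomes a classifier and the treatment must be flagged as discrete. A minimal sketch based on the EconML API (`est_bin` is a hypothetical name; the configuration is an assumption, not taken verbatim from the notebook):

# Binary treatment: use a classifier for model_t and set
# discrete_treatment=True so residualization uses propensity scores.
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

est_bin = LinearDML(model_y=RandomForestRegressor(),
                    model_t=RandomForestClassifier(),
                    discrete_treatment=True,
                    random_state=123)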
- Data generation
import econml
## Ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Main imports
from econml.dml import DML, LinearDML, SparseLinearDML, CausalForestDML
# Helper imports
import numpy as np
from itertools import product
from sklearn.linear_model import (Lasso, LassoCV, LogisticRegression,
                                  LogisticRegressionCV, LinearRegression,
                                  MultiTaskElasticNet, MultiTaskElasticNetCV)
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split
%matplotlib inline
# Treatment effect function: the true CATE is exp(2*x)
def exp_te(x):
    return np.exp(2 * x[0])
# DGP constants
np.random.seed(123)
n = 2000
n_w = 30
support_size = 5
n_x = 1
# Outcome support
support_Y = np.random.choice(np.arange(n_w), size=support_size, replace=False)
coefs_Y = np.random.uniform(0, 1, size=support_size)
epsilon_sample = lambda n: np.random.uniform(-1, 1, size=n)
# Treatment support: the same columns of W drive T, so W confounds T and Y
support_T = support_Y
coefs_T = np.random.uniform(0, 1, size=support_size)
eta_sample = lambda n: np.random.uniform(-1, 1, size=n)
# Generate controls, covariates, treatments and outcomes
W = np.random.normal(0, 1, size=(n, n_w))
X = np.random.uniform(0, 1, size=(n, n_x))
# Heterogeneous treatment effects
TE = np.array([exp_te(x_i) for x_i in X])
T = np.dot(W[:, support_T], coefs_T) + eta_sample(n)
Y = TE * T + np.dot(W[:, support_Y], coefs_Y) + epsilon_sample(n)
Y_train, Y_val, T_train, T_val, X_train, X_val, W_train, W_val = train_test_split(Y, T, X, W, test_size=.2)
# Generate test data
X_test = np.array(list(product(np.arange(0, 1, 0.01), repeat=n_x)))
- Models
# Model 1: LinearDML with random-forest nuisance models
est = LinearDML(model_y=RandomForestRegressor(),
                model_t=RandomForestRegressor(),
                random_state=123)
est.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred = est.effect(X_test)
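Because LinearDML's final stage uses statsmodels-based inference by default, confidence intervals for the estimated effects are available directly; a minimal sketch under that default:

# 95% confidence interval for the CATE at each test point
te_lb, te_ub = est.effect_interval(X_test, alpha=0.05)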
# 模型
est1 = SparseLinearDML(model_y=RandomForestRegressor(),
model_t=RandomForestRegressor(),
featurizer=PolynomialFeatures(degree=3),
random_state=123)
est1.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred1 = est1.effect(X_test)
# Model 3: generic DML with a Lasso final stage and degree-10 polynomial
# features; fit_intercept=False because PolynomialFeatures already adds
# the bias column
est2 = DML(model_y=RandomForestRegressor(),
           model_t=RandomForestRegressor(),
           model_final=Lasso(alpha=0.1, fit_intercept=False),
           featurizer=PolynomialFeatures(degree=10),
           random_state=123)
est2.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred2 = est2.effect(X_test)
# Model 4: CausalForestDML for non-parametric heterogeneity;
# tune() searches the forest hyperparameters before the final fit
est3 = CausalForestDML(model_y=RandomForestRegressor(),
                       model_t=RandomForestRegressor(),
                       criterion='mse', n_estimators=1000,
                       min_impurity_decrease=0.001,
                       random_state=123)
est3.tune(Y_train, T_train, X=X_train, W=W_train)
est3.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred3 = est3.effect(X_test)
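In the partially linear model that DML fits, `effect(X)` with the default `T0=0, T1=1` equals the constant marginal CATE θ(X) times the treatment difference, so for this single continuous treatment the two coincide; a quick sanity check (a sketch, assuming singleton output dimensions are squeezed as usual):

# effect(X, T0=0, T1=1) = theta(X) * (1 - 0) under the partially linear model
theta = est3.const_marginal_effect(X_test)
assert np.allclose(np.ravel(theta), np.ravel(te_pred3))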
# Plot the four estimates against the true effect
plt.figure(figsize=(10,6))
plt.plot(X_test, te_pred, label='DML default')
plt.plot(X_test, te_pred1, label='DML polynomial degree=3')
plt.plot(X_test, te_pred2, label='DML polynomial degree=10 with Lasso')
plt.plot(X_test, te_pred3, label='ForestDML')
expected_te = np.array([exp_te(x_i) for x_i in X_test])
plt.plot(X_test, expected_te, 'b--', label='True effect')
plt.ylabel('Treatment Effect')
plt.xlabel('x')
plt.legend()
plt.show()
[Figure: CATE estimates of the four DML models vs. the true effect exp(2x)]
- Model selection
score = {}
score["DML default"] = est.score(Y_val, T_val, X=X_val, W=W_val)
score["DML polynomial degree=3"] = est1.score(Y_val, T_val, X=X_val, W=W_val)
score["DML polynomial degree=10 with Lasso"] = est2.score(Y_val, T_val, X=X_val, W=W_val)
score["ForestDML"] = est3.score(Y_val, T_val, X=X_val, W=W_val)
mse_te={}
mse_te["DML default"] = ((expected_te - te_pred)**2).mean()
mse_te["DML polynomial degree=3"] = ((expected_te - te_pred1)**2).mean()
mse_te["DML polynomial degree=10 with Lasso"] = ((expected_te - te_pred2)**2).mean()
mse_te["ForestDML"] = ((expected_te - te_pred3)**2).mean()
print("best model selected by MSE of TE: ", min(mse_te, key=lambda x: mse_te.get(x)))
The `score` here is an MSE-style loss computed on the residualized validation data (the final-stage objective), so smaller is better. `mse_te`, the MSE against the true treatment effect, is only computable because the data are synthetic and the true effect exp(2x) is known.
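On real observational data the true effect is unknown, so model selection has to rely on the residual score alone; a minimal sketch reusing the `score` dict above:

# Pick the estimator with the lowest residual score (no ground truth needed)
print("best model selected by score:", min(score, key=score.get))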