Double Machine Learning (DML): Principles and Applications
- 1. Why do we need DML?
- 2. How DML works
- 2.1 Notation
- 2.2 The DML training procedure
- 2.3 Why does residual orthogonalization yield an unbiased causal effect?
- 2.4 Estimating the ATE with DML
- 2.5 Estimating the CATE with DML
- 2.6 Directly predicting the counterfactual Y
- 3. EconML DML in Practice
2.6 Directly predicting the counterfactual Y
When the CATE is nonlinear, an alternative is to stop estimating a linear approximation of the CATE and instead predict the counterfactual outcomes directly: model Y as a function of the treatment and covariates, then compare predictions under different treatment values. (This approach is widely used in practice, but it has no rigorous theoretical guarantee!)
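A minimal sketch of this idea (an S-learner-style counterfactual predictor; the learner choice and the name `te_naive` are illustrative assumptions, with `Y`, `T`, `X`, `W` as in the data-generation code of Section 3 below):

# Counterfactual prediction with a single outcome model (S-learner style):
# treat T as just another feature, then difference the predictions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()       # illustrative learner choice
model.fit(np.column_stack([T, X, W]), Y)  # Y, T, X, W as in Section 3

# Predict Y under two hypothetical treatment levels and take the difference
y_t0 = model.predict(np.column_stack([np.zeros_like(T), X, W]))  # T = 0
y_t1 = model.predict(np.column_stack([np.ones_like(T), X, W]))   # T = 1
te_naive = y_t1 - y_t0

Note that this is exactly the approach the caveat above is about: the prediction differences are not orthogonalized, so confounding bias in the fitted model flows straight into `te_naive`.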
3. EconML DML in Practice
- Official EconML usage example: the Double Machine Learning Notebook
- The notebook contains many small examples:
- Example Usage with Single Continuous Treatment Synthetic Data and Model Selection
- Example Usage with Single Binary Treatment Synthetic Data and Confidence Intervals
- Example Usage with Multiple Continuous Treatment Synthetic Data
- Example Usage with Single Continuous Treatment Observational Data
- Example Usage with Multiple Continuous Treatment, Multiple Outcome Observational Data
- Together these cover: continuous-valued T, binary 0/1 T, multiple treatments T, and multiple outcomes Y (a binary-treatment configuration is sketched after the method list below)
# Continuous-valued T
array([ 0.41083324,  0.57893171,  1.08130798, ..., -0.43799495,
        1.61941775,  1.64209826])
# Binary 0/1 T
array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0])
# Multi-T
array([[1.35325451, 1.15373159, 0.46373402],
       [1.35325451, 1.15373159, 0.98954119],
       [1.35325451, 0.87129337, 0.73716407],
       ...,
       [1.13462273, 1.03673688, 0.46373402],
       [1.0260416 , 0.95165788, 0.39877612],
       [1.05779029, 0.78390154, 0.55961579]])
# Multi-Y
array([[ 9.01869549,  8.40737833,  9.26482856],
       [ 8.72323127,  8.44934252,  8.98719682],
       [ 8.25322765,  9.91145572,  8.83171192],
       ...,
       [ 9.93653541,  9.51546936,  9.50599061],
       [10.85591733,  9.31793838, 10.9273763 ],
       [10.23652533, 11.20553036,  8.85936345]])
- Methods include:
- LinearDML: the default
- SparseLinearDML: Polynomial Features for Heterogeneity
- DML: Polynomial Features with Regularization
- CausalForestDML: Non-Parametric Heterogeneity with a Causal Forest
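For the binary 0/1 treatment case mentioned above, the same estimators apply, but `model_t` becomes a classifier and the treatment must be flagged as discrete. A minimal sketch based on the EconML API (`est_bin` is a hypothetical name; the configuration is an assumption, not taken verbatim from the notebook):

# Binary treatment: use a classifier for model_t and set
# discrete_treatment=True so residualization uses propensity scores.
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

est_bin = LinearDML(model_y=RandomForestRegressor(),
                    model_t=RandomForestClassifier(),
                    discrete_treatment=True,
                    random_state=123)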
- Data generation
import econml
## Ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Main imports
from econml.dml import DML, LinearDML, SparseLinearDML, CausalForestDML
# Helper imports
import numpy as np
from itertools import product
from sklearn.linear_model import (Lasso, LassoCV, LogisticRegression,
                                  LogisticRegressionCV, LinearRegression,
                                  MultiTaskElasticNet, MultiTaskElasticNetCV)
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split
%matplotlib inline
# Treatment effect function: the true CATE is exp(2*x)
def exp_te(x):
    return np.exp(2 * x[0])
# DGP constants
np.random.seed(123)
n = 2000
n_w = 30
support_size = 5
n_x = 1
# Outcome support
support_Y = np.random.choice(np.arange(n_w), size=support_size, replace=False)
coefs_Y = np.random.uniform(0, 1, size=support_size)
epsilon_sample = lambda n: np.random.uniform(-1, 1, size=n)
# Treatment support: the same columns of W drive T, so W confounds T and Y
support_T = support_Y
coefs_T = np.random.uniform(0, 1, size=support_size)
eta_sample = lambda n: np.random.uniform(-1, 1, size=n)
# Generate controls, covariates, treatments and outcomes
W = np.random.normal(0, 1, size=(n, n_w))
X = np.random.uniform(0, 1, size=(n, n_x))
# Heterogeneous treatment effects
TE = np.array([exp_te(x_i) for x_i in X])
T = np.dot(W[:, support_T], coefs_T) + eta_sample(n)
Y = TE * T + np.dot(W[:, support_Y], coefs_Y) + epsilon_sample(n)
Y_train, Y_val, T_train, T_val, X_train, X_val, W_train, W_val = train_test_split(Y, T, X, W, test_size=.2)
# Generate test data
X_test = np.array(list(product(np.arange(0, 1, 0.01), repeat=n_x)))
- Models
# Model 1: LinearDML with random-forest nuisance models
est = LinearDML(model_y=RandomForestRegressor(),
                model_t=RandomForestRegressor(),
                random_state=123)
est.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred = est.effect(X_test)
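Because LinearDML's final stage uses statsmodels-based inference by default, confidence intervals for the estimated effects are available directly; a minimal sketch under that default:

# 95% confidence interval for the CATE at each test point
te_lb, te_ub = est.effect_interval(X_test, alpha=0.05)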
# 模型
est1 = SparseLinearDML(model_y=RandomForestRegressor(),
model_t=RandomForestRegressor(),
featurizer=PolynomialFeatures(degree=3),
random_state=123)
est1.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred1 = est1.effect(X_test)
# Model 3: generic DML with a Lasso final stage and degree-10 polynomial
# features; fit_intercept=False because PolynomialFeatures already adds
# the bias column
est2 = DML(model_y=RandomForestRegressor(),
           model_t=RandomForestRegressor(),
           model_final=Lasso(alpha=0.1, fit_intercept=False),
           featurizer=PolynomialFeatures(degree=10),
           random_state=123)
est2.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred2 = est2.effect(X_test)
# Model 4: CausalForestDML for non-parametric heterogeneity;
# tune() searches the forest hyperparameters before the final fit
est3 = CausalForestDML(model_y=RandomForestRegressor(),
                       model_t=RandomForestRegressor(),
                       criterion='mse', n_estimators=1000,
                       min_impurity_decrease=0.001,
                       random_state=123)
est3.tune(Y_train, T_train, X=X_train, W=W_train)
est3.fit(Y_train, T_train, X=X_train, W=W_train)
te_pred3 = est3.effect(X_test)
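In the partially linear model that DML fits, `effect(X)` with the default `T0=0, T1=1` equals the constant marginal CATE θ(X) times the treatment difference, so for this single continuous treatment the two coincide; a quick sanity check (a sketch, assuming singleton output dimensions are squeezed as usual):

# effect(X, T0=0, T1=1) = theta(X) * (1 - 0) under the partially linear model
theta = est3.const_marginal_effect(X_test)
assert np.allclose(np.ravel(theta), np.ravel(te_pred3))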
# Plot the four estimates against the true effect
plt.figure(figsize=(10,6))
plt.plot(X_test, te_pred, label='DML default')
plt.plot(X_test, te_pred1, label='DML polynomial degree=3')
plt.plot(X_test, te_pred2, label='DML polynomial degree=10 with Lasso')
plt.plot(X_test, te_pred3, label='ForestDML')
expected_te = np.array([exp_te(x_i) for x_i in X_test])
plt.plot(X_test, expected_te, 'b--', label='True effect')
plt.ylabel('Treatment Effect')
plt.xlabel('x')
plt.legend()
plt.show()
[Figure: CATE estimates of the four DML models vs. the true effect exp(2x)]
- Model selection
score = {}
score["DML default"] = est.score(Y_val, T_val, X=X_val, W=W_val)
score["DML polynomial degree=3"] = est1.score(Y_val, T_val, X=X_val, W=W_val)
score["DML polynomial degree=10 with Lasso"] = est2.score(Y_val, T_val, X=X_val, W=W_val)
score["ForestDML"] = est3.score(Y_val, T_val, X=X_val, W=W_val)
mse_te={}
mse_te["DML default"] = ((expected_te - te_pred)**2).mean()
mse_te["DML polynomial degree=3"] = ((expected_te - te_pred1)**2).mean()
mse_te["DML polynomial degree=10 with Lasso"] = ((expected_te - te_pred2)**2).mean()
mse_te["ForestDML"] = ((expected_te - te_pred3)**2).mean()
print("best model selected by MSE of TE: ", min(mse_te, key=lambda x: mse_te.get(x)))
The `score` here is an MSE-style loss computed on the residualized validation data (the final-stage objective), so smaller is better. `mse_te`, the MSE against the true treatment effect, is only computable because the data are synthetic and the true effect exp(2x) is known.
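On real observational data the true effect is unknown, so model selection has to rely on the residual score alone; a minimal sketch reusing the `score` dict above:

# Pick the estimator with the lowest residual score (no ground truth needed)
print("best model selected by score:", min(score, key=score.get))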