```
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(eval = TRUE)
```

The purpose of the following case studies is to demonstrate the core functionalities of `DoubleML`.

The **DoubleML** package for R can be downloaded using the following command (requires prior installation of the `remotes` package).

```
remotes::install_github("DoubleML/doubleml-for-r")
```

Load the package once the installation has completed.

```
library(DoubleML)
```

The Python package `DoubleML` is available via the GitHub repository. For more information, please visit our user guide.

For our case study we download the Bonus data set from the Pennsylvania Reemployment Bonus experiment, and as a second example we simulate data from a partially linear regression model.

```
library(DoubleML)

# Load bonus data
df_bonus = fetch_bonus(return_type = "data.table")
head(df_bonus)

# Simulate data
set.seed(3141)
n_obs = 500
n_vars = 100
theta = 3
X = matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
d = X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs)
y = theta * d + X[, 1:3] %*% c(5, 5, 5) + rnorm(n_obs)
```

The simulated data follows a partially linear regression model,
\begin{align*}
Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}(\zeta | D,X) = 0, \\
D = m_0(X) + V, & &\mathbb{E}(V | X) = 0,
\end{align*}
where $Y$ is the outcome variable and $D$ is the policy variable of interest.
The high-dimensional vector $X = (X_1, \ldots, X_p)$ consists of other confounding covariates,
and $\zeta$ and $V$ are stochastic errors.

`DoubleMLData`

`DoubleML` provides interfaces to objects of class `data.table` as well as the R base classes `data.frame` and `matrix`. Details on the data-backend and the interfaces can be found in the user guide. The `DoubleMLData` class serves as data-backend and can be initialized from a dataframe by specifying the column `y_col = "inuidur1"` serving as outcome variable $Y$, the column(s) `d_cols = "tg"` serving as treatment variable $D$, and the columns `x_cols = c("female", "black", "othrace", "dep1", "dep2", "q2", "q3", "q4", "q5", "q6", "agelt35", "agegt54", "durable", "lusd", "husd")` specifying the confounders. Alternatively, a matrix interface can be used, as shown below for the simulated data.

```
# Specify the data and variables for the causal model
dml_data_bonus = DoubleMLData$new(df_bonus,
                                  y_col = "inuidur1",
                                  d_cols = "tg",
                                  x_cols = c("female", "black", "othrace", "dep1", "dep2",
                                             "q2", "q3", "q4", "q5", "q6",
                                             "agelt35", "agegt54", "durable", "lusd", "husd"))
print(dml_data_bonus)

# matrix interface to DoubleMLData
dml_data_sim = double_ml_data_from_matrix(X = X, y = y, d = d)
dml_data_sim
```

To estimate our partially linear regression (PLR) model with the double machine learning algorithm, we first have to specify machine learners to estimate $m_0$ and $g_0$. For the bonus data we use a random forest regression model, and for our simulated data from a sparse partially linear model we use a Lasso regression model. The implementation of `DoubleML` is based on the meta-package `mlr3` for R. For details on the specification of learners and their hyperparameters, we refer to the user guide section on learners, hyperparameters and hyperparameter tuning.

```
library(mlr3)
library(mlr3learners)

# suppress messages from the mlr3 package during fitting
lgr::get_logger("mlr3")$set_threshold("warn")

learner = lrn("regr.ranger", num.trees = 500, max.depth = 5, min.node.size = 2)
ml_l_bonus = learner$clone()
ml_m_bonus = learner$clone()

learner = lrn("regr.glmnet", lambda = sqrt(log(n_vars) / n_obs))
ml_l_sim = learner$clone()
ml_m_sim = learner$clone()
```

When initializing the object for PLR models, `DoubleMLPLR`, we can further set parameters specifying the resampling:

- the number of folds used for cross-fitting, `n_folds` (defaults to `n_folds = 5`), as well as
- the number of repetitions when applying repeated cross-fitting, `n_rep` (defaults to `n_rep = 1`).

Additionally, one can choose between the algorithms `"dml1"` and `"dml2"` via `dml_procedure` (defaults to `"dml2"`). Depending on the causal model, one can further choose between different Neyman-orthogonal score / moment functions. For the PLR model the default score is `"partialling out"`, i.e.,
\begin{align}\begin{aligned}\psi(W; \theta, \eta) &:= [Y - \ell(X) - \theta (D - m(X))] [D - m(X)].\end{aligned}\end{align}

Note that with this score, we do not estimate $g_0(X)$ directly, but the conditional expectation of $Y$ given $X$, $\ell_0(X) = \mathbb{E}[Y|X]$. The user guide provides details on sample-splitting, cross-fitting and repeated cross-fitting, the double machine learning algorithms and the score functions.
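As a sketch, the non-default choices described above can be passed directly to the constructor. The argument values below (`n_folds = 3`, `n_rep = 2`, `"dml1"`) are purely illustrative, and the learner objects are the ones defined for the simulated data earlier:

```r
# Illustrative sketch: non-default resampling, algorithm and score for a PLR model
obj_dml_plr_custom = DoubleMLPLR$new(dml_data_sim,
                                     ml_l = ml_l_sim,
                                     ml_m = ml_m_sim,
                                     n_folds = 3,               # folds for cross-fitting
                                     n_rep = 2,                 # repeated cross-fitting
                                     dml_procedure = "dml1",    # alternative DML algorithm
                                     score = "partialling out") # Neyman-orthogonal score
obj_dml_plr_custom$fit()
print(obj_dml_plr_custom)
```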

We now initialize `DoubleMLPLR` objects for our examples using default parameters. The models are estimated by calling the `fit()` method, and we can, for example, inspect the estimated treatment effect using the `summary()` method. A more detailed result summary can be obtained via the `print()` method. Besides the `fit()` method, `DoubleML` model classes also provide functionalities to perform statistical inference, like `bootstrap()`, `confint()` and `p_adjust()`; for details see the user guide section on variance estimation, confidence intervals and bootstrap standard errors.

```
set.seed(3141)
obj_dml_plr_bonus = DoubleMLPLR$new(dml_data_bonus,
                                    ml_l = ml_l_bonus,
                                    ml_m = ml_m_bonus)
obj_dml_plr_bonus$fit()
print(obj_dml_plr_bonus)

obj_dml_plr_sim = DoubleMLPLR$new(dml_data_sim,
                                  ml_l = ml_l_sim,
                                  ml_m = ml_m_sim)
obj_dml_plr_sim$fit()
print(obj_dml_plr_sim)
```
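The inference methods mentioned above can then be applied to the fitted objects. A minimal sketch; the bootstrap settings shown (`method = "normal"`, `n_rep_boot = 500`) are illustrative choices, not recommendations:

```r
# Confidence intervals based on the asymptotic normal distribution
obj_dml_plr_bonus$confint(level = 0.95)

# Multiplier bootstrap, e.g. to construct joint confidence intervals
obj_dml_plr_bonus$bootstrap(method = "normal", n_rep_boot = 500)
obj_dml_plr_bonus$confint(joint = TRUE)

# p-value adjustment, e.g. via the Romano-Wolf method
obj_dml_plr_sim$p_adjust(method = "romano-wolf")
```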
