View source: R/policy_eval_online.R
policy_eval_online | R Documentation
policy_eval_online() is used to estimate
the value of a given fixed policy
or a data adaptive policy (e.g., a policy
learned from the data). policy_eval_online()
is also used to estimate the subgroup average
treatment effect as defined by the (learned) policy.
The evaluation is based on an online/sequential validation
estimation scheme, making the estimation approach valid for a
non-converging policy in the absence of a heterogeneous treatment effect
(exceptional law), see details.
policy_eval_online(
policy_data,
policy = NULL,
policy_learn = NULL,
g_functions = NULL,
g_models = g_glm(),
g_full_history = FALSE,
save_g_functions = TRUE,
q_functions = NULL,
q_models = q_glm(),
q_full_history = FALSE,
save_q_functions = TRUE,
c_functions = NULL,
c_models = NULL,
c_full_history = FALSE,
save_c_functions = TRUE,
m_function = NULL,
m_model = NULL,
m_full_history = FALSE,
save_m_function = TRUE,
target = "value",
M = 4,
train_block_size = get_n(policy_data)/5,
name = NULL,
min_subgroup_size = 1
)
policy_data | Policy data object created by policy_data().
policy | Policy object created by policy_def().
policy_learn | Policy learner object created by policy_learn().
g_functions | Fitted g-model objects, see nuisance_functions. Preferably, use get_g_functions().
g_models | List of action probability models/g-models for each stage created by g_glm() or similar functions, see g_model.
g_full_history | If TRUE, the full history is used to fit each g-model. If FALSE, the state/Markov type history is used to fit each g-model.
save_g_functions | If TRUE, the fitted g-functions are saved.
q_functions | Fitted Q-model objects, see nuisance_functions. Only valid if the Q-functions are fitted using the same policy. Preferably, use get_q_functions().
q_models | Outcome regression models/Q-models created by q_glm() or similar functions, see q_model.
q_full_history | Similar to g_full_history.
save_q_functions | Similar to save_g_functions.
c_functions | Fitted c-model/censoring probability model objects. Preferably, reuse the fitted c-functions from a previous evaluation.
c_models | List of right-censoring probability models, see c_model.
c_full_history | Similar to g_full_history.
save_c_functions | Similar to save_g_functions.
m_function | Fitted outcome model object for stage K+1. Preferably, reuse the fitted m-function from a previous evaluation.
m_model | Outcome model for the utility at stage K+1. Only used if the final utility contribution is missing/has been right-censored.
m_full_history | Similar to g_full_history.
save_m_function | Similar to save_g_functions.
target | Character string. Either "value" or "subgroup". If "value", the target parameter is the policy value. If "subgroup", the target parameter is the subgroup average treatment effect given by the policy, see details. "subgroup" is only implemented for a single-stage problem (K = 1) with a binary action set.
M | Number of folds for online estimation/sequential validation, excluding the initial training block, see details.
train_block_size | Integer. Size of the initial training block, only used for training of the policy and nuisance models, see details.
name | Character string.
min_subgroup_size | Minimum number of observations in the evaluated subgroup (only used if target = "subgroup").
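A minimal usage sketch, assuming the simulation helper sim_single_stage() and the constructors policy_data() and policy_learn() from polle; all argument values are illustrative:

library(polle)

# simulate single-stage data with covariates Z, B, L, action A and utility U
d <- sim_single_stage(2000, seed = 1)

# register the data as a policy data object
pd <- policy_data(d,
                  action = "A",
                  covariates = list("Z", "B", "L"),
                  utility = "U")

# online/sequential evaluation of a Q-learning policy using M = 4 validation
# blocks and the default initial training block of size n/5
pe <- policy_eval_online(pd,
                         policy_learn = policy_learn(type = "ql"),
                         M = 4)
pe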
Each observation has the sequential form
O = \{B, U_1, X_1, A_1, ..., U_K, X_K, A_K, U_{K+1}\},
for a possibly stochastic number of stages K.
B is a vector of baseline covariates.
U_k is the reward at stage k
(not influenced by the action A_k).
X_k is a vector of state
covariates summarizing the state at stage k.
A_k is the categorical action
within the action set \mathcal{A} at stage k.
The utility is given by the sum of the rewards, i.e.,
U = \sum_{k = 1}^{K+1} U_k.
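As a sketch of how this sequential structure maps onto a policy data object, assuming the two-stage simulation helper sim_two_stage() and the column/argument names used in its example data (illustrative here):

library(polle)

# simulated two-stage data (K = 2)
d2 <- sim_two_stage(2000, seed = 1)

# register the sequential structure: baseline covariates ~ B, stage
# covariates (L_k, C_k) ~ X_k, actions A_k, and rewards U_1, U_2, U_3
# summing to the utility U
pd2 <- policy_data(d2,
                   action = c("A_1", "A_2"),
                   baseline = c("B", "BB"),
                   covariates = list(L = c("L_1", "L_2"),
                                     C = c("C_1", "C_2")),
                   utility = c("U_1", "U_2", "U_3"))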
A (subgroup) policy is a set of functions
d = \{d_1, ..., d_K\},
where d_k for k\in \{1, ..., K\}
maps a subset or function V_k of \{B, X_1, A_1, ..., A_{k-1}, X_k\} into the
action set (or set of subgroups).
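For illustration, such a policy can be specified with policy_def(); the sketch below assumes a single-stage setting with a state covariate Z and that policy functions may be given directly as functions of the named history covariates:

library(polle)

# static policy: always select action "1"
p_static <- policy_def(1, name = "A=1")

# dynamic (subgroup) policy: the chosen action, and hence the subgroup,
# is a function of the state covariate Z
p_dynamic <- policy_def(function(Z) as.numeric(Z > 0), name = "Z>0")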
Recursively define the Q-models (q_models):
Q^d_K(h_K, a_K) = \mathbb{E}[U|H_K = h_K, A_K = a_K]
Q^d_k(h_k, a_k) = \mathbb{E}[Q^d_{k+1}(H_{k+1},
d_{k+1}(V_{k+1}))|H_k = h_k, A_k = a_k].
If q_full_history = TRUE,
H_k = \{B, X_1, A_1, ..., A_{k-1}, X_k\}, and if
q_full_history = FALSE, H_k = \{B, X_k\}.
The g-models (g_models) are defined as
g_k(h_k, a_k) = \mathbb{P}(A_k = a_k|H_k = h_k).
If g_full_history = TRUE,
H_k = \{B, X_1, A_1, ..., A_{k-1}, X_k\}, and if
g_full_history = FALSE, H_k = \{B, X_k\}.
Furthermore, if g_full_history = FALSE and g_models is a
single model, it is assumed that g_1(h_1, a_1) = ... = g_K(h_K, a_K).
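As a sketch of how user-specified g-models and Q-models enter the evaluation, assuming g_glm() and q_glm() accept model formulas and continuing the single-stage objects pd and p_dynamic from the sketches above (the formulas are illustrative):

# a single g-model shared across stages (g_full_history = FALSE) and a
# Q-model with action interactions
pe <- policy_eval_online(pd,
                         policy = p_dynamic,
                         g_models = g_glm(formula = ~ Z + B + L),
                         q_models = q_glm(formula = ~ A * (Z + B + L)),
                         M = 4)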
If target = "value", policy_eval_online()
returns the estimates of
the value, i.e., the expected potential utility under the policy (coef):
\mathbb{E}[U^{(d)}].
If target = "subgroup", K = 1, \mathcal{A} = \{0,1\},
and d_1(V_1) \in \{s_1, s_2\}, policy_eval_online()
returns the estimates of the subgroup average
treatment effect (coef):
\mathbb{E}[U^{(1)} - U^{(0)}| d_1(\cdot) = s], \quad s\in \{s_1,s_2\}.
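Continuing the single-stage sketch above, a subgroup target could be requested as follows (a sketch; the argument values are illustrative):

# subgroup average treatment effect defined by a learned Q-learning policy
pe_sub <- policy_eval_online(pd,
                             policy_learn = policy_learn(type = "ql"),
                             target = "subgroup",
                             min_subgroup_size = 10)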
Estimation of the target parameter is based on online estimation/sequential
validation using the doubly robust score. The following illustrates
online estimation using M = 5 steps and an initial training block of
size train_block_size = l.
Step 1:
The n observations are randomly ordered. In step 1,
the first \{1,...,l\} observations are used to fit the
Q-models, g-models, the policy (if using the policy_learn argument), and other required models.
We denote the collection of these fitted models as P.
The remaining observations are split into M blocks of size m = (n-l)/M, which
for simplicity we assume to be a whole number. In step 1, the target
parameter is estimated using the associated doubly robust score Z(P)
evaluated on the first validation fold \{l+1,...,l+m\}:
\frac{\sum_{i = l+1}^{l+m} {\widehat \sigma_{i}^{-1}}
Z(\widehat P_i)(O_i)}
{\sum_{i = l+1}^{l+m} \widehat \sigma_{i}^{-1}},
where \widehat P_i for i \in \{l+1,...,l+m\}
refers to the fitted models trained on \{1,...,l\}, and \widehat \sigma_i
is the in-sample estimate of the standard deviation based on the training observations \{1,...,l\}.
We give an exact expression for \widehat \sigma_i for each target parameter below.
Note that \widehat \sigma_i is constant for i \in \{l+1,...,l+m\}, but it is
convenient to keep the same index for \widehat \sigma.
Step 2 to M:
In step 2, observations with index \{1,...,l+m\} are used to fit the model collection P,
as well as the in-sample estimate of the standard deviation. For i \in \{l+m+1,...,l+2m\} these are
denoted \widehat P_i and \widehat \sigma_i.
This sequential model fitting is repeated for all M
steps and the updated online estimator is given by
\frac{\sum_{i = l+1}^{n} {\widehat \sigma_{i}^{-1}}
Z(\widehat P_i)(O_i)}
{\sum_{i = l+1}^n \widehat \sigma_{i}^{-1}},
with an associated standard error estimate given by
\frac{\left(\frac{1}{n-l}\sum_{i = l+1}^n \widehat \sigma_{i}^{-1}\right)^{-1}}{\sqrt{n-l}}.
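To make the weighting explicit, the following self-contained numerical sketch computes the online estimate and its standard error from hypothetical per-observation scores Z_i and in-sample standard deviations sigma_i (constant within each validation block):

# hypothetical validation data: M = 5 blocks of size m = 100 (n - l = 500)
set.seed(1)
M <- 5
m <- 100
Z     <- rnorm(M * m, mean = 1)             # doubly robust scores Z(P_i)(O_i)
sigma <- rep(runif(M, 0.8, 1.2), each = m)  # sigma_i, constant within a block

# online estimate: inverse-sigma weighted average of the validation scores
est <- sum(Z / sigma) / sum(1 / sigma)

# associated standard error estimate
n_valid <- M * m
se <- (mean(1 / sigma))^(-1) / sqrt(n_valid)

c(estimate = est, std.err = se)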
target = "value":
For a policy value target the doubly robust score is given by
Z(d, g, Q^d)(O) = Q^d_1(H_1 , d_1(V_1)) +
\sum_{r = 1}^K \prod_{j = 1}^{r}
\frac{I\{A_j = d_j(\cdot)\}}{g_{j}(H_j, A_j)}
\{Q_{r+1}^d(H_{r+1} , d_{r+1}(V_{r+1})) - Q_{r}^d(H_r , d_r(V_r))\},
with the convention Q^d_{K+1} = U.
The influence function (curve) of the associated one-step estimator is
Z(d, g, Q^d)(O) - \mathbb{E}[Z(d, g, Q^d)(O)],
which is used to estimate the in-sample standard deviation. For example,
in step 2, i.e., for i \in \{l+m+1,...,l+2m\},
\widehat \sigma_i^2 = \frac{1}{l+m}\sum_{j=1}^{l+m} \left(Z(\widehat d_i, \widehat g_i, \widehat Q_i)(O_j) - \frac{1}{l+m}\sum_{r=1}^{l+m} Z(\widehat d_i, \widehat g_i, \widehat Q_i)(O_r) \right)^2.
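As a numerical illustration of this in-sample estimate, assuming a vector Z_train of scores evaluated on the training observations 1, ..., l+m (note the 1/(l+m) normalization):

# hypothetical scores evaluated on the training observations 1, ..., l + m
set.seed(1)
Z_train <- rnorm(300, mean = 1)

# in-sample variance with 1/(l + m) normalization (not the 1/(l + m - 1)
# normalization used by var()/sd())
sigma2_hat <- mean((Z_train - mean(Z_train))^2)
sigma_hat  <- sqrt(sigma2_hat)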
target = "subgroup":
For a subgroup average treatment effect target,
where K = 1 (single-stage),
\mathcal{A} = \{0,1\} (binary treatment), and
d_1(V_1) \in \{s_1, s_2\} (dichotomous subgroup policy), the
doubly robust score is given by
Z(d, g, Q, D)(O) = \frac{I\{d_1(\cdot) = s\}}{D}
\Big\{Z_1(1, g, Q)(O) - Z_1(0, g, Q)(O) \Big\},
where
Z_1(a, g, Q)(O) = Q_1(H_1 , a) +
\frac{I\{A = a\}}{g_1(H_1, a)}
\{U - Q_{1}(H_1 , a)\},
and D = \mathbb{P}(d_1(V_1) = s).
The associated one-step/estimating equation estimator has influence function
\frac{ I\{d_1(\cdot) = s\}}{D}
\Big\{Z_1(1, g, Q)(O) - Z_1(0, g, Q)(O) - \mathbb{E}[Z_1(1, g, Q)(O)
- Z_1(0, g, Q)(O) | d_1(\cdot) = s]\Big\},
which is used to estimate the standard deviation \widehat \sigma.
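For concreteness, a self-contained numerical sketch of the single-stage subgroup score and the resulting subgroup average treatment effect estimate, using hypothetical fitted values for Q_1 and g_1:

# hypothetical single-stage quantities
set.seed(1)
n  <- 1000
g1 <- runif(n, 0.2, 0.8)             # g_1(H_1, 1); g_1(H_1, 0) = 1 - g1
A  <- rbinom(n, 1, g1)               # observed binary action
Q1 <- rnorm(n, mean = 1)             # Q_1(H_1, 1)
Q0 <- rnorm(n)                       # Q_1(H_1, 0)
U  <- ifelse(A == 1, Q1, Q0) + rnorm(n)
d  <- as.numeric(Q1 - Q0 > 0)        # subgroup/policy assignment d_1(V_1)

# doubly robust scores Z_1(a, g, Q)(O) for a = 1 and a = 0
Z1 <- Q1 + (A == 1) / g1       * (U - Q1)
Z0 <- Q0 + (A == 0) / (1 - g1) * (U - Q0)

# subgroup average treatment effect estimate for the subgroup d_1(.) = 1
D_hat <- mean(d == 1)
mean((d == 1) / D_hat * (Z1 - Z0))   # equals mean(Z1[d == 1] - Z0[d == 1])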
policy_eval_online() returns an object of class "policy_eval_online" inheriting from "policy_eval".
The object is a list containing the following elements:
coef | Numeric vector. The estimated target parameters: policy value or subgroup average treatment effect.
vcov | Numeric vector. The estimated squared standard deviation associated with coef.
target | Character string. The target parameter ("value" or "subgroup").
id | Character vector. The IDs of the observations.
name | Character vector. Names for each element in coef.
train_sequential_index | List of indices used for training at each step.
valid_sequential_index | List of indices used for validation at each step.
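A sketch of how the returned elements might be accessed, continuing the usage sketch above; the coef() and vcov() methods inherited from "policy_eval" are an assumption:

pe$coef                  # estimated policy value (or subgroup effects)
sqrt(pe$vcov)            # square root of the reported squared standard deviation
pe$target                # "value" or "subgroup"

# if the "policy_eval" methods apply to the inherited class:
coef(pe)
vcov(pe)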
Luedtke, Alexander R., and Mark J. van der Laan. "Statistical Inference for the Mean Outcome Under a Possibly Non-Unique Optimal Treatment Strategy." Annals of Statistics 44(2) (2016): 713-742. doi:10.1214/15-AOS1384.