policy_eval_online: Online/Sequential Policy Evaluation

View source: R/policy_eval_online.R

Online/Sequential Policy Evaluation

Description

policy_eval_online() is used to estimate the value of a given fixed policy or a data-adaptive policy (e.g., a policy learned from the data). policy_eval_online() is also used to estimate the subgroup average treatment effect defined by the (learned) policy. The evaluation is based on an online/sequential validation estimation scheme, which makes the estimation approach valid even for a non-converging policy, e.g., under no heterogeneous treatment effect (exceptional law), see details.
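
As an illustration, a minimal call sketch, assuming polle's simulated single-stage data helper sim_single_stage() and that policy_def() accepts a constant action (as in the package examples):

library("polle")
d <- sim_single_stage(1000, seed = 1)
pd <- policy_data(d, action = "A", covariates = list("Z", "B", "L"), utility = "U")

## evaluate a fixed policy (always assign action 1) and a learned policy
pe_fixed   <- policy_eval_online(pd, policy = policy_def(1), M = 4)
pe_learned <- policy_eval_online(pd, policy_learn = policy_learn(type = "ql"), M = 4)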

Usage

policy_eval_online(
  policy_data,
  policy = NULL,
  policy_learn = NULL,
  g_functions = NULL,
  g_models = g_glm(),
  g_full_history = FALSE,
  save_g_functions = TRUE,
  q_functions = NULL,
  q_models = q_glm(),
  q_full_history = FALSE,
  save_q_functions = TRUE,
  c_functions = NULL,
  c_models = NULL,
  c_full_history = FALSE,
  save_c_functions = TRUE,
  m_function = NULL,
  m_model = NULL,
  m_full_history = FALSE,
  save_m_function = TRUE,
  target = "value",
  M = 4,
  train_block_size = get_n(policy_data)/5,
  name = NULL,
  min_subgroup_size = 1
)

Arguments

policy_data

Policy data object created by policy_data().

policy

Policy object created by policy_def().

policy_learn

Policy learner object created by policy_learn().

g_functions

Fitted g-model objects, see nuisance_functions. Preferably, use g_models.

g_models

List of action probability models/g-models for each stage created by g_empir(), g_glm(), g_rf(), g_sl() or similar functions. Only used for evaluation if g_functions is NULL. If a single model is provided and g_full_history is FALSE, a single g-model is fitted across all stages. If g_full_history is TRUE, the model is reused at every stage (see the sketch following the argument list).

g_full_history

If TRUE, the full history is used to fit each g-model. If FALSE, the state/Markov type history is used to fit each g-model.

save_g_functions

If TRUE, the fitted g-functions are saved.

q_functions

Fitted Q-model objects, see nuisance_functions. Only valid if the Q-functions are fitted using the same policy. Preferably, use q_models.

q_models

Outcome regression models/Q-models created by q_glm(), q_rf(), q_sl() or similar functions. Only used for evaluation if q_functions is NULL. If a single model is provided, the model is reused at every stage.

q_full_history

Similar to g_full_history.

save_q_functions

Similar to save_g_functions.

c_functions

Fitted c-model/censoring probability model objects. Preferably, use c_models.

c_models

List of right-censoring probability models, see c_model.

c_full_history

Similar to g_full_history.

save_c_functions

Similar to save_g_functions.

m_function

Fitted outcome model object for stage K+1. Preferably, use m_model.

m_model

Outcome model for the utility at stage K+1. Only used if the final utility contribution is missing or has been right-censored.

m_full_history

Similar to g_full_history.

save_m_function

Similar to save_g_functions.

target

Character string. Either "value" or "subgroup". If "value", the target parameter is the policy value. If "subgroup", the target parameter is the subgroup average treatment effect defined by the policy, see details. "subgroup" is only implemented for the doubly robust score in the single-stage case with a dichotomous action set.

M

Number of folds for online estimation/sequential validation excluding the initial training block, see details.

train_block_size

Integer. Size of the initial training block, which is only used for training the policy and nuisance models, see details.

name

Character string. Optional name used to label the estimated target parameter(s) in coef.

min_subgroup_size

Minimum number of observations in the evaluated subgroup (Only used if target = "subgroup").
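
The sketch referenced above illustrates explicit nuisance-model specification. It assumes polle's simulated two-stage data helper sim_two_stage() and its column names:

d2 <- sim_two_stage(1000, seed = 1)
pd2 <- policy_data(d2,
                   action = c("A_1", "A_2"),
                   baseline = c("B", "BB"),
                   covariates = list(L = c("L_1", "L_2"),
                                     C = c("C_1", "C_2")),
                   utility = c("U_1", "U_2", "U_3"))

pe2 <- policy_eval_online(
  pd2,
  policy_learn = policy_learn(type = "ql"),
  g_models = list(g_glm(), g_glm()),   # one g-model per stage
  g_full_history = TRUE,               # condition on the full history
  q_models = q_glm(),                  # a single Q-model, reused at every stage
  M = 4,
  train_block_size = get_n(pd2) / 5
)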

Details

Setup

Each observation has the sequential form

O = \{B, U_1, X_1, A_1, ..., U_K, X_K, A_K, U_{K+1}\},

for a possibly stochastic number of stages K.

  • B is a vector of baseline covariates.

  • U_k is the reward at stage k (not influenced by the action A_k).

  • X_k is a vector of state covariates summarizing the state at stage k.

  • A_k is the categorical action within the action set \mathcal{A} at stage k.

The utility is given by the sum of the rewards, i.e., U = \sum_{k = 1}^{K+1} U_k.

A (subgroup) policy is a set of functions

d = \{d_1, ..., d_K\},

where d_k for k\in \{1, ..., K\} maps a subset (or function) V_k of \{B, X_1, A_1, ..., A_{k-1}, X_k\} into the action set (or set of subgroups).

Recursively define the Q-models (q_models):

Q^d_K(h_K, a_K) = \mathbb{E}[U|H_K = h_K, A_K = a_K]

Q^d_k(h_k, a_k) = \mathbb{E}[Q^d_{k+1}(H_{k+1}, d_{k+1}(V_{k+1}))|H_k = h_k, A_k = a_k].

If q_full_history = TRUE, H_k = \{B, X_1, A_1, ..., A_{k-1}, X_k\}, and if q_full_history = FALSE, H_k = \{B, X_k\}.

The g-models (g_models) are defined as

g_k(h_k, a_k) = \mathbb{P}(A_k = a_k|H_k = h_k).

If g_full_history = TRUE, H_k = \{B, X_1, A_1, ..., A_{k-1}, X_k\}, and if g_full_history = FALSE, H_k = \{B, X_k\}. Furthermore, if g_full_history = FALSE and g_models is a single model, it is assumed that g_1(h_1, a_1) = ... = g_K(h_K, a_K).

Target parameters

If target = "value", policy_eval_online() returns the estimate of the value, i.e., the expected potential utility under the policy (coef):

\mathbb{E}[U^{(d)}]

If target = "subgroup", K = 1, \mathcal{A} = \{0,1\}, and d_1(V_1) \in \{s_1, s_2\}, policy_eval_online() returns the estimates of the subgroup average treatment effect (coef):

\mathbb{E}[U^{(1)} - U^{(0)} | d_1(\cdot) = s], \quad s \in \{s_1, s_2\}.
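
As a sketch, the subgroup average treatment effect defined by a learned single-stage policy might be estimated as follows (reusing pd from the example in the description; min_subgroup_size guards against very small subgroups):

pe_sub <- policy_eval_online(
  pd,
  policy_learn = policy_learn(type = "ql"),
  target = "subgroup",
  min_subgroup_size = 20
)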

Online estimation/sequential validation

Estimation of the target parameter is based on online estimation/sequential validation using the doubly robust score. The following figure illustrates online estimation using M = 5 steps and an initial training block of size train_block_size = l.

[Figure: Online estimation scheme]

Step 1:

The n observations are randomly ordered. In step 1, the first l observations \{1,...,l\}, highlighted in teal/blue in the figure, are used to fit the Q-models, the g-models, the policy (if using the policy_learn argument), and any other required models. We denote the collection of these fitted models by P. The remaining observations are split into M blocks of size m = (n-l)/M, which for simplicity we assume to be a whole number. In step 1, the target parameter is estimated using the associated doubly robust score Z(P) evaluated on the first validation fold \{l+1,...,l+m\}, highlighted in pink:

\frac{\sum_{i = l+1}^{l+m} {\widehat \sigma_{i}^{-1}} Z(\widehat P_i)(O_i)} {\sum_{i = l+1}^{l+m} \widehat \sigma_{i}^{-1}},

where \widehat P_i for i \in \{l+1,...,l+m\} refers to the fitted models trained on \{1,...,l\}, and \widehat \sigma_i is the in-sample estimate of the standard deviation based on the training observations \{1,...,l\}. An exact expression for \widehat \sigma_i is given below for each target parameter. Note that \widehat \sigma_i is constant for i \in \{l+1,...,l+m\}, but it is convenient to keep the observation index on \widehat \sigma.

Step 2 to M:

In step 2, the observations with indices \{1,...,l+m\} are used to fit the model collection P and to compute the in-sample estimate of the standard deviation; for i \in \{l+m+1,...,l+2m\} these are denoted \widehat P_i and \widehat \sigma_i. This sequential model fitting is repeated for all M steps, and the resulting online estimator is given by

\frac{\sum_{i = l+1}^{n} {\widehat \sigma_{i}^{-1}} Z(\widehat P_i)(O_i)} {\sum_{i = l+1}^n \widehat \sigma_{i}^{-1}},

with an associated standard error estimate given by

\frac{\left(\frac{1}{n-l}\sum_{i = l+1}^n \widehat \sigma_{i}^{-1}\right)^{-1}}{\sqrt{n-l}}.
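
The online estimator and its standard error are weighted averages of the validation scores. A minimal sketch (not polle's internal code; the helper name is purely illustrative), given the per-observation scores Z and the corresponding in-sample standard deviations sigma for i = l+1, ..., n:

online_estimate <- function(Z, sigma) {
  w <- 1 / sigma                              # inverse standard deviation weights
  estimate <- sum(w * Z) / sum(w)             # weighted online estimator
  std.err  <- 1 / (mean(w) * sqrt(length(Z))) # standard error from the display above
  c(estimate = estimate, std.err = std.err)
}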

Doubly robust scores

target = "value":

For a policy value target the doubly robust score is given by

Z(d, g, Q^d)(O) = Q^d_1(H_1, d_1(V_1)) + \sum_{r = 1}^K \prod_{j = 1}^{r} \frac{I\{A_j = d_j(V_j)\}}{g_{j}(H_j, A_j)} \{Q^d_{r+1}(H_{r+1}, d_{r+1}(V_{r+1})) - Q^d_{r}(H_r, d_r(V_r))\},

using the convention Q^d_{K+1}(H_{K+1}, d_{K+1}(V_{K+1})) = U.

The influence function (curve) of the associated one-step estimator is

Z(d, g, Q^d)(O) - \mathbb{E}[Z(d,g, Q^d)(O)],

which is used to estimate the in-sample standard deviation. For example, in step 2, i.e., for i \in \{l+m+1,...,l+2m\}:

\widehat \sigma_i^2 = \frac{1}{l+m}\sum_{j=1}^{l+m} \left(Z(\widehat d_i, \widehat g_i, \widehat Q_i)(O_j) - \frac{1}{l+m}\sum_{r=1}^{l+m} Z(\widehat d_i, \widehat g_i, \widehat Q_i)(O_r) \right)^2.
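
In the single-stage case (K = 1) the score reduces to the familiar AIPW form. A sketch (not polle's internal code; the helper names are purely illustrative), given the observed utility U, observed action A, policy action d = d_1(V_1), the fitted Q-value Q_d = Q_1(H_1, d_1(V_1)), and the fitted action probability g_A = g_1(H_1, A):

dr_value_score <- function(U, A, d, Q_d, g_A) {
  ## augmented inverse probability weighted (doubly robust) score
  Q_d + (A == d) / g_A * (U - Q_d)
}

## in-sample standard deviation: empirical SD of the training-block scores
sigma_hat <- function(Z_train) sd(Z_train)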

target = "subgroup":

For a subgroup average treatment effect target, where K = 1 (single stage), \mathcal{A} = \{0,1\} (binary treatment), and d_1(V_1) \in \{s_1, s_2\} (dichotomous subgroup policy), the doubly robust score is given by

Z(d, g, Q, D)(O) = \frac{I\{d_1(\cdot) = s\}}{D} \Big\{Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) \Big\},

where D = \mathbb{P}(d_1(V_1) = s) and

Z_1(a, g, Q)(O) = Q_1(H_1, a) + \frac{I\{A = a\}}{g_1(H_1, a)} \{U - Q_{1}(H_1, a)\}.

The associated one-step/estimating-equation estimator has influence function

\frac{ I\{d_1(\cdot) = s\}}{D} \Big\{Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) - E[Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) | d_1(\cdot) = s]\Big\},

which is used to estimate the standard deviation \widehat \sigma.
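
A corresponding sketch of the subgroup score (again not polle's internal code; the helper names are purely illustrative), with Z1 the augmented score Z_1(a, g, Q)(O) evaluated at a = 1 and a = 0, and D the estimated subgroup probability:

Z1 <- function(a, U, A, Q_a, g_a) {
  ## Q_a = Q_1(H_1, a); g_a = g_1(H_1, a)
  Q_a + (A == a) / g_a * (U - Q_a)
}

subgroup_score <- function(d, s, Z1_treat, Z1_control, D) {
  (d == s) / D * (Z1_treat - Z1_control)
}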

Value

policy_eval_online() returns an object of class "policy_eval_online", inheriting from "policy_eval". The object is a list containing the following elements:

coef

Numeric vector. The estimated target parameters: policy value or subgroup average treatment effect.

vcov

Numeric vector. The estimated variance (squared standard error) associated with coef.

target

Character string. The target parameter ("value" or "subgroup").

id

Character vector. The IDs of the observations.

name

Character vector. Names for each element of coef.

train_sequential_index

List of the observation indices used for training at each step.

valid_sequential_index

List of the observation indices used for validation at each step.
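
Assuming the usual "policy_eval" accessors apply to the inherited class, the main elements can be extracted as follows (reusing pe_learned from the example in the description):

coef(pe_learned)   # estimated target parameter(s)
vcov(pe_learned)   # associated variance estimate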

References

Luedtke, Alexander R., and Mark J. van der Laan. "Statistical Inference for the Mean Outcome Under a Possibly Non-Unique Optimal Treatment Strategy." Annals of Statistics 44, no. 2 (2016): 713-742. doi:10.1214/15-AOS1384.

