policy_eval_online: Online/Sequential Policy Evaluation

View source: R/policy_eval_online.R

Online/Sequential Policy Evaluation

Description

policy_eval_online() is used to estimate the value of a given fixed policy or a data-adaptive policy (e.g., a policy learned from the data). policy_eval_online() is also used to estimate the subgroup average treatment effect defined by the (learned) policy. The evaluation is based on an online/sequential validation estimation scheme, which makes the estimation approach valid even for a non-converging policy, e.g., under no heterogeneous treatment effect (exceptional law), see details.
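
As an illustration, a minimal call sketch, assuming polle's simulated single-stage data helper sim_single_stage() and that policy_def() accepts a constant action (as in the package examples):

library("polle")
d <- sim_single_stage(1000, seed = 1)
pd <- policy_data(d, action = "A", covariates = list("Z", "B", "L"), utility = "U")

## evaluate a fixed policy (always assign action 1) and a learned policy
pe_fixed   <- policy_eval_online(pd, policy = policy_def(1), M = 4)
pe_learned <- policy_eval_online(pd, policy_learn = policy_learn(type = "ql"), M = 4)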

Usage

policy_eval_online(
  policy_data,
  policy = NULL,
  policy_learn = NULL,
  g_functions = NULL,
  g_models = g_glm(),
  g_full_history = FALSE,
  save_g_functions = TRUE,
  q_functions = NULL,
  q_models = q_glm(),
  q_full_history = FALSE,
  save_q_functions = TRUE,
  c_functions = NULL,
  c_models = NULL,
  c_full_history = FALSE,
  save_c_functions = TRUE,
  m_function = NULL,
  m_model = NULL,
  m_full_history = FALSE,
  save_m_function = TRUE,
  target = "value",
  M = 4,
  train_block_size = get_n(policy_data)/5,
  name = NULL,
  min_subgroup_size = 1
)

Arguments

policy_data

Policy data object created by policy_data().

policy

Policy object created by policy_def().

policy_learn

Policy learner object created by policy_learn().

g_functions

Fitted g-model objects, see nuisance_functions. Preferably, use g_models.

g_models

List of action probability models/g-models for each stage created by g_empir(), g_glm(), g_rf(), g_sl() or similar functions. Only used for evaluation if g_functions is NULL. If a single model is provided and g_full_history is FALSE, a single g-model is fitted across all stages. If g_full_history is TRUE, the model is reused at every stage (see the sketch following the argument list).

g_full_history

If TRUE, the full history is used to fit each g-model. If FALSE, the state/Markov type history is used to fit each g-model.

save_g_functions

If TRUE, the fitted g-functions are saved.

q_functions

Fitted Q-model objects, see nuisance_functions. Only valid if the Q-functions are fitted using the same policy. Preferably, use q_models.

q_models

Outcome regression models/Q-models created by q_glm(), q_rf(), q_sl() or similar functions. Only used for evaluation if q_functions is NULL. If a single model is provided, the model is reused at every stage.

q_full_history

Similar to g_full_history.

save_q_functions

Similar to save_g_functions.

c_functions

Fitted c-model/censoring probability model objects. Preferably, use c_models.

c_models

List of right-censoring probability models, see c_model.

c_full_history

Similar to g_full_history.

save_c_functions

Similar to save_g_functions.

m_function

Fitted outcome model object for stage K+1. Preferably, use m_model.

m_model

Outcome model for the utility at stage K+1. Only used if the final utility contribution is missing or has been right-censored.

m_full_history

Similar to g_full_history.

save_m_function

Similar to save_g_functions.

target

Character string. Either "value" or "subgroup". If "value", the target parameter is the policy value. If "subgroup", the target parameter is the subgroup average treatment effect defined by the policy, see details. "subgroup" is only implemented for the doubly robust score in the single-stage case with a dichotomous action set.

M

Number of folds for online estimation/sequential validation excluding the initial training block, see details.

train_block_size

Integer. Size of the initial training block, which is only used for training the policy and nuisance models, see details.

name

Character string. Optional name used to label the estimated target parameter(s) in coef.

min_subgroup_size

Minimum number of observations in the evaluated subgroup (Only used if target = "subgroup").
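
The sketch referenced above illustrates explicit nuisance-model specification. It assumes polle's simulated two-stage data helper sim_two_stage() and its column names:

d2 <- sim_two_stage(1000, seed = 1)
pd2 <- policy_data(d2,
                   action = c("A_1", "A_2"),
                   baseline = c("B", "BB"),
                   covariates = list(L = c("L_1", "L_2"),
                                     C = c("C_1", "C_2")),
                   utility = c("U_1", "U_2", "U_3"))

pe2 <- policy_eval_online(
  pd2,
  policy_learn = policy_learn(type = "ql"),
  g_models = list(g_glm(), g_glm()),   # one g-model per stage
  g_full_history = TRUE,               # condition on the full history
  q_models = q_glm(),                  # a single Q-model, reused at every stage
  M = 4,
  train_block_size = get_n(pd2) / 5
)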

Details

Setup

Each observation has the sequential form

O = \{B, U_1, X_1, A_1, ..., U_K, X_K, A_K, U_{K+1}\},

for a possibly stochastic number of stages K.

  • B is a vector of baseline covariates.

  • U_k is the reward at stage k (not influenced by the action A_k).

  • X_k is a vector of state covariates summarizing the state at stage k.

  • A_k is the categorical action within the action set \mathcal{A} at stage k.

The utility is given by the sum of the rewards, i.e., U = \sum_{k = 1}^{K+1} U_k.

A (subgroup) policy is a set of functions

d = \{d_1, ..., d_K\},

where d_k for k\in \{1, ..., K\} maps a subset (or function) V_k of \{B, X_1, A_1, ..., A_{k-1}, X_k\} into the action set (or set of subgroups).

Recursively define the Q-models (q_models):

Q^d_K(h_K, a_K) = \mathbb{E}[U|H_K = h_K, A_K = a_K]

Q^d_k(h_k, a_k) = \mathbb{E}[Q^d_{k+1}(H_{k+1}, d_{k+1}(V_{k+1}))|H_k = h_k, A_k = a_k].

If q_full_history = TRUE, H_k = \{B, X_1, A_1, ..., A_{k-1}, X_k\}, and if q_full_history = FALSE, H_k = \{B, X_k\}.

The g-models (g_models) are defined as

g_k(h_k, a_k) = \mathbb{P}(A_k = a_k|H_k = h_k).

If g_full_history = TRUE, H_k = \{B, X_1, A_1, ..., A_{k-1}, X_k\}, and if g_full_history = FALSE, H_k = \{B, X_k\}. Furthermore, if g_full_history = FALSE and g_models is a single model, it is assumed that g_1(h_1, a_1) = ... = g_K(h_K, a_K).

Target parameters

If target = "value", policy_eval_online() returns the estimate of the value, i.e., the expected potential utility under the policy (coef):

\mathbb{E}[U^{(d)}]

If target = "subgroup", K = 1, \mathcal{A} = \{0,1\}, and d_1(V_1) \in \{s_1, s_2\}, policy_eval_online() returns the estimates of the subgroup average treatment effect (coef):

\mathbb{E}[U^{(1)} - U^{(0)} | d_1(\cdot) = s], \quad s \in \{s_1, s_2\}.
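
As a sketch, the subgroup average treatment effect defined by a learned single-stage policy might be estimated as follows (reusing pd from the example in the description; min_subgroup_size guards against very small subgroups):

pe_sub <- policy_eval_online(
  pd,
  policy_learn = policy_learn(type = "ql"),
  target = "subgroup",
  min_subgroup_size = 20
)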

Online estimation/sequential validation

Estimation of the target parameter is based on online estimation/sequential validation using the doubly robust score. The following figure illustrates online estimation using M = 5 steps and an initial training block of size train_block_size = l.

[Figure: Online estimation scheme]

Step 1:

The n observations are randomly ordered. In step 1, the first l observations \{1,...,l\}, highlighted in teal/blue in the figure, are used to fit the Q-models, the g-models, the policy (if using the policy_learn argument), and any other required models. We denote the collection of these fitted models by P. The remaining observations are split into M blocks of size m = (n-l)/M, which for simplicity we assume to be a whole number. In step 1, the target parameter is estimated using the associated doubly robust score Z(P) evaluated on the first validation fold \{l+1,...,l+m\}, highlighted in pink:

\frac{\sum_{i = l+1}^{l+m} {\widehat \sigma_{i}^{-1}} Z(\widehat P_i)(O_i)} {\sum_{i = l+1}^{l+m} \widehat \sigma_{i}^{-1}},

where \widehat P_i for i \in \{l+1,...,l+m\} refers to the fitted models trained on \{1,...,l\}, and \widehat \sigma_i is the in-sample estimate of the standard deviation based on the training observations \{1,...,l\}. An exact expression for \widehat \sigma_i is given below for each target parameter. Note that \widehat \sigma_i is constant for i \in \{l+1,...,l+m\}, but it is convenient to keep the observation index on \widehat \sigma.

Step 2 to M:

In step 2, the observations with indices \{1,...,l+m\} are used to fit the model collection P and to compute the in-sample estimate of the standard deviation; for i \in \{l+m+1,...,l+2m\} these are denoted \widehat P_i and \widehat \sigma_i. This sequential model fitting is repeated for all M steps, and the resulting online estimator is given by

\frac{\sum_{i = l+1}^{n} {\widehat \sigma_{i}^{-1}} Z(\widehat P_i)(O_i)} {\sum_{i = l+1}^n \widehat \sigma_{i}^{-1}},

with an associated standard error estimate given by

\frac{\left(\frac{1}{n-l}\sum_{i = l+1}^n \widehat \sigma_{i}^{-1}\right)^{-1}}{\sqrt{n-l}}.
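
The online estimator and its standard error are weighted averages of the validation scores. A minimal sketch (not polle's internal code; the helper name is purely illustrative), given the per-observation scores Z and the corresponding in-sample standard deviations sigma for i = l+1, ..., n:

online_estimate <- function(Z, sigma) {
  w <- 1 / sigma                              # inverse standard deviation weights
  estimate <- sum(w * Z) / sum(w)             # weighted online estimator
  std.err  <- 1 / (mean(w) * sqrt(length(Z))) # standard error from the display above
  c(estimate = estimate, std.err = std.err)
}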

Doubly robust scores

target = "value":

For a policy value target the doubly robust score is given by

Z(d, g, Q^d)(O) = Q^d_1(H_1, d_1(V_1)) + \sum_{r = 1}^K \prod_{j = 1}^{r} \frac{I\{A_j = d_j(V_j)\}}{g_{j}(H_j, A_j)} \{Q^d_{r+1}(H_{r+1}, d_{r+1}(V_{r+1})) - Q^d_{r}(H_r, d_r(V_r))\},

using the convention Q^d_{K+1}(H_{K+1}, d_{K+1}(V_{K+1})) = U.

The influence function (curve) of the associated one-step estimator is

Z(d, g, Q^d)(O) - \mathbb{E}[Z(d,g, Q^d)(O)],

which is used to estimate the in-sample standard deviation. For example, in step 2, i.e., for i \in \{l+m+1,...,l+2m\}:

\widehat \sigma_i^2 = \frac{1}{l+m}\sum_{j=1}^{l+m} \left(Z(\widehat d_i, \widehat g_i, \widehat Q_i)(O_j) - \frac{1}{l+m}\sum_{r=1}^{l+m} Z(\widehat d_i, \widehat g_i, \widehat Q_i)(O_r) \right)^2.
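
In the single-stage case (K = 1) the score reduces to the familiar AIPW form. A sketch (not polle's internal code; the helper names are purely illustrative), given the observed utility U, observed action A, policy action d = d_1(V_1), the fitted Q-value Q_d = Q_1(H_1, d_1(V_1)), and the fitted action probability g_A = g_1(H_1, A):

dr_value_score <- function(U, A, d, Q_d, g_A) {
  ## augmented inverse probability weighted (doubly robust) score
  Q_d + (A == d) / g_A * (U - Q_d)
}

## in-sample standard deviation: empirical SD of the training-block scores
sigma_hat <- function(Z_train) sd(Z_train)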

target = "subgroup":

For a subgroup average treatment effect target, where K = 1 (single stage), \mathcal{A} = \{0,1\} (binary treatment), and d_1(V_1) \in \{s_1, s_2\} (dichotomous subgroup policy), the doubly robust score is given by

Z(d, g, Q, D)(O) = \frac{I\{d_1(\cdot) = s\}}{D} \Big\{Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) \Big\},

where D = \mathbb{P}(d_1(V_1) = s) and

Z_1(a, g, Q)(O) = Q_1(H_1, a) + \frac{I\{A = a\}}{g_1(H_1, a)} \{U - Q_{1}(H_1, a)\}.

The associated one-step/estimating-equation estimator has influence function

\frac{ I\{d_1(\cdot) = s\}}{D} \Big\{Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) - E[Z_1(1,g,Q)(O) - Z_1(0,g,Q)(O) | d_1(\cdot) = s]\Big\},

which is used to estimate the standard deviation \widehat \sigma.
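
A corresponding sketch of the subgroup score (again not polle's internal code; the helper names are purely illustrative), with Z1 the augmented score Z_1(a, g, Q)(O) evaluated at a = 1 and a = 0, and D the estimated subgroup probability:

Z1 <- function(a, U, A, Q_a, g_a) {
  ## Q_a = Q_1(H_1, a); g_a = g_1(H_1, a)
  Q_a + (A == a) / g_a * (U - Q_a)
}

subgroup_score <- function(d, s, Z1_treat, Z1_control, D) {
  (d == s) / D * (Z1_treat - Z1_control)
}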

Value

policy_eval_online() returns an object of class "policy_eval_online", inheriting from "policy_eval". The object is a list containing the following elements:

coef

Numeric vector. The estimated target parameters: policy value or subgroup average treatment effect.

vcov

Numeric vector. The estimated variance (squared standard error) associated with coef.

target

Character string. The target parameter ("value" or "subgroup").

id

Character vector. The IDs of the observations.

name

Character vector. Names for each element of coef.

train_sequential_index

List of the observation indices used for training at each step.

valid_sequential_index

List of the observation indices used for validation at each step.
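
Assuming the usual "policy_eval" accessors apply to the inherited class, the main elements can be extracted as follows (reusing pe_learned from the example in the description):

coef(pe_learned)   # estimated target parameter(s)
vcov(pe_learned)   # associated variance estimate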

References

Luedtke, Alexander R., and Mark J. van der Laan. "Statistical Inference for the Mean Outcome Under a Possibly Non-Unique Optimal Treatment Strategy." Annals of Statistics 44, no. 2 (2016): 713-742. doi:10.1214/15-AOS1384.

