CVonestep: Confidence upper bound of true coverage error or threshold selection based on cross-fit one-step corrected estimator (via grid search)

View source: R/CVonestep.R


Confidence upper bound of true coverage error or threshold selection based on cross-fit one-step corrected estimator (via grid search)

Description

Method to compute a confidence upper bound of the true coverage error, or to select a threshold (via grid search), for APAC prediction sets based on cross-fit one-step corrected estimators

Usage

CVonestep(
  A,
  X,
  Y,
  scores,
  candidate.tau,
  error.bound = 0.05,
  conf.level = 0.95,
  nfolds = 5,
  g.control = list(SL.library = c("SL.glm", "SL.gam", "SL.randomForest")),
  Q.control = list(SL.library = c("SL.glm", "SL.gam", "SL.randomForest")),
  g.trunc = 0.01,
  select.tau = ifelse(length(candidate.tau) == 1, FALSE, TRUE)
)

Arguments

A

vector of population indicators: 1 for the source population, 0 for the target population

X

data frame of covariates with each row being one observation

Y

vector of dependent variable/outcome. For data from the target population (A=0), set the corresponding entries of Y to NA

scores

either a function that assigns scores to Y given X, trained on an independent dataset from the source population, or a vector of this function evaluated at the observed (X,Y), taking NA for observations from the target population. If it is a function, it must take inputs (x,y), where x is one row of X (a data frame with one row) and y is a nonmissing value of Y, and return a scalar

candidate.tau

a numeric vector of candidate thresholds, defaulting to c(scores,Inf) (after scores is evaluated at the observations, if scores is a function)

error.bound

desired bound on the prediction set coverage error, between 0 and 1; defaults to 0.05

conf.level

desired confidence level for the bound on the coverage error, between 0.5 and 1; defaults to 0.95

nfolds

number of folds for sample splitting; defaults to 5

g.control

a named list containing options passed to SuperLearner::SuperLearner to estimate the propensity score g. Must not specify Y, X, newX or family. Defaults to list(SL.library=c("SL.glm","SL.gam","SL.randomForest"))

Q.control

a named list containing options passed to SuperLearner::SuperLearner to estimate the conditional coverage error Q. Must not specify Y, X, newX or family. Defaults to list(SL.library=c("SL.glm","SL.gam","SL.randomForest"))

g.trunc

truncation level of the propensity score g away from zero; defaults to 0.01

select.tau

whether to select the threshold tau (otherwise, estimates and confidence upper bounds of the coverage error are reported for all candidate.tau); defaults to TRUE if length(candidate.tau)>1 and FALSE if length(candidate.tau)==1
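As a sketch of the required signature when scores is supplied as a function: the quadratic score below is purely hypothetical (not part of the package), but it shows the (x,y) interface and how the corresponding vector form, with NA for target-population observations, can be produced.

```r
# Hypothetical score function: squared error of y against a constant fitted
# mean of 0.5 (x is accepted but ignored here, for simplicity).
# x must be a one-row data frame; y must be a nonmissing value of Y.
scores.fun <- function(x, y) {
  (y - 0.5)^2  # returns a scalar
}

# Evaluating it row by row gives the vector form of the scores argument,
# keeping NA for target-population observations (where Y is NA):
X <- data.frame(X = c(0.1, -0.3, 1.2))
Y <- c(1, NA, 0)
scores <- mapply(
  function(i, y) if (is.na(y)) NA else scores.fun(X[i, , drop = FALSE], y),
  seq_len(nrow(X)), Y
)
```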

Value

If select.tau==FALSE, then a list with the following components:

tau

The input thresholds candidate.tau

error.CI.upper

The (approximate) confidence upper bound of coverage error corresponding to the input tau

error.est

The point estimate of coverage error corresponding to the input tau

Otherwise a list with the following components:

tau

Selected threshold tau: the maximal tau whose (approximate) confidence upper bound of coverage error is lower than error.bound

error.CI.upper

The (approximate) confidence upper bound of coverage error corresponding to the selected tau

error.est

The point estimate of coverage error corresponding to the selected tau

feasible.tau

The set of feasible thresholds tau, defined as those whose (approximate) confidence upper bounds of coverage error are lower than error.bound

feasible.tau.error.CI.upper

The (approximate) confidence upper bounds of coverage errors corresponding to feasible.tau

feasible.tau.error.est

The point estimates of coverage errors corresponding to feasible.tau
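When select.tau is TRUE, the selection rule described above amounts to the following sketch (a simplified illustration with made-up numbers, not the package's internal code):

```r
# Hypothetical grid of thresholds and their confidence upper bounds
# on the coverage error
candidate.tau <- c(0.1, 0.2, 0.3, 0.4)
error.CI.upper <- c(0.01, 0.03, 0.04, 0.08)
error.bound <- 0.05

# Feasible thresholds: confidence upper bound below the desired error bound
feasible <- error.CI.upper < error.bound
feasible.tau <- candidate.tau[feasible]

# Selected threshold: the maximal feasible tau
tau <- max(feasible.tau)
```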

Warnings/Errors due to extreme candidate thresholds

When extremely small or large thresholds are included in candidate.tau, it is common to receive warnings or errors from the machine learning algorithms used by SuperLearner::SuperLearner: in such cases, almost all Y are included in (for small thresholds) or excluded from (for large thresholds) the corresponding prediction sets, so the outcome being regressed is nearly constant. This is usually not an issue because the resulting predictions are still quite accurate. We also strongly encourage the user to specify a learner that can handle such cases (e.g., SL.glm) in Q.control.
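For instance, one way to keep a simple learner available alongside more flexible ones is to include SL.glm in the library passed via Q.control (the particular library mix below is only a suggestion):

```r
# Include SL.glm so that a simple, robust learner is available even when
# the outcome is nearly constant at extreme thresholds
Q.control <- list(SL.library = c("SL.glm", "SL.gam", "SL.randomForest"))
```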

Examples

n <- 100
expit <- function(x) 1/(1 + exp(-x))
A <- rbinom(n, 1, 0.5)                    # 1 = source, 0 = target
X <- data.frame(X = rnorm(n, sd = ifelse(A == 1, 1, 0.5)))
Y <- rbinom(n, 1, expit(1 + X$X))
Y[A == 0] <- NA                           # outcome unobserved in target population
scores <- dbinom(Y, 1, expit(0.08 + 1.1 * X$X))  # NA for target observations
candidate.tau <- seq(0, 0.5, length.out = 10)
CVonestep(A, X, Y, scores, candidate.tau, nfolds = 2,
          g.control = list(SL.library = "SL.glm"),
          Q.control = list(SL.library = "SL.glm"))

QIU-Hongxiang-David/APACpredset documentation built on Aug. 11, 2022, 12:53 p.m.