RS: Confidence upper bound of true coverage error or threshold...

RS R Documentation

Confidence upper bound of true coverage error or threshold selection based on Rejection Sampling & Binomial Proportion confidence upper bound

Description

Method to compute a confidence upper bound of the true coverage error, or to select a threshold, for APAC prediction sets, based on Rejection Sampling & Binomial Proportion confidence upper bound
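
The general idea named in the description can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the likelihood ratio is taken as known in closed form (RS estimates nuisance functions with SuperLearner instead), the prediction set is assumed to keep values of Y whose score is at least tau, and the Clopper-Pearson bound from stats::binom.test is used, which may differ from the interval RS computes. All variable names are illustrative.

```r
set.seed(1)
n <- 2000
# Source covariate ~ N(0, 1); the (hypothetical) target covariate is N(0, 0.5^2)
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1 + x))

# Assumed-known likelihood ratio target/source, equal to 2 * exp(-1.5 * x^2)
lr <- dnorm(x, sd = 0.5) / dnorm(x, sd = 1)
B <- 2  # valid upper bound on lr, attained at x = 0

# Rejection sampling: keep each source observation with probability lr / B,
# so the accepted observations behave like draws from the target population
accept <- runif(n) < lr / B

# Coverage error at threshold tau among accepted observations
score <- dbinom(y, 1, plogis(1 + x))
tau <- 0.2
err <- score[accept] < tau

# One-sided 95% Clopper-Pearson upper bound on the error proportion
upper <- binom.test(sum(err), length(err),
                    alternative = "less", conf.level = 0.95)$conf.int[2]
c(error.est = mean(err), error.CI.upper = upper)
```

A looser bound B lowers the acceptance rate, leaving fewer accepted observations and a wider binomial bound.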

Usage

RS(
  A,
  X,
  Y,
  scores,
  candidate.tau,
  LR.bound = NULL,
  error.bound = 0.05,
  conf.level = 0.95,
  train.prop = 0.5,
  g.control = list(SL.library = c("SL.glm", "SL.gam", "SL.randomForest")),
  Q.control = list(SL.library = c("SL.glm", "SL.gam", "SL.randomForest")),
  g.trunc = 0.01,
  select.tau = ifelse(length(candidate.tau) == 1, FALSE, TRUE)
)

Arguments

A

vector of population indicators: 1 for source population, 0 for target population

X

data frame of covariates with each row being one observation

Y

vector of dependent variable/outcome. For data from the target population (A=0), set the corresponding entries of Y to be NA

scores

either a function assigning scores to Y given X, trained using an independent dataset from the source population, or a vector of this function evaluated at the observed (X,Y), taking NA for observations from the target population. If it is a function, it must take input (x,y), where x is one row of X (a data frame with one row) and y is a nonmissing value of Y, and output a scalar

candidate.tau

a numeric vector of candidate thresholds, default to c(scores,Inf) (after scores is evaluated at observations if scores is a function). If candidate.tau has length 1, then only the point estimate and confidence upper bound of the true coverage error of this threshold are computed.

LR.bound

known upper bound on the likelihood ratio between the target population and the source population. As long as LR.bound is a valid upper bound, smaller values lead to better performance. If NULL, an ad hoc choice is used: the maximum value of the estimated likelihood ratio at observations in the testing data. Default to NULL

error.bound

desired bound on the prediction set coverage error between 0 and 1, default 0.05

conf.level

desired level of confidence of low coverage error between 0.5 and 1, default to 0.95

train.prop

proportion of training data used to estimate nuisance functions, default to 0.5

g.control

a named list containing options passed to SuperLearner::SuperLearner to estimate propensity score g. Must not specify Y, X, newX or family. Default to list(SL.library=c("SL.glm","SL.gam","SL.randomForest"))

Q.control

a named list containing options passed to SuperLearner::SuperLearner to estimate conditional coverage error Q. Must not specify Y, X, newX or family. Default to list(SL.library=c("SL.glm","SL.gam","SL.randomForest"))

g.trunc

truncation level of propensity score g from zero, default to 0.01

select.tau

whether to select threshold tau (otherwise just report estimates and confidence upper bounds of coverage error for all candidate.tau), default to TRUE if length(candidate.tau)>1 and FALSE if length(candidate.tau)==1
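
The two accepted forms of the scores argument can be illustrated with a small sketch. The function score.fun and the toy data below are hypothetical; the sketch only shows the input/output contract described above (x is a one-row data frame, y is a nonmissing outcome, the result is a scalar) and how the vector form with NA for the target population could be built from it.

```r
set.seed(2)
n <- 6
A <- c(1, 1, 1, 0, 0, 0)                      # 1 = source, 0 = target
X <- data.frame(X = rnorm(n))
Y <- ifelse(A == 1, rbinom(n, 1, 0.5), NA)    # outcome is NA in the target population

# Hypothetical score function: x is a one-row data frame, y a nonmissing outcome
score.fun <- function(x, y) dbinom(y, 1, plogis(0.08 + 1.1 * x$X))

# Equivalent vector form: evaluate at observed (X, Y), NA for target observations
scores.vec <- vapply(seq_len(n), function(i) {
  if (A[i] == 1) score.fun(X[i, , drop = FALSE], Y[i]) else NA_real_
}, numeric(1))
scores.vec
```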

Value

If select.tau==FALSE, then a list with the following components:

tau

Input tau

error.CI.upper

The (approximate) confidence upper bound of coverage error corresponding to the input tau

error.est

The point estimate of coverage error corresponding to the input tau

Otherwise a list with the following components:

tau

Selected threshold tau, the maximal tau with (approximate) confidence upper bound of coverage error lower than error.bound

error.CI.upper

The (approximate) confidence upper bound of coverage error corresponding to the selected tau

error.est

The point estimate of coverage error corresponding to the selected tau

feasible.tau

The set of feasible thresholds tau defined by (approximate) confidence upper bounds of coverage errors being lower than error.bound

feasible.tau.error.CI.upper

The (approximate) confidence upper bounds of coverage errors corresponding to feasible.tau

feasible.tau.error.est

The point estimates of coverage errors corresponding to feasible.tau
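
The relationship between the returned components when select.tau is TRUE can be sketched as follows. The numbers are made up for illustration and the variable names simply mirror the component names above; this is not the package's code.

```r
candidate.tau  <- c(0.1, 0.2, 0.3, 0.4)
error.CI.upper <- c(0.01, 0.03, 0.045, 0.08)  # hypothetical upper bounds, one per threshold
error.bound    <- 0.05

# Feasible thresholds: confidence upper bound of coverage error lower than error.bound
feasible <- error.CI.upper < error.bound
feasible.tau <- candidate.tau[feasible]
feasible.tau.error.CI.upper <- error.CI.upper[feasible]

# Selected threshold: the maximal feasible tau
tau <- max(feasible.tau)
tau
```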

Warnings/Errors due to extreme candidate thresholds

When extremely small/large thresholds are included in candidate.tau, it is common to receive warnings/errors from the machine learning algorithms used by SuperLearner::SuperLearner. In such cases, almost all Y are included in (for small thresholds) or excluded from (for large thresholds) the corresponding prediction sets, so the coverage error indicator is nearly constant and the algorithms complain. This is usually not an issue because the resulting predictions are still quite accurate.
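
If these warnings are a nuisance, one possible workaround (a suggestion, not something the package requires) is to restrict candidate.tau to interior quantiles of the observed scores so that extreme thresholds are never evaluated. The scores below are made-up values and the choice of quantile levels is arbitrary.

```r
# Hypothetical scores, with NA for observations from the target population
scores <- c(0.12, 0.35, NA, 0.58, 0.71, NA, 0.44, 0.90)

# Candidate thresholds at the 10%, 30%, ..., 90% quantiles of the observed scores
candidate.tau <- unique(quantile(scores, probs = seq(0.1, 0.9, by = 0.2),
                                 na.rm = TRUE, names = FALSE))
candidate.tau
```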

Examples

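library(APACpredset)

n<-100
expit<-function(x) 1/(1+exp(-x))
# population indicator: 1 = source, 0 = target
A<-rbinom(n,1,.5)
# covariate with standard deviation 1 in the source and 0.5 in the target population
X<-data.frame(X=rnorm(n,sd=ifelse(A==1,1,.5)))
# binary outcome
Y<-rbinom(n,1,expit(1+X$X))
# scores from a working model, evaluated at the observed (X,Y)
scores<-dbinom(Y,1,expit(.08+1.1*X$X))
candidate.tau<-seq(0,.5,length.out=10)
LR.bound<-4
# use only SL.glm for the nuisance functions to keep the example fast
RS(A,X,Y,scores,candidate.tau,LR.bound,
   g.control=list(SL.library="SL.glm"),
   Q.control=list(SL.library="SL.glm"))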
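
The choice LR.bound<-4 in this example can be checked directly, assuming the likelihood ratio refers to the ratio of covariate densities (target over source). The covariate is N(0,1) in the source population and N(0,0.5^2) in the target population, so the ratio is 2*exp(-1.5*x^2). The sketch below, with the illustrative helper lr, locates its maximum numerically.

```r
# Likelihood ratio of the covariate: target density over source density
lr <- function(x) dnorm(x, sd = 0.5) / dnorm(x, sd = 1)

# Maximize over a wide interval; the ratio is unimodal with its peak at x = 0
lr.max <- optimize(lr, interval = c(-5, 5), maximum = TRUE)$objective
lr.max
```

The maximum is 2 (up to numerical tolerance), so 4 is a valid but conservative bound for this simulation; a bound closer to 2 would still be valid.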

QIU-Hongxiang-David/APACpredset documentation built on Aug. 11, 2022, 12:53 p.m.