v.pudms: test a PU fit on a test data set


View source: R/v.pudms.R

Description

Test a PU fit on a test data set.

Usage

v.pudms(
  protein_dat,
  py1 = NULL,
  nhyperparam = 10,
  nfolds = 5,
  test_idx = 1:nfolds,
  seed = round(runif(1, min = 1, max = 1000)),
  order = 1,
  refstate = NULL,
  verbose = T,
  nobs_thresh = 10,
  lambda = 0,
  pvalue = FALSE,
  n_eff_prop = 1,
  intercept = F,
  maxit = 1000,
  eps = 0.001,
  inner_eps = 0.01,
  initial_coef = NULL,
  p.adjust.method = "BH",
  tol = 1e-05,
  nCores = 1,
  full.fit = FALSE,
  full.fit.pvalue = FALSE,
  outfile = NULL
)

Arguments

protein_dat

input data. A data table containing (sequence, labeled, unlabeled, seqId)

py1

a numeric value, a numeric vector, or NULL; the prevalence of positives in the unlabeled data. If length(py1) > 1, the optimal py1 is chosen based on AUC values on a test data set. If NULL (default), a sequence of py1 values of length nhyperparam, ranging from 0.001 to 0.5 and interpolated on a log scale, is considered.
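The default grid described above can be illustrated in base R. This is a minimal sketch of a log-spaced sequence between 0.001 and 0.5; the variable name py1_grid is illustrative, and the exact construction used inside pudms may differ.

```r
# Sketch of a log-spaced py1 grid (assumption: endpoints 0.001 and 0.5 are included).
nhyperparam <- 10
py1_grid <- exp(seq(log(0.001), log(0.5), length.out = nhyperparam))

# Successive values differ by a constant ratio rather than a constant step.
print(signif(py1_grid, 3))
```

A log scale places more candidate values near small prevalences, where a fixed step would leave the grid too coarse.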

nhyperparam

an integer; the length of the py1 sequence used when py1 is NULL

nfolds

the number of folds. For each fold, a fraction (nfolds - 1)/nfolds of the data is used for training, and the rest is used for testing.
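The training/test proportions can be illustrated with a generic fold assignment in base R. This is only a sketch of the idea: the row count n and the names fold_id, test_rows, and train_rows are hypothetical, and pudms may assign folds differently.

```r
set.seed(1)      # fixed seed so the split is reproducible
n <- 23          # hypothetical number of rows in the data
nfolds <- 5

# Give every row a fold label 1..nfolds, in random order.
fold_id <- sample(rep(seq_len(nfolds), length.out = n))

# Fold 1 is held out for testing; the remaining folds are used for training.
test_rows <- which(fold_id == 1)
train_rows <- which(fold_id != 1)
```

Repeating the last two lines for each index in test_idx gives one training/test split per fitted model.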

test_idx

a vector of indices of the cross-validation models to be fitted. The default is to fit a model for each cross-validation fold.

seed

a seed number for reproducibility

order

an integer; 1= main effects, 2= main effects + pairwise effects

refstate

a character used as the common reference state; the default is to use the most frequent amino acid at each position as the reference state for that position.
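The default rule, most frequent amino acid per position, can be sketched in base R as follows. The sequences are hypothetical toy data, and this is not the package's own implementation.

```r
# Hypothetical aligned sequences of equal length.
seqs <- c("ACDA", "ACDA", "GCDA", "ACEA", "ACDT")

# Character matrix: rows are sequences, columns are positions.
aa <- do.call(rbind, strsplit(seqs, ""))

# Most frequent residue in each column; ties go to the alphabetically first residue.
ref <- apply(aa, 2, function(col) names(which.max(table(col))))
print(ref)
```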

verbose

a logical value. The default is TRUE

nobs_thresh

the minimum number of mutations required per position

lambda

L1 penalty

pvalue

a logical value; if TRUE, p-values based on the asymptotic distribution are obtained

n_eff_prop

proportion of an effective sample size

intercept

a logical value; if TRUE, an estimated intercept is reported together with other coefficients

maxit

maximum number of iterations

eps

convergence threshold for the outer loop

inner_eps

convergence threshold for the inner loop

initial_coef

a vector giving the initial point from which the PUlasso algorithm starts.

p.adjust.method

method used to adjust p-values for multiple comparisons
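The default value "BH" matches the Benjamini-Hochberg option of base R's p.adjust. Assuming the argument is passed on to stats::p.adjust (an assumption, not confirmed by this page), the adjustment behaves as in the sketch below; the p-values are made up for illustration.

```r
# Hypothetical raw p-values.
p <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment, which controls the false discovery rate.
p_bh <- p.adjust(p, method = "BH")
print(p_bh)
```

Adjusted values are never smaller than the raw ones, so fewer coefficients pass a given significance cutoff.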

tol

NULL or a numeric value; if the estimated ROC curve <= y + tol, the estimated ROC curve is judged to be contained by the maximal curve. If NULL, tol is set to one standard deviation of the length(test_idx) ROC curves at each x value of the estimated ROC curve. Note that the Usage signature lists tol = 1e-05 as the default.
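The containment rule can be sketched in base R. All values below are hypothetical, the maximal curve y is taken as given, and the variable names are illustrative rather than part of pudms.

```r
# Rows = cross-validation folds, columns = a shared grid of x values (hypothetical).
roc_folds <- rbind(c(0.0, 0.50, 0.80, 1.0),
                   c(0.0, 0.60, 0.85, 1.0),
                   c(0.0, 0.55, 0.90, 1.0))

est_curve <- colMeans(roc_folds)       # estimated (average) ROC curve
max_curve <- c(0.0, 0.54, 0.86, 1.0)   # hypothetical maximal curve y

# Tolerance as one standard deviation across folds at each x value.
tol_sd <- apply(roc_folds, 2, sd)
contained_sd <- all(est_curve <= max_curve + tol_sd)

# Tolerance as a fixed numeric value.
contained_fixed <- all(est_curve <= max_curve + 1e-05)
```

With these numbers the curve counts as contained under the fold-based tolerance but not under the fixed one, which shows how much the choice of tol can matter.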

nCores

the number of threads for computing.

full.fit

a logical value; if TRUE, the model will be fitted using a full data set and at a chosen py1.

full.fit.pvalue

a logical value; if TRUE, p-values for the full fit will be returned

outfile

NULL or a string; if a string is provided, the output is written to a file with that name in the working directory.

Value

a list containing v.dmsfit (all fits from the training/test splits), roc_curves (the average ROC curve at each py1), dmsfit (the pudms.fit on the full data set at the selected py1), folds (training/test split information), py1 (the sequence of py1 values searched), and py1.opt (the py1 value selected based on the predictive performance of the models)


RomeroLab/pudms documentation built on Jan. 2, 2021, 5:10 a.m.