ZIPDA.cv: Function to do cross-validation for zero-inflated Poisson...


View source: R/ZIPDA.cv.R

Description

Performs cross-validation for the function that implements the "sparse zero-inflated Poisson linear discriminant analysis classifier", which is similar to linear discriminant analysis but assumes a zero-inflated Poisson model rather than a Gaussian model for the data. The classifier soft-thresholds the estimated effect of each feature in order to achieve sparsity. This cross-validation function selects the value of the tuning parameter that controls the level of soft-thresholding.
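For orientation, the zero-inflated Poisson distribution mixes a point mass at zero with an ordinary Poisson distribution. The sketch below writes out its probability mass function in base R. The function name dzip and the parameter names pi0 and lambda are illustrative assumptions and are not part of this package.

```r
# Zero-inflated Poisson probability mass function (illustrative sketch).
# pi0:    probability of a structural zero (assumed name)
# lambda: mean of the Poisson component (assumed name)
dzip <- function(k, lambda, pi0) {
  ifelse(k == 0,
         pi0 + (1 - pi0) * dpois(0, lambda),  # zero from either component
         (1 - pi0) * dpois(k, lambda))        # positive counts come from the Poisson part
}

# Zeros are more likely under zero inflation than under a plain Poisson model
p_zero_zip  <- dzip(0, lambda = 3, pi0 = 0.4)
p_zero_pois <- dpois(0, lambda = 3)

# Sanity check: the probabilities sum to 1 (tail beyond 100 is negligible here)
total <- sum(dzip(0:100, lambda = 3, pi0 = 0.4))
```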

Usage

ZIPDA.cv(x, y, rhos = NULL, beta = 1, nfolds = 5, prob0 = NULL,
  type = c("mle", "deseq", "quantile"), folds = NULL, transform = TRUE,
  alpha = NULL, prior = NULL)

Arguments

x

An n-by-p training data matrix: n observations and p features.

y

A numeric vector of class labels of length n: 1, 2, ..., K if there are K classes. Each element of y corresponds to a row of x; i.e. these are the class labels for the observations in x.

rhos

A vector of tuning parameters to try out in cross-validation. Rho controls the level of shrinkage performed, i.e. the number of features that are not involved in the classifier. When rho=0 then all features are involved in the classifier, and when rho is very large no features are involved. If rhos=NULL then a vector of rho values will be chosen automatically.
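The effect of rho can be illustrated with the generic soft-thresholding operator, which shrinks each value toward zero by rho and sets it to exactly zero once its magnitude falls below rho. This is a minimal sketch with made-up effect values; the exact quantity that ZIPDA.cv thresholds is not shown on this page, and soft_threshold is a hypothetical helper.

```r
# Generic soft-thresholding operator (hypothetical helper, not a package function)
soft_threshold <- function(a, rho) sign(a) * pmax(abs(a) - rho, 0)

# Made-up feature effects, for illustration only
effects <- c(-1.5, -0.2, 0.1, 0.4, 2.0)

# Count the features that remain non-zero for each candidate rho
rhos <- c(0, 0.5, 3)
nnonzero <- sapply(rhos, function(rho) sum(soft_threshold(effects, rho) != 0))
```

With rho = 0 every feature is kept, and with a sufficiently large rho none are.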

beta

A smoothing term. A Gamma(beta,beta) prior is used to fit the zero-inflated Poisson model. The recommendation is to leave it at 1, the default value.

nfolds

The number of folds in the cross-validation; default is 5-fold cross-validation.

prob0

The probability that a read is 0. In the example below, it is supplied as the transposed output of estimatep.

type

How should the observations be normalized within the zero-inflated Poisson model, i.e. how should the size factors be estimated? Options are "quantile" or "deseq" (more robust) or "mle" (less robust). In greater detail: "quantile" is the quantile normalization approach of Bullard et al. 2010, BMC Bioinformatics; "deseq" is the median of the ratio of an observation to a pseudoreference obtained by taking the geometric mean, described in Anders and Huber 2010, Genome Biology, and implemented in the Bioconductor package "DESeq"; and "mle" is the sum of counts for each sample, which is the maximum likelihood estimate under a simple Poisson model.
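The three normalization options can be sketched in base R on a toy count matrix. This is an illustration of the ideas described above, not the package's internal code. The data are made up, the scaling of the "mle" factors to sum to 1 is an assumption, and the "quantile" option is assumed here to mean the upper-quartile (75th percentile) variant.

```r
# Toy count matrix: 4 observations (rows) by 5 features (columns); values are made up
x <- matrix(c(10, 0,  3,  8,  5,
              20, 1,  6, 15,  9,
               5, 0,  2,  4,  3,
              40, 2, 11, 30, 22), nrow = 4, byrow = TRUE)

# "mle": total count per observation (scaled to sum to 1; the scaling is an assumption)
sf_mle <- rowSums(x) / sum(x)

# "deseq": median ratio of each observation to a geometric-mean pseudoreference.
# Features containing a zero have a geometric mean of 0 and are dropped.
log_geo_mean <- colMeans(log(x))
keep <- is.finite(log_geo_mean)
sf_deseq <- apply(x[, keep, drop = FALSE], 1, function(row) {
  exp(median(log(row) - log_geo_mean[keep]))
})

# "quantile": 75th percentile of the counts in each observation (assumed variant)
sf_quantile <- apply(x, 1, quantile, probs = 0.75)
```

In this toy matrix the fourth observation has the deepest coverage, so all three estimates assign it the largest size factor.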

folds

Instead of specifying the number of folds in cross-validation, one can explicitly specify the folds. To do this, input a list of length r (to perform r-fold cross-validation). The kth element of the list should be a vector containing the indices of the test observations in the kth fold.
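A folds list of the shape described above can be built in base R as follows. This is a minimal sketch; n, r and fold_id are illustrative names, and the random assignment is one of several reasonable ways to form folds.

```r
set.seed(1)
n <- 20  # number of observations (rows of x); illustrative value
r <- 5   # number of folds; illustrative value

# Randomly assign each observation to one of r folds of near-equal size
fold_id <- sample(rep(seq_len(r), length.out = n))

# The kth list element holds the indices of the test observations in fold k
folds <- lapply(seq_len(r), function(k) which(fold_id == k))
```

The resulting list could then be passed as the folds argument in place of nfolds.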

transform

Should the data matrices x and xte first be power transformed so that they more closely fit the zero-inflated Poisson model? TRUE or FALSE. Power transformation is especially useful if the data are overdispersed relative to the zero-inflated Poisson model.

alpha

If transform=TRUE, this determines the power to which the data matrices x and xte are transformed. If alpha=NULL then the transformation that makes the zero-inflated Poisson model best fit the data matrix x is computed. (Note that alpha is computed based on x, not based on xte.) Alternatively, the user can supply a value of alpha with 0<alpha<=1.
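The effect of a power transformation on overdispersion can be seen on simulated counts. The sketch below uses a negative binomial sample and a fixed alpha of 0.5 chosen only for illustration; how ZIPDA.cv selects alpha when alpha=NULL is not shown here, and the variance-to-mean ratio is used merely as a rough dispersion measure.

```r
set.seed(1)

# Overdispersed counts: negative binomial with mean 10 and variance well above 10
counts <- rnbinom(1000, mu = 10, size = 2)

# Illustrative power; any value with 0 < alpha <= 1 is allowed
alpha <- 0.5
transformed <- counts^alpha

# Variance-to-mean ratio before and after the transformation
# (a Poisson variable has a ratio of 1)
ratio_before <- var(counts) / mean(counts)
ratio_after  <- var(transformed) / mean(transformed)
```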

prior

Vector of length equal to the number of classes, representing prior probabilities for each class. If NULL then uniform priors are used (i.e. each class is equally likely).

Value

A list with the following elements:

errs

A matrix of dimension (number of folds)-by-(length of rhos).

bestrho

The tuning parameter value resulting in the lowest overall cross-validation error rate.

rhos

The vector of rho values used in cross-validation.

nnonzero

A matrix of dimension (number of folds)-by-(length of rhos).

folds

The cross-validation folds used.

alpha

The power transformation used (if transform=TRUE).
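The relationship between errs, rhos and bestrho can be sketched as follows. This assumes that each entry of errs is the number of misclassified test observations in one fold for one rho value, which this page does not state explicitly. The numbers are made up, and the function's own tie-breaking rule may differ from which.min.

```r
# Candidate tuning parameters (made-up values)
rhos <- c(0, 1, 2, 4)

# Made-up error counts: one row per fold, one column per rho
errs <- matrix(c(3, 2, 1, 4,
                 2, 2, 0, 5,
                 4, 1, 1, 3), nrow = 3, byrow = TRUE)

# Overall error for each rho, summed across folds
total_errs <- colSums(errs)

# The rho with the lowest overall cross-validation error
bestrho <- rhos[which.min(total_errs)]
```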

Examples

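library(SummarizedExperiment)
# Simulate training and test count data
dat <- newCountDataSet(n = 40, p = 500, K = 4, param = 10, sdsignal = 0.1, drate = 0.4)
x <- t(assay(dat$sim_train_data))
y <- as.numeric(colnames(dat$sim_train_data))
xte <- t(assay(dat$sim_test_data))
# Zero probabilities evaluated on the training data (used in cross-validation)
prob <- estimatep(x = x, y = y, xte = x, beta = 1, type = "mle", prior = NULL)
# Zero probabilities evaluated on the test data (used for the final classifier)
prob0 <- estimatep(x = x, y = y, xte = xte, beta = 1, type = "mle", prior = NULL)
# Select rho by cross-validation, then classify the test data with the selected value
cv.out <- ZIPDA.cv(x = x, y = y, prob0 = t(prob))
out <- ZIPLDA(x = x, y = y, xte = xte, rho = cv.out$bestrho, prob0 = t(prob0))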

zhangli1109/ENTC documentation built on Nov. 10, 2020, 11:16 p.m.