Fit models and make predictions with a PCA-LR classifier

Description

These functions are used to apply the generic cross-validation mechanism to a classifier that combines principal component analysis (PCA) with logistic regression (LR).

Usage

learnRPART(data, status, params, pfun)
predictRPART(newdata, details, status, ...)

Arguments

data

The data matrix, with rows as features ("genes") and columns as the samples to be classified.

status

A factor, with two levels, classifying the samples. The length must equal the number of data columns.

params

A list of additional parameters used by the classifier; see Details.

pfun

The function used to make predictions on new data, using the cross-validated classifier. Should always be set to predictRPART.

newdata

Another data matrix, with the same number of rows as data.

details

A list of additional parameters describing details about the particular classifier; see Details.

...

Optional extra parameters required by the generic "predict" method.

Details

The input arguments to both learnRPART and predictRPART are dictated by the requirements of the general cross-validation mechanism provided by the Modeler-class.

The RPART classifier is similar in spirit to the "supervised principal components" method implemented in the superpc package. We start by performing univariate two-sample t-tests to identify features that are differentially expressed between the two groups of training samples. We then select features by applying a bound (alpha) on the false discovery rate (FDR). If the number of selected features is smaller than a prespecified goal (minNgenes), then we increase the FDR until we get the desired number of features. Next, we perform PCA on the selected features from the training data. We retain enough principal components (PCs) to explain a prespecified fraction of the variance (perVar). We then fit a logistic regression model using these PCs to predict the binary class of the training data. In order to use this model to make binary predictions, you must specify a prior probability (prior) that a sample belongs to the first of the two groups (where the ordering is determined by the levels of the classification factor, status).
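The training steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's own implementation: it assumes Benjamini-Hochberg adjustment (p.adjust) as the FDR method and prcomp as the PCA routine, and the toy data and all variable names are hypothetical.

```r
set.seed(1)
# Hypothetical toy data: 200 features (rows) by 60 samples (columns).
data <- matrix(rnorm(200 * 60), ncol = 60)
status <- factor(rep(c("A", "B"), each = 30))
# Shift the first 15 features in group B so that some features truly differ.
data[1:15, status == "B"] <- data[1:15, status == "B"] + 1

alpha <- 0.10      # bound on the FDR
minNgenes <- 10    # minimum number of features to keep
perVar <- 0.80     # fraction of variance the retained PCs must explain

# Step 1: univariate two-sample t-test for each feature (row).
pvals <- apply(data, 1, function(x) {
  t.test(x[status == "A"], x[status == "B"])$p.value
})

# Step 2: select features at FDR <= alpha (BH adjustment is an assumption).
qvals <- p.adjust(pvals, method = "BH")
sel <- qvals <= alpha
if (sum(sel) < minNgenes) {
  # Too few features: raise the FDR cutoff until minNgenes features pass.
  cutoff <- sort(qvals)[minNgenes]
  sel <- qvals <= cutoff
}

# Step 3: PCA on the selected features, with samples as rows.
pca <- prcomp(t(data[sel, , drop = FALSE]))
varExplained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
nComp <- which(varExplained >= perVar)[1]

# Step 4: logistic regression of class on the retained PCs.
# glm may warn about fitted probabilities of 0 or 1 if the toy
# classes happen to separate perfectly.
scores <- data.frame(pca$x[, seq_len(nComp), drop = FALSE], y = status)
fit <- glm(y ~ ., data = scores, family = binomial)
```

With a factor response, glm models the probability of the second level of status, so the level ordering determines how fitted probabilities are read.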

In order to fit the model to data, the params argument to the learnRPART function should be a list containing components named alpha, minNgenes, perVar, and prior. It may also contain a logical value called verbose, which controls the amount of information that is output as the algorithm runs.

The result of fitting the model using learnRPART is a member of the FittedModel-class. In addition to storing the prediction function (pfun) and the training data and status, the FittedModel stores those details about the model that are required in order to make predictions of the outcome on new data. In this case, the details are: the prior probability, the set of selected features (sel, a logical vector), the principal component decomposition (spca, an object of the SamplePCA class), the logistic regression model (mmod, of class glm), the number of PCs used (nCompUsed), the number of components available (nCompAvail), and the number of gene-features selected (nGenesSelected). The details object is appropriate for sending as the second argument to the predictRPART function in order to make predictions with the model on new data. Note that the status vector here is the one used for the training data, since the prediction function only uses the levels of this factor to make sure that the direction of the predictions is interpreted correctly.
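To illustrate how a details list of this shape could drive prediction, here is a hedged base R sketch. It is not the code of predictRPART: prcomp stands in for the SamplePCA class, the selected features and the number of PCs are fixed by hand, and the rule for applying the prior (call the second level when its fitted probability exceeds 1 - prior) is an assumption made for illustration.

```r
set.seed(2)
# Hypothetical training data: 50 features by 20 samples.
train <- matrix(rnorm(50 * 20), ncol = 20)
status <- factor(rep(c("A", "B"), each = 10))

# Pretend the first 10 features were selected and 2 PCs were retained.
sel <- rep(c(TRUE, FALSE), c(10, 40))
nCompUsed <- 2
spca <- prcomp(t(train[sel, , drop = FALSE]))   # stand-in for SamplePCA
scores <- data.frame(spca$x[, seq_len(nCompUsed), drop = FALSE], y = status)
mmod <- glm(y ~ ., data = scores, family = binomial)

# A list mirroring the components named in the text above.
details <- list(prior = 0.5, sel = sel, spca = spca, mmod = mmod,
                nCompUsed = nCompUsed, nCompAvail = length(spca$sdev),
                nGenesSelected = sum(sel))

# New data must have the same number of rows (features) as the training data.
newdata <- matrix(rnorm(50 * 30), ncol = 30)

# Project the new samples onto the training PCs.
newScores <- predict(details$spca,
                     newdata = t(newdata[details$sel, , drop = FALSE]))
newScores <- as.data.frame(newScores[, seq_len(details$nCompUsed), drop = FALSE])

# Fitted probability of the SECOND level of status.
pSecond <- predict(details$mmod, newdata = newScores, type = "response")

# Assumed decision rule: prior is the prior probability of the first level.
lev <- levels(status)
pred <- factor(ifelse(pSecond > 1 - details$prior, lev[2], lev[1]), levels = lev)
```

Only levels(status) is used at prediction time, which is why the training status vector is the one passed along.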

Value

The learnRPART function returns an object of the FittedModel-class, representing an RPART classifier that has been fitted on a training data set.

The predictRPART function returns a factor containing the predictions of the model when applied to the new data set.

Author(s)

Kevin R. Coombes <krc@silicovore.com>

See Also

See Modeler-class and Modeler for details about how to perform cross-validation. See FittedModel-class and FittedModel for details about the structure of the object returned by learnRPART.

Examples

# simulate some data
data <- matrix(rnorm(100*20), ncol=20)
status <- factor(rep(c("A", "B"), each=10))

# set up the parameter list
rpart.params <- list(minNgenes=10, alpha=0.10, perVar=0.80, prior=0.5)

# learn the model
fm <- learnRPART(data, status, rpart.params, predictRPART)

# Make predictions on some new simulated data
newdata <- matrix(rnorm(100*30), ncol=30)
predictRPART(newdata, fm@details, status)