LPStraining: LPS training

View source: R/LPStraining.R

LPStrainingR Documentation

LPS training

Description

This is the wrap up function to select top features, estimate parameters, and calculate LPS (Linear Predictor Score) scores based on a given training data set.

Usage

LPStraining(
  trainDat,
  standardization = FALSE,
  selectedTraits = NULL,
  groupInfo,
  refGroup = NULL,
  topN = NULL,
  FDRcut = NULL,
  weightMethod = c("ttest", "limma", "PearsonR", "SpearmanR", "MannWhitneyU"),
  classProbCut = 0.9,
  imputeNA = FALSE,
  byrow = TRUE,
  imputeValue = c("median", "mean")
)

Arguments

trainDat

a training data set that is in data matrix or a data frame, samples are in columns, and features/traits are in rows

standardization

a logic variable to indicate if standardization is needed before classification score calculation

selectedTraits

a selected feature/trait list if available

groupInfo

a known group classification, which order should be in the same as in columns of trainDat

refGroup

the code for reference group, default is the 1st item in groupInfo

topN

an integer to indicate how many top features to be selected

FDRcut

a FDR cutoff to select top features, which is only valid when topN is set as defaul NULL, all features will be returned if both topN and FDRcut are set as default NULL

weightMethod

a string to indicate a weight calculation method or a significant variable selection method, there are five choices: "limma" for limma linear model based t value,"ttest" for t test based t value, "MannWhitneyU" for Mann Whitney U based rank-biserial,"PearsonR" for Pearson correlation coefficient, "SpearmanR" for Spearman correlation coefficient, and the defualt value is "ttest"

classProbCut

a numeric variable within (0,1), which is a cutoff of Empirical Bayesian probability, often used values are 0.8 and 0.9, default value is 0.9. The same classProbCut is used for both groups, the samples that are not included in either group will be assigned as UNCLASS

imputeNA

a logic variable to indicate if NA imputation is needed, if it is TRUE, NA imputation is processed before any other steps, the default is FALSE

byrow

a logic variable to indicate direction for imputation, default is TRUE, which will use the row data for imputation

imputeValue

a character variable to indicate which value to be used to replace NA, default is "median", the median value of the chose direction with "byrow" data to be used

Details

LPS calculation is based on Wright 2003. The formula is straightforward: LPS(X) = \sum a_j x_ij Here a_j represents the jth selected feature weights, and x_ij is the corresponding feature value for the ith sample. Before we start LPS training, we provide options for NA imputation and standardization. There are three steps for LPS training: a) apply "getTrainingWeights" to select features and return weights for these features; b) use "apply" function to get LPS classification scores and Empirical Bayes' probabilites for all samples; When a Empirical Bayesian probability is calculated, by default, the 1st group in the input mean and sd vectors is treated as the test group. When we calculate the probabilities, we first calcualte probability that a sample belongs to either group, and then use the following formula to get Empirical Bayesian probability: prob(x) = d_test(x)/(d_test(x) + d_ref(x)) Here prob(x) is the Empirical Bayesian probability of a given sample, d_test(x) is the density value assuming that a given sample belongs to the test group, d_ref(x) is the density value assuming that a given sample belongs to the reference group. In the current LPStraining function, however, we calculate Empirical Bayesian probabilities for both directions. c) This wrap-up function also gives classification for the training group and confusion matrix to compare LPS classification with original group info for training data set. If NAs are not imputed, they are ignored for feature selection, weight calculation, LPS parameter estimation, and LPS calculation.

Value

A list with four items is returned: LPS parameters for selected features, LPS scores and classifications for training samples, confusion matrix to compare classification based on LPS scores and original classification, and a simple classification comparison table.

LPS_pars

a list of 2 items, the 1st item is a data frame with weights and group comparison results of each selected features for LPS calculation, and the 2nd item is a numeric vector containing LPS mean and sd for two groups

LPS_train

a data frame of LPS score, true classification, Empirical Bayesian probabilites for both groups, and its classification for all training samples, notice that the classification is based on probabilities instead of LPS scores, and there is a UNCLASS group besdies the given two groups

classCompare

a confusion matrix list object that compares LPS classification (based on selected features and weights) to input group classification for training data set, notice that the samples with UNCLASS are excluded since confusion matrix can not compare 3 groups to 2 groups

classTable

a table to display comparison between LPS classification (based on selected features and weights) and input group classification for the given training data set. Since UNCLASS is excluded from confusion matrix, we add this table for full comparison

Author(s)

Aixiang Jiang

References

Bauer, D. F. (1972). Constructing confidence sets using rank statistics. Journal of the American Statistical Association 67, 687–690. doi: 10.1080/01621459.1972.10481279.

Best, D. J. & Roberts, D. E. (1975). Algorithm AS 89: The Upper Tail Probabilities of Spearman's rho. Applied Statistics, 24, 377–379. doi: 10.2307/2347111.

Hollander, M. & Wolfe, D.A. (1973), Nonparametric Statistical Methods. New York: John Wiley & Sons. Pages 185–194 (Kendall and Spearman tests).

Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, No. 1, Article 3. http://www.statsci.org/smyth/pubs/ebayes.pdf

Smyth, G. K., Michaud, J., and Scott, H. (2005). The use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics 21(9), 2067-2075.

Wright G, Tan B, Rosenwald A, Hurt EH, Wiestner A, Staudt LM. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9991-6.


ajiangsfu/PRPS documentation built on April 29, 2023, 10:13 p.m.