PRPStraining: PRPS training

View source: R/PRPStraining.R

PRPStrainingR Documentation

PRPS training

Description

This is the wrap up function to select top features, estimate parameters, and calculate PRPS (Probability ratio based classification predication score) scores based on a given training data set.

Usage

PRPStraining(
  trainDat,
  standardization = FALSE,
  selectedTraits = NULL,
  groupInfo,
  refGroup = NULL,
  topN = NULL,
  FDRcut = NULL,
  weightMethod = c("ttest", "limma", "PearsonR", "SpearmanR", "MannWhitneyU"),
  classProbCut = 0.9,
  imputeNA = FALSE,
  byrow = TRUE,
  imputeValue = c("median", "mean")
)

Arguments

trainDat

training data set, a data matrix or a data frame, samples are in columns, and features/traits are in rows

standardization

a logic variable to indicate if standardization is needed before classification score calculation

selectedTraits

a selected trait list if available

groupInfo

a known group classification, which order should be the same as in colnames of trainDat

refGroup

the code for reference group, default is the 1st item in groupInfo

topN

an integer to indicate how many top features to be selected

FDRcut

a FDR cutoff to select top features, which is only valid when topN is set as defaul NULL, all features will be returned if both topN and FDRcut are set as default NULL

weightMethod

a string to indicate weight calculation method, there are five choices: "limma" for for limma linear model based t value,"ttest" for t test based t value, "MannWhitneyU" for Mann Whitney U based rank-biserial,"PearsonR" for Pearson correlation coefficient, "SpearmanR" for Spearman correlation coefficient, and the defualt value is "limma"

classProbCut

a numeric variable within (0,1), which is a cutoff of Empirical Bayesian probability, often used values are 0.8 and 0.9, default value is 0.9. Only one value is used for both groups, the samples that are not included in either group will be assigned as UNCLASS

imputeNA

a logic variable to indicate if NA imputation is needed, if it is TRUE, NA imputation is processed before any other steps, the default is FALSE

byrow

a logic variable to indicate direction for imputation, default is TRUE, which will use the row data for imputation

imputeValue

a character variable to indicate which value to be used to replace NA, default is "median", the median value of the chose direction with "byrow" data to be used

Details

PRPS calculation is based on Ennishi 2018, its formula is: PRPS(X_i) = \sum (|a_j| log10(P1(x_ij)/P0(x_ij))) Here, a_j represents the jth selected feature weights, and x_ij is the corresponding feature value for the ith sample, P1 and P0 are the probabilities that the ith sample belongs to two different groups. In this wrap up function, we use three steps to calculate PRPS scores and classification. Before these three steps, we also give an option for NA imputation and for standardization for each feature. The three steps are: a) Apply "getTrainingWeights" to select features and return weights for these features. b) Use "apply" function to get PRPS classification scores and Empirical Bayes' probabilites for all samples. When we calculate a Empirical Bayes' probability, the 1st group in the input mean and sd vectors is treated as the test group. When we calculate the probabilities, we first calcualte probability that a sample belongs to either group, and then use the following formula to get Empirical Bayes' probability: prob(x) = d_test(x)/(d_test(x) + d_ref(x)) Here prob(x) is the Empirical Bayes' probability of a given sample, d_test(x) is the density value that a given sample belongs to the test group, d_ref(x) is the density value that a given sample belongs to the reference group. Notice that the test and reference group is just the relative grouping, in fact, for this step, we often need to calculate Empirical Bayes' probabilities for a given sample from two different standing points. c) This function also give classification for the training group and confusion matrix to compare PRPS classification with original group info for training data set. If NAs are not imputed, they are ignored for feature selection, weight calculation, PRPS parameter estimation, and PRPS calculation.

Value

A list with three items is returned: PRPS parameters for selected features, PRPS scores and classifications for training samples, and confusion matrix to compare classification based on PRPS scores and original classification.

PRPS_pars

a list of 3 items, the 1st item is a data frame with weights and group testing results of each selected features for PRPS calculation, the 2nd item is a numeric vector containing PRPS mean and sd for two groups,and the 3rd item is a data frame contains mean and sd for each group and for each selected feature

PRPS_train

a data frame of PRPS score, true classification, Empirical Bayesian probabilites for both groups, and its classification for all training samples, notice that there are two ways for classifications, one is based on probabilities, and there is UNCLASS group besdies the given two groups, alternatively, the other one is based on PRPS scores directly and 0 treated as a natural cutoff

classCompare

a confusion matrix list object that compare PRPS classification based on selected features and weights compared to input group classification for training data set, notice that the samples with UNCLASS are excluded since confusion matrix can not compare 3 groups to 2 groups

classTable

a table to display comparison of PRPS classification based on selected features and weights compared to input group classification for training data set. Since UNCLASS is excluded from confusion matrix, add this table for full comparison

References

Ennishi D, Jiang A, Boyle M, Collinge B, Grande BM, Ben-Neriah S, Rushton C, Tang J, Thomas N, Slack GW, Farinha P, Takata K, Miyata-Takata T, Craig J, Mottok A, Meissner B, Saberi S, Bashashati A, Villa D, Savage KJ, Sehn LH, Kridel R, Mungall AJ, Marra MA, Shah SP, Steidl C, Connors JM, Gascoyne RD, Morin RD, Scott DW. Double-Hit Gene Expression Signature Defines a Distinct Subgroup of Germinal Center B-Cell-Like Diffuse Large B-Cell Lymphoma. J Clin Oncol. 2018 Dec 3:JCO1801583. doi: 10.1200/JCO.18.01583.

Wright G, Tan B, Rosenwald A, Hurt EH, Wiestner A, Staudt LM. A trait expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9991-6.


ajiangsfu/PRPS documentation built on April 29, 2023, 10:13 p.m.