PStraining | R Documentation |
This is the wrapper function that selects top features, estimates parameters, and calculates PS (Prediction Strength) scores from a given training data set.
PStraining(
trainDat,
selectedTraits = NULL,
groupInfo,
refGroup = NULL,
topN = NULL,
FDRcut = NULL,
weightMethod = c("ttest", "limma", "PearsonR", "SpearmanR", "MannWhitneyU"),
classProbCut = 0.9,
imputeNA = FALSE,
byrow = TRUE,
imputeValue = c("median", "mean")
)
trainDat |
training data set, a data matrix or a data frame, samples are in columns, and features/traits are in rows |
selectedTraits |
a selected trait list if available |
groupInfo |
a known group classification, given in the same order as the column names of trainDat |
refGroup |
the code for reference group, default is the 1st item in groupInfo |
topN |
an integer indicating how many top features to select |
FDRcut |
an FDR cutoff to select top features; it is only used when topN is left at its default NULL. All features are returned if both topN and FDRcut are left at their default NULL |
weightMethod |
a string indicating the weight calculation method. There are five choices: "limma" for limma linear model based t value, "ttest" for t test based t value, "MannWhitneyU" for Mann Whitney U based rank-biserial, "PearsonR" for Pearson correlation coefficient, and "SpearmanR" for Spearman correlation coefficient; the default value is "limma" |
classProbCut |
a numeric variable within (0,1), used as a cutoff on the Empirical Bayesian probability; commonly used values are 0.8 and 0.9, and the default value is 0.9. The same value is used for both groups; samples that are not assigned to either group are labeled UNCLASS |
imputeNA |
a logical variable indicating whether NA imputation is needed; if it is TRUE, NA imputation is performed before any other step. The default is FALSE |
byrow |
a logical variable indicating the direction for imputation; the default is TRUE, which uses the row data for imputation |
imputeValue |
a character variable indicating which value is used to replace NA; the default is "median", i.e. the median of the data along the direction chosen with "byrow" |
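As a rough illustration of how the imputeNA, byrow, and imputeValue arguments interact, the sketch below replaces each NA with the median or mean of its own row (or column). The helper name impute_na_sketch and the toy matrix are hypothetical and are not part of the package; the package's own imputation may differ in detail.

```r
# Hypothetical helper (not the package's code): fill NAs with the row-wise
# or column-wise median/mean of the observed values.
impute_na_sketch <- function(dat, byrow = TRUE,
                             imputeValue = c("median", "mean")) {
  imputeValue <- match.arg(imputeValue)
  dat <- as.matrix(dat)
  fill_fun <- if (imputeValue == "median") median else mean

  # Replace the NAs of one vector with its own summary value.
  fill_one <- function(v) {
    v[is.na(v)] <- fill_fun(v, na.rm = TRUE)
    v
  }

  if (byrow) {
    # apply() over rows returns the rows as columns, so transpose back.
    out <- t(apply(dat, 1, fill_one))
  } else {
    out <- apply(dat, 2, fill_one)
  }
  dimnames(out) <- dimnames(dat)
  out
}

# Toy data: features in rows, samples in columns.
toy <- matrix(c(1, NA, 3,
                4, 5, NA),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("f1", "f2"), c("s1", "s2", "s3")))
imputed <- impute_na_sketch(toy, byrow = TRUE, imputeValue = "median")
```

With byrow = TRUE each feature is imputed from its own values across samples, which keeps features on their own scale.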
PS calculation is based on Golub 1999. In this wrapper function, we use four steps to calculate
PS scores and classification. The range of PS scores is [-1,1]. Before these four steps, NA imputation
is optional, whereas standardization is always required. The four steps are:
a) apply "standardize" to standardize input data matrix for each feature;
b) apply "getTrainingWeights" to select features and return weights for these features;
c) apply "getMeanOfGroupMeans" to get mean of group means for each selected feature;
d) use "apply" function to get PS scores for all samples with "getPS1sample", the formula is:
PS = (V_win − V_lose)/(V_win + V_lose)
Here, V_win and V_lose are the vote totals for the winning and losing features/traits for a given sample.
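Steps c) and d) can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the helper names mean_of_group_means and ps_one_sample, the toy data, and the weights are all hypothetical. It also assumes a signed variant of the formula, in which positive votes favor the non-reference group, so that the score falls in [-1,1] as stated above.

```r
# Hypothetical helper: midpoint between the two group means per feature
# (features in rows, samples in columns).
mean_of_group_means <- function(dat, groupInfo, refGroup) {
  is_ref <- groupInfo == refGroup
  ref_mean  <- rowMeans(dat[, is_ref, drop = FALSE], na.rm = TRUE)
  test_mean <- rowMeans(dat[, !is_ref, drop = FALSE], na.rm = TRUE)
  (ref_mean + test_mean) / 2
}

# Hypothetical helper: signed PS for one sample.
# Each feature votes with weight * (value - midpoint); NAs are ignored.
# The result is NaN if every vote is exactly zero.
ps_one_sample <- function(x, weights, mid) {
  votes <- weights * (x - mid)
  votes <- votes[!is.na(votes)]
  v_pos <- sum(votes[votes > 0])        # votes for the test group
  v_neg <- sum(abs(votes[votes < 0]))   # votes for the reference group
  (v_pos - v_neg) / (v_pos + v_neg)
}

# Toy data: 2 features, 4 samples (2 "test", 2 "ref").
dat <- matrix(c(4, 3, 0, 1,
                0, 1, 2, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("f1", "f2"), paste0("s", 1:4)))
groupInfo <- c("test", "test", "ref", "ref")
mid <- mean_of_group_means(dat, groupInfo, refGroup = "ref")

# Assumed weights: positive means higher in the test group.
weights <- c(f1 = 1, f2 = -1)

# One PS score per sample (column).
ps <- apply(dat, 2, ps_one_sample, weights = weights, mid = mid)
```

A sample whose features all vote the same way gets a score of 1 or -1; mixed votes pull the score toward 0.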
The theoretical cutoff for PS is 0. In addition, we also provide classification based on Empirical Bayes' probabilities.
When we calculate an Empirical Bayes' probability, the 1st group in the input mean and sd vectors is treated
as the test group.
When we calculate the probabilities, we first calculate the density value of a sample under each of the two groups,
and then use the following formula to get the Empirical Bayes' probability:
prob(x) = d_test(x)/(d_test(x) + d_ref(x))
Here prob(x) is the Empirical Bayes' probability of a given sample, d_test(x) is the density value
that a given sample belongs to the test group, d_ref(x) is the density value that a given sample belongs
to the reference group.
Notice that the test and reference groups are only relative labels; in fact, for this step,
we often need to calculate Empirical Bayes' probabilities for a given sample from both standpoints.
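The probability formula and the role of classProbCut can be sketched as below. This is an illustration under stated assumptions, not the package's code: the helpers eb_prob and classify_eb are hypothetical, the densities are assumed to be normal with the supplied group means and sds, and the comparison with the cutoff is assumed to be inclusive.

```r
# Hypothetical helper: prob(x) = d_test(x) / (d_test(x) + d_ref(x)),
# assuming normal densities for both groups.
eb_prob <- function(x, mean_test, sd_test, mean_ref, sd_ref) {
  d_test <- dnorm(x, mean = mean_test, sd = sd_test)
  d_ref  <- dnorm(x, mean = mean_ref,  sd = sd_ref)
  d_test / (d_test + d_ref)
}

# Hypothetical helper: evaluate the probability from both standpoints and
# label a sample only when one of them reaches the cutoff.
classify_eb <- function(ps, mean_test, sd_test, mean_ref, sd_ref,
                        classProbCut = 0.9,
                        labels = c(test = "test", ref = "ref")) {
  p_test <- eb_prob(ps, mean_test, sd_test, mean_ref, sd_ref)
  p_ref  <- eb_prob(ps, mean_ref, sd_ref, mean_test, sd_test)
  out <- rep("UNCLASS", length(ps))
  out[p_test >= classProbCut] <- labels[["test"]]
  out[p_ref  >= classProbCut] <- labels[["ref"]]
  out
}

# Made-up PS scores and group parameters, for illustration only.
ps_scores <- c(0.8, 0.05, -0.7)
calls <- classify_eb(ps_scores,
                     mean_test = 0.5, sd_test = 0.3,
                     mean_ref = -0.5, sd_ref = 0.3)
```

Because the two probabilities sum to 1, a cutoff above 0.5 means at most one label can apply to a sample; scores near the midpoint between the groups stay UNCLASS.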
This function also gives the classification for the training samples and a confusion matrix that compares the PS classification with the original group info for the training data set. If NAs are not imputed, they are ignored for feature selection, weight calculation, PS parameter estimation, and PS calculation.
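The comparison between PS-based calls and the original groups can be pictured with a plain cross-tabulation. The sketch below uses made-up scores and labels, applies the theoretical cutoff of 0, and uses base R's table(); the object names are hypothetical, and the package's own confusion matrix object may be built differently.

```r
# Made-up PS scores and known groups for five training samples.
ps_scores  <- c(0.6, 0.1, -0.4, -0.8, 0.2)
true_class <- c("test", "test", "ref", "ref", "ref")

# Classify at the theoretical cutoff of 0.
ps_class <- ifelse(ps_scores > 0, "test", "ref")

# Rows: original groups; columns: PS-based calls.
conf_tab <- table(true = true_class, predicted = ps_class)
print(conf_tab)
```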
A list is returned with the following items: PS parameters for the selected features, PS scores and classifications for the training samples, and a confusion matrix, together with its table, comparing the classification based on PS scores with the original classification.
PS_pars |
a data frame with all parameters needed for PS calculation for each selected feature |
PS_train |
a data frame of PS scores, the true classification, and the score-based classification for all training samples |
classCompare |
a confusion matrix list object that compares the PS classification, based on selected features and weights, with the input group classification for the training data set |
classTable |
a table that displays the comparison of the PS classification, based on selected features and weights, with the input group classification for the training data set |
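A call might look like the sketch below. The simulated matrix, the group labels "A" and "B", and the argument values are illustrative assumptions only. The call itself is guarded so the snippet still runs when the package that provides PStraining is not loaded.

```r
# Simulated training data: 20 features (rows) by 12 samples (columns).
set.seed(1)
n_feat <- 20
n_samp <- 12
trainDat <- matrix(rnorm(n_feat * n_samp), nrow = n_feat,
                   dimnames = list(paste0("gene", 1:n_feat),
                                   paste0("s", 1:n_samp)))

# Group labels in the same order as colnames(trainDat).
groupInfo <- rep(c("A", "B"), each = n_samp / 2)

# Shift the first five features upward in group B so they are informative.
trainDat[1:5, groupInfo == "B"] <- trainDat[1:5, groupInfo == "B"] + 2

# Only run if PStraining is available in the current session.
if (exists("PStraining", mode = "function")) {
  fit <- PStraining(trainDat = trainDat, groupInfo = groupInfo,
                    refGroup = "A", topN = 5, weightMethod = "ttest",
                    classProbCut = 0.9)
  print(head(fit$PS_pars))   # parameters for the selected features
  print(head(fit$PS_train))  # PS scores and classifications
  print(fit$classTable)      # comparison with the input groups
}
```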
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–7