PSstableSLwithWeights: PS stable self-training

View source: R/PSstableSLwithWeights.R

PSstableSLwithWeights    R Documentation

PS stable self-training

Description

This function calculates PS (Prediction Strength) scores and makes binary classification calls for a testing data set without a PS training object. It uses a self-training process based on given features and their weights.

Usage

PSstableSLwithWeights(
  newdat,
  weights,
  plotName = NULL,
  ratioRange = c(0.1, 0.9),
  stepby = 0.05,
  classProbCut = 0.9,
  PShighGroup = "PShigh",
  PSlowGroup = "PSlow",
  breaks = 50,
  imputeNA = FALSE,
  byrow = TRUE,
  imputeValue = c("median", "mean")
)

Arguments

newdat

an input data matrix or data frame, with samples in columns and features in rows

weights

a numeric vector with selected features (as names of the vector) and their weights

plotName

a pdf file name with full path, ending in ".pdf", which is used to save multiple pages of PS histograms with distribution densities. The default value is NULL, in which case no plot is saved.

ratioRange

a numeric vector of two numbers giving the ratio search range. The default is c(0.1, 0.9) for this function. If the classification is very unbalanced (one group is much smaller than the other), the sample variation is large, or the classification results are far from what you expect, you may want to change the default values. c(0.15, 0.85) is recommended as an alternative setting. In extremely rare situations, c(0.4, 0.6) could be worth trying.

stepby

a numeric value giving the step size between successive ratios in the search range. It should be within (0,1). The default value is 0.05, but it can be changed to other values such as 0.01
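As a rough illustration of how ratioRange and stepby combine, the sketch below builds the pool of candidate ratios in the way the Details section describes it (a seq() call over the search range). The variable name priors is only an assumption made for this example; the package's internal code may differ.

```r
# Default search settings taken from the Usage section
ratioRange <- c(0.1, 0.9)
stepby <- 0.05

# Candidate group ratio priors searched during self-training
priors <- seq(ratioRange[1], ratioRange[2], by = stepby)
print(priors)

# A finer grid simply uses a smaller step
priors_fine <- seq(ratioRange[1], ratioRange[2], by = 0.01)
length(priors_fine)
```

A smaller stepby gives more candidate ratios and therefore more classification runs, so the stability requirement becomes stricter and the run time longer.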

classProbCut

a numeric value within (0,1) used as the cutoff on the Empirical Bayesian probability. Commonly used values are 0.8 and 0.9; the default is 0.9. The same value is used for both groups, and samples that are not assigned to either group are labeled UNCLASS
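The following is a minimal sketch of how a single probability cutoff can produce three outcomes (high group, low group, UNCLASS). It assumes two normal components with equal mixing proportions; the helper classify_by_prob and its arguments are hypothetical and are not part of the package.

```r
# Hypothetical helper (not a package function): two-component normal
# model with equal mixing proportions, one cutoff for both groups
classify_by_prob <- function(ps, mean_high, sd_high, mean_low, sd_low,
                             classProbCut = 0.9,
                             PShighGroup = "PShigh",
                             PSlowGroup = "PSlow") {
  d_high <- dnorm(ps, mean = mean_high, sd = sd_high)
  d_low <- dnorm(ps, mean = mean_low, sd = sd_low)
  prob_high <- d_high / (d_high + d_low)

  # Samples that reach the cutoff for neither group stay UNCLASS
  cls <- rep("UNCLASS", length(ps))
  cls[prob_high >= classProbCut] <- PShighGroup
  cls[(1 - prob_high) >= classProbCut] <- PSlowGroup

  data.frame(PS_score = ps, prob_high = prob_high, class = cls,
             stringsAsFactors = FALSE)
}

# Toy PS scores and made-up group parameters
ps <- c(-0.8, -0.05, 0.02, 0.7)
res <- classify_by_prob(ps, mean_high = 0.5, sd_high = 0.2,
                        mean_low = -0.5, sd_low = 0.2)
print(res)
```

Scores near 0 sit between the two assumed components, so neither probability reaches 0.9 and those samples remain UNCLASS.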

PShighGroup

a string to indicate group name with high PS score

PSlowGroup

a string to indicate group name with low PS score

breaks

an integer giving the number of bins in the histogram, default is 50

imputeNA

a logical value indicating whether NA imputation is needed. If TRUE, NA imputation is carried out before any other step. The default is FALSE

byrow

a logical value indicating the direction of imputation. The default is TRUE, which uses the row data for imputation

imputeValue

a character value indicating which statistic is used to replace NA. The default is "median", in which case the median along the direction chosen with "byrow" is used
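To make the interaction of imputeNA, byrow and imputeValue concrete, here is a small sketch of row-wise or column-wise imputation. The helper impute_na is hypothetical and only mirrors the behavior described by these arguments; it is not the package's own code.

```r
# Hypothetical helper: replace each NA with the median or mean of its
# row (byrow = TRUE) or of its column (byrow = FALSE)
impute_na <- function(dat, byrow = TRUE, imputeValue = c("median", "mean")) {
  imputeValue <- match.arg(imputeValue)
  fun <- if (imputeValue == "median") stats::median else mean
  dat <- as.matrix(dat)
  margin <- if (byrow) 1 else 2

  # One fill value per row (or per column)
  fill <- apply(dat, margin, fun, na.rm = TRUE)

  # Positions of the NA cells as (row, col) pairs
  idx <- which(is.na(dat), arr.ind = TRUE)
  dat[idx] <- fill[idx[, margin]]
  dat
}

m <- matrix(c(1, NA, 3,
              4, 5, NA,
              7, 8, 9),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3")))

by_feature <- impute_na(m)                 # uses row (feature) medians
by_sample <- impute_na(m, byrow = FALSE)   # uses column (sample) medians
```

Since rows are features in newdat, byrow = TRUE fills a missing value from the same feature measured in the other samples.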

Details

This function aims to produce a reasonable PS based classification without a training data set, using only selected features and their weights. The steps are as follows:
1) Build a pool of group ratio priors, for example seq(0.1, 0.9, by = 0.05) for the default ratioRange = c(0.1, 0.9) and stepby = 0.05.
2) With the given features and their weights:
   a) for each prior in 1), call PSSLwithWeightsPrior with the given features and weights to obtain PS scores, then apply EM to the PS scores with Mclust to get a two-group classification;
   b) define the samples that always fall in the same class across the search range as stable classes.
3) Repeat step 2) with the signs of the given weights reversed, which yields another set of stable classes.
4) Take the final stable classes as those common to 2) and 3).
5) Use the final stable classes to get group means and sds for each feature and for each group.
6) Calculate PS scores.
7) Once PS scores are available, the theoretical natural cutoff of 0 can be used to make classification calls, which may or may not be appropriate. Alternatively, using the two groups based on the stable classes and assuming that the PS score is a mixture of two normal distributions, Empirical Bayesian probabilities can be calculated and used to make calls.
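The stable-class and scoring steps above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the package's implementation: the helpers stable_calls and ps_score are hypothetical, the per-prior calls are typed in by hand instead of coming from PSSLwithWeightsPrior and Mclust, the reversed-weights pass is omitted, and the score follows the weighted-vote form of Golub et al. (1999), which is assumed to be what PS refers to here.

```r
# Hypothetical helper: keep a sample's class only if it is identical
# across all runs (one column per prior); otherwise return NA
stable_calls <- function(calls) {
  same <- apply(calls, 1, function(z) length(unique(z)) == 1)
  out <- ifelse(same, calls[, 1], NA_character_)
  names(out) <- rownames(calls)
  out
}

# Hypothetical helper: weighted-vote PS score in the style of Golub et al.
# dat: features x samples; group_means: features x 2 matrix
ps_score <- function(dat, weights, group_means) {
  feats <- names(weights)
  mid <- rowMeans(group_means[feats, , drop = FALSE])  # mean of two means
  votes <- weights * (dat[feats, , drop = FALSE] - mid)
  v_high <- colSums(pmax(votes, 0))    # votes for the high group
  v_low <- colSums(pmax(-votes, 0))    # votes for the low group
  (v_high - v_low) / (v_high + v_low)
}

# Toy data: 3 features x 6 samples
dat <- rbind(g1 = c(5, 6, 5, 1, 2, 3),
             g2 = c(1, 2, 1, 5, 6, 3),
             g3 = c(4, 5, 4, 2, 1, 3))
colnames(dat) <- paste0("s", 1:6)
weights <- c(g1 = 1, g2 = -1, g3 = 0.5)

# Hand-made calls from three runs; s6 changes class, so it is unstable
calls <- cbind(run1 = c("PShigh", "PShigh", "PShigh", "PSlow", "PSlow", "PShigh"),
               run2 = c("PShigh", "PShigh", "PShigh", "PSlow", "PSlow", "PSlow"),
               run3 = c("PShigh", "PShigh", "PShigh", "PSlow", "PSlow", "PSlow"))
rownames(calls) <- colnames(dat)
stable <- stable_calls(calls)

# Group means per feature, computed from stable samples only
high_ids <- names(stable)[which(stable == "PShigh")]
low_ids <- names(stable)[which(stable == "PSlow")]
group_means <- cbind(PShigh = rowMeans(dat[, high_ids, drop = FALSE]),
                     PSlow = rowMeans(dat[, low_ids, drop = FALSE]))

# PS scores for all samples, then calls with the natural cutoff 0
ps <- ps_score(dat, weights, group_means)
calls0 <- ifelse(ps >= 0, "PShigh", "PSlow")
```

In this form the score lies in [-1, 1], which is why 0 is the natural cutoff; the unstable sample s6 still receives a score because scoring is applied to every sample, not only to the stable ones.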

Value

A list with two items is returned: PS parameters for selected features, PS scores and classifications for the given samples.

PS_pars

a list of 3 items: the 1st item is a data frame with the weights of the selected features used for PS calculation, the 2nd item is a numeric vector containing the PS mean and sd for the two groups, and the 3rd item is a data frame containing, for each feature, the mean of each group and the mean of these two means, based on the stable classes

PS_test

a data frame of PS score and classification with natural 0 cutoff
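A possible call on simulated data is sketched below. The data, feature names and weights are made up for illustration, and the package name PRPS is taken from this page's footer; the call is guarded so that the snippet still runs when the package is not installed.

```r
set.seed(1)

# Simulated expression-like data: 10 features (rows) x 60 samples (columns)
newdat <- matrix(rnorm(10 * 60), nrow = 10,
                 dimnames = list(paste0("gene", 1:10),
                                 paste0("sample", 1:60)))

# First 30 samples: features 1-5 shifted up, features 6-10 shifted down
newdat[1:5, 1:30] <- newdat[1:5, 1:30] + 3
newdat[6:10, 1:30] <- newdat[6:10, 1:30] - 3

# Named weights: names must match feature (row) names in newdat
weights <- setNames(c(rep(1, 5), rep(-1, 5)), rownames(newdat))

if (requireNamespace("PRPS", quietly = TRUE)) {
  res <- PRPS::PSstableSLwithWeights(newdat = newdat, weights = weights)
  str(res$PS_pars, max.level = 1)   # weights, PS mean/sd, group means
  head(res$PS_test)                 # PS scores and classification calls
}
```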

Author(s)

Aixiang Jiang

References

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–7

Ultsch, A., Thrun, M.C., Hansen-Goos, O., Loetsch, J.: Identification of Molecular Fingerprints in Human Heat Pain Thresholds by Use of an Interactive Mixture Model R Toolbox (AdaptGauss), International Journal of Molecular Sciences, doi:10.3390/ijms161025897, 2015.

Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, The R Journal, 8/1, pp. 205-233.


ajiangsfu/PRPS documentation built on April 29, 2023, 10:13 p.m.