PRPSstableSLwithWeights: PRPS stable self-training
In ajiangsfu/PRPS: Binary classification with PRPS, PRPS-ST and more

Description

This function calculates PRPS (Probability ratio based classification prediction score) scores and makes binary classification calls for a testing data set without a PRPS training object. It uses a self-training process with given features and their weights.

Usage

PRPSstableSLwithWeights(
  newdat,
  weights,
  plotName = NULL,
  ratioRange = c(0.05, 0.95),
  stepby = 0.05,
  standardization = FALSE,
  classProbCut = 0.9,
  PRPShighGroup = "PRPShigh",
  PRPSlowGroup = "PRPSlow",
  breaks = 50,
  imputeNA = FALSE,
  byrow = TRUE,
  imputeValue = c("median", "mean")
)

Arguments

`newdat`	an input data matrix or data frame, with columns for samples and rows for features
`weights`	a numeric vector with selected features (as names of the vector) and their weights
`plotName`	a pdf file name with full path, ending in ".pdf", used to save multiple pages of PRPS histograms with distribution densities. The default value is NULL, in which case no plot is saved.
`ratioRange`	a numeric vector of two numbers giving the ratio search range. The default is c(0.05, 0.95), which should NOT be changed in most situations. However, if your classification is very unbalanced (for example, one group is much smaller than the other), and/or sample variation is large, and/or classification results are far from what you expect, you might want to change the default values. c(0.15, 0.85) is the recommended alternative to the default. In extremely rare situations, c(0.4, 0.6) could be worth trying.
`stepby`	a numeric parameter giving the step size between successive ratios in the search; it should be within (0,1). The default value is 0.05, but a user can change it to other values such as 0.01
`standardization`	a logical variable to indicate if standardization is needed before classification score calculation
`classProbCut`	a numeric variable within (0,1), the cutoff for the Empirical Bayesian probability. Commonly used values are 0.8 and 0.9; the default value is 0.9. The same value is used for both groups, and samples that are not included in either group are assigned as UNCLASS
`PRPShighGroup`	a string to indicate group name with high PRPS score
`PRPSlowGroup`	a string to indicate group name with low PRPS score
`breaks`	an integer to indicate the number of bins in the histogram, default is 50
`imputeNA`	a logical variable to indicate if NA imputation is needed. If it is TRUE, NA imputation is performed before any other step; the default is FALSE
`byrow`	a logical variable to indicate the direction for imputation. The default is TRUE, which uses the row data for imputation
`imputeValue`	a character variable to indicate which value is used to replace NA. The default is "median", the median value along the direction chosen with "byrow"
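
As a minimal sketch of the expected input layout, the base R snippet below builds a toy `newdat` matrix (rows are features, columns are samples) and a named `weights` vector, then shows what row-wise median imputation could look like. The helper `impute_na` and the gene and sample names are hypothetical and only illustrate the `imputeNA`, `byrow` and `imputeValue` arguments; the package's internal imputation code may differ.

```r
# Toy inputs in the documented layout: rows are features, columns are samples.
set.seed(1)
newdat <- matrix(rnorm(12), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 paste0("S", 1:4)))
newdat["geneB", "S2"] <- NA  # introduce one missing value

# Named weight vector: the names are the selected features.
weights <- c(geneA = 1.5, geneC = -0.8)

# Hypothetical helper (not part of the package): replace each NA with the
# median or mean taken along rows (byrow = TRUE) or columns (byrow = FALSE).
impute_na <- function(x, byrow = TRUE, imputeValue = c("median", "mean")) {
  imputeValue <- match.arg(imputeValue)
  f <- if (imputeValue == "median") median else mean
  margin <- if (byrow) 1 else 2
  fill <- apply(x, margin, f, na.rm = TRUE)   # one fill value per row/column
  idx <- which(is.na(x), arr.ind = TRUE)      # (row, col) positions of NAs
  x[idx] <- fill[idx[, margin]]
  x
}

imputed <- impute_na(newdat)

# Keep only the features named in the weights, as a scoring step would need.
selected <- imputed[names(weights), , drop = FALSE]
```

With `byrow = TRUE`, the missing value for `geneB` is filled from the other samples of the same feature, which matches the "row data" wording of the `byrow` argument.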

Details

This function tries to obtain a reasonable PRPS-based classification without a training data set, using only selected features and their weights. The steps are as follows:

1) Build a pool of group ratio priors, for example seq(0.05, 0.95, by = 0.05) for the default ratioRange = c(0.05, 0.95).
2) With the given features and their weights:
   a) for each prior in 1), call PSSLwithWeightsPrior with the given features and weights to obtain PRPS scores, then apply EM to the PRPS scores with Mclust to get a two-group classification;
   b) define the samples that always fall in the same class across the search range as stable classes.
3) Repeat step 2), this time with opposite signs for the given weights, which results in another set of stable classes.
4) Take the final stable classes as those common to 2) and 3).
5) Use the final stable classes to get group means and sds for each feature and for each group.
6) Calculate PRPS scores.
7) Once PRPS scores are available, the theoretical natural cutoff of 0 can be used to make classification calls, which may or may not be appropriate. Alternatively, with two groups based on stable classes and assuming that the PRPS score is a mixture of two normal distributions, Empirical Bayesian probabilities can be computed and used to make calls.
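
The stable-class search in steps 1) to 4) can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: `classify_with_prior` is a hypothetical stand-in for the PSSLwithWeightsPrior plus Mclust step (it splits on the midpoint between provisional group means), and swapping the labels back after the reversed-sign run is also an assumption, since this page does not describe how the two runs are reconciled.

```r
set.seed(42)
# Toy one-dimensional weighted score for 40 samples from two separated groups.
score <- c(rnorm(20, mean = -3), rnorm(20, mean = 3))
names(score) <- paste0("S", seq_along(score))

# Step 1: pool of group ratio priors for the default ratioRange and stepby.
priors <- seq(0.05, 0.95, by = 0.05)

# Hypothetical stand-in for one search step (NOT PSSLwithWeightsPrior/Mclust):
# treat the top `prior` fraction as a provisional high group, then classify
# every sample by the midpoint between the two provisional group means.
classify_with_prior <- function(score, prior) {
  cut <- quantile(score, probs = 1 - prior, names = FALSE)
  is_hi <- score >= cut
  midpoint <- (mean(score[is_hi]) + mean(score[!is_hi])) / 2
  ifelse(score > midpoint, "PRPShigh", "PRPSlow")
}

# Step 2: samples whose call never changes across the priors are stable.
stable_classes <- function(score, priors) {
  calls <- sapply(priors, function(p) classify_with_prior(score, p))
  stable <- apply(calls, 1, function(r) length(unique(r)) == 1)
  setNames(calls[stable, 1], names(score)[stable])
}

stable_fwd <- stable_classes(score, priors)

# Step 3: repeat with reversed signs; labels are swapped back (assumption)
# so that both runs describe the same orientation.
stable_rev <- stable_classes(-score, priors)
stable_rev <- ifelse(stable_rev == "PRPShigh", "PRPSlow", "PRPShigh")

# Step 4: keep samples that are stable, with the same label, in both runs.
common <- intersect(names(stable_fwd), names(stable_rev))
final_stable <- stable_fwd[common][stable_fwd[common] == stable_rev[common]]

table(final_stable)
```

Samples near the boundary between the two groups change class as the prior moves, so they drop out of `final_stable`; only samples that are called consistently are kept for estimating group means and sds.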

Value

A list with two items is returned: PRPS parameters for the selected features, and PRPS scores and classifications for the given samples.

`PRPS_pars`	a list of 3 items: the 1st item is a data frame with the weight of each selected feature for PRPS calculation, the 2nd item is a numeric vector containing the PRPS mean and sd for the two groups, and the 3rd item is a data frame containing the mean and sd for each group and for each selected feature based on stable classes
`PRPS_test`	a data frame of PRPS score, classification and the two groups' Empirical Bayesian probabilities based on stable classes, and classification with the natural 0 cutoff
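
To illustrate how the two kinds of classification in `PRPS_test` can differ, here is a small base R sketch. It assumes the Empirical Bayesian probability is the ratio of one group's normal density to the sum of both groups' densities, with means and sds estimated from the stable classes; the column names and toy scores are hypothetical, and the package's exact computation may differ.

```r
# Toy PRPS scores: six samples with stable classes plus one ambiguous sample.
prps <- c(-5, -4, -3, 3, 4, 5, 0.2)
stable <- c(rep("PRPSlow", 3), rep("PRPShigh", 3), NA)  # NA = not stable
classProbCut <- 0.9

# Group mean and sd of the score, estimated from stable samples only.
hi <- prps[which(stable == "PRPShigh")]
lo <- prps[which(stable == "PRPSlow")]

# Assumed form of the Empirical Bayesian probability under a mixture of
# two normal distributions with equal weights.
d_hi <- dnorm(prps, mean = mean(hi), sd = sd(hi))
d_lo <- dnorm(prps, mean = mean(lo), sd = sd(lo))
prob_high <- d_hi / (d_hi + d_lo)
prob_low <- 1 - prob_high

result <- data.frame(
  PRPS_score = prps,
  # One cutoff for both groups; anything in between becomes UNCLASS.
  class_EB = ifelse(prob_high >= classProbCut, "PRPShigh",
                    ifelse(prob_low >= classProbCut, "PRPSlow", "UNCLASS")),
  prob_high = prob_high,
  prob_low = prob_low,
  # Natural cutoff of 0 always assigns one of the two groups.
  class_0 = ifelse(prps > 0, "PRPShigh", "PRPSlow")
)
print(result)
```

In this toy case the last sample has a score of 0.2, so the natural 0 cutoff calls it PRPShigh, while its probability of belonging to the high group is below 0.9 and the probability-based call is UNCLASS.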

Author(s)

Aixiang Jiang

References

Ennishi D, Jiang A, Boyle M, Collinge B, Grande BM, Ben-Neriah S, Rushton C, Tang J, Thomas N, Slack GW, Farinha P, Takata K, Miyata-Takata T, Craig J, Mottok A, Meissner B, Saberi S, Bashashati A, Villa D, Savage KJ, Sehn LH, Kridel R, Mungall AJ, Marra MA, Shah SP, Steidl C, Connors JM, Gascoyne RD, Morin RD, Scott DW. Double-Hit Gene Expression Signature Defines a Distinct Subgroup of Germinal Center B-Cell-Like Diffuse Large B-Cell Lymphoma. J Clin Oncol. 2018 Dec 3:JCO1801583. doi: 10.1200/JCO.18.01583.

Ultsch, A., Thrun, M.C., Hansen-Goos, O., Loetsch, J.: Identification of Molecular Fingerprints in Human Heat Pain Thresholds by Use of an Interactive Mixture Model R Toolbox(AdaptGauss), International Journal of Molecular Sciences, doi:10.3390/ijms161025897, 2015.

Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, The R Journal, 8/1, pp. 205-233.

ajiangsfu/PRPS documentation built on April 29, 2023, 10:13 p.m.