R/featureSelection.R
In CancerSubtypes: Cancer subtypes identification, validation and visualization based on multiple genomic data sets

Documented in FSbyCox FSbyMAD FSbyPCA FSbyVar

#'Biological feature (such as gene) selection based on Cox regression model.
#'
#' Cox model (Proportional hazard model) is a statistical approach for survival risk analysis. We applied the univariate Cox model
#' for feature selection. The proportional hazard assumption test is used to evaluate the significant level of each biological feature
#' related to the survival result for samples. 
#' Eventually, the most significant genes are selected for clustering analysis. 
#' 
#' @importFrom survival coxph Surv
#' @param Data A data matrix representing the genomic data measured in a set of samples.
#' For the matrix, the rows represent the genomic features, and the columns represent the samples.
#' @param time A numeric vector representing the survival time (days) of a set of samples. Note that the
#' order of the time should map the samples in the Data matrix.
#' @param status A numeric vector representing the survival status of a set of samples. 0=alive/censored, 1=dead. Note that the
#' order of the time should map the samples in the Data matrix.
#' @param cutoff A numeric value in (0,1) representing whether the significant feature Xi is selected
#' according to the Proportional Hazards Assumption p-value of the feature Xi. If p-value(Xi)<cutoff, the
#' features Xi will be selected for downstream analysis. Normally the significant level is set to 0.05.
#' @return A data matrix, extracted a subset with significant features from the input data matrix.
#' The rows represent the significant features, and the columns represents the samples.
#' @author
#'  Xu,Taosheng \email{taosheng.x@@gmail.com}, Thuc Le \email{Thuc.Le@@unisa.edu.au}
#'  
#'  
#' @examples
#' data(GeneExp)
#' data(time)
#' data(status)
#' data1=FSbyCox(GeneExp,time,status,cutoff=0.05)
#' @references
#' Andersen, P. and Gill, R. (1982). Cox's regression model for counting processes, a large sample study. Annals of Statistics 10, 1100-1120.\cr
#' Therneau, T., Grambsch, P., Modeling Survival Data: Extending the Cox Model. Springer-Verlag, 2000.
#' @export
FSbyCox<-function(Data,time,status,cutoff=0.05)
{
  feature_cox_test=NULL;
  for(i in 1:nrow(Data))
  {
    dataset=list(time,status,x<-Data[i,])
    temp<-coxph(Surv(time, status) ~ x, dataset)
    feature_cox_test=rbind(feature_cox_test,summary(temp)$coefficients)
  }
  indexNum=which(feature_cox_test[,5]<cutoff)
  selectData=Data[indexNum,]
  selectData
}



#' Biological feature (such as gene) selection based on the most variant Median Absolute Deviation (MAD).
#'
#' @param Data A data matrix representing the genomic data measured in a set of samples.
#' For the matrix, the rows represent the genomic features, and the columns represents the samples.
#' @param cut.type A character value representing the selection type. The optional values are shown below:
#' \itemize{
#' \item "topk"
#' \item "cutoff"
#' }
#' @param value A numeric value.\cr
#' If the cut.type="topk", the top number of value features are selected.\cr
#' If the cut.type="cutoff", the features with (MAD>value) are selected.
#' @return An extracted subset data matrix with the most variant MAD features from the input data matrix.
#' @author
#'  Xu,Taosheng \email{taosheng.x@@gmail.com}, Thuc Le \email{Thuc.Le@@unisa.edu.au}
#'  
#'  
#' @examples
#' data(GeneExp)
#' data1=FSbyMAD(GeneExp, cut.type="topk",value=1000)
#' 
#' @export
FSbyMAD<-function(Data, cut.type="topk", value)
{
  mads=apply(Data,1,mad)
  feature_num=length(mads)
  hist(mads, breaks=feature_num*0.1, col="red",
       main="Expression (MAD) distribution",
       xlab="The MAD of feature")
  if(cut.type=="topk")
  {
    index= sort(mads,decreasing = TRUE,index.return=TRUE)
    if(value>nrow(Data))
    {
      value=nrow(Data)
      cat("Warning: the feature selection number is beyond the original feature numnber")
    }
    cutoff=index$x[value]
    abline(v=cutoff,col = "blue",lty = 5,lwd=1.5)
    index=index$ix[1:value]
    selectData=Data[index,]
  }
  if(cut.type=="cutoff")
  {
    abline(v=value,col = "blue",lty = 5,lwd=1.5)
    index=which(mads>value)
    selectData=Data[index,]
  }
  selectData
  
  
}


#' Biological feature (such as gene) selection based on the most variance.
#' @param Data A data matrix representing the genomic data measured in a set of samples.
#' For the matrix, the rows represent the genomic features, and the columns represents the samples.
#' @param cut.type A character value representing the selection type. The optional values are shown below:
#' \itemize{
#' \item "topk"
#' \item "cutoff"
#' }
#' @param value A numeric value.\cr
#' If the cut.type="topk", the top number of value features are selected.\cr
#' If the cut.type="cutoff", the features with (var>value) are selected.
#' @return An extracted subset data matrix with most variance features from the input data matrix.
#' @author
#'  Xu,Taosheng \email{taosheng.x@@gmail.com}, Thuc Le \email{Thuc.Le@@unisa.edu.au}
#'  
#'  
#' @examples
#' data(GeneExp)
#' data1=FSbyVar(GeneExp, cut.type="topk",value=1000) 
#' 
#' @export
FSbyVar<-function(Data, cut.type="topk", value)
{
  vars=apply(Data,1,var)
  feature_num=length(vars)
  hist(vars, breaks=feature_num*0.1, col="red",
       main="Expression (Variance) distribution",
       xlab="The Variance of feature")
  if(cut.type=="topk")
  {
    index= sort(vars,decreasing = TRUE,index.return=TRUE)
    if(value>nrow(Data))
    {
      value=nrow(Data)
      cat("Warning: the feature selection number is beyond the original feature numnber")
    }
    cutoff=index$x[value]
    abline(v=cutoff,col = "blue",lty = 5,lwd=1.5)
    index=index$ix[1:value]
    selectData=Data[index,]
  }
  if(cut.type=="cutoff")
  {
    abline(v=value,col = "blue",lty = 5,lwd=1.5)
    index=which(vars>value)
    selectData=Data[index,]
  }
  selectData
}


#' Biological feature (such as gene) dimension reduction and extraction based on Principal Component Analysis.
#' 
#' This function is based on the prcomp(), we write a shell for it and make it easy to use on genomic data.
#'
#' @param Data A data matrix representing the genomic data measured in a set of samples.
#' For the matrix, the rows represent the genomic features, and the columns represents the samples.
#' @param PC_percent A numeric values in [0,1]  representing the ratio  of principal component is seclected.
#' @param scale A bool variable, If true, the Data is normalized before PCA.
#'
#' @return
#' A new matrix with full or part Principal Component in new projection space.
#' @author
#'  Xu,Taosheng \email{taosheng.x@@gmail.com},Thuc Le \email{Thuc.Le@@unisa.edu.au}
#'  
#'  
#' @examples
#' data(GeneExp)
#' data1=FSbyPCA(GeneExp, PC_percent=0.9,scale = TRUE)
#'  
#' @export
#'
FSbyPCA<-function(Data,PC_percent=1,scale = TRUE)
{
  newData=t(Data)
  aa=prcomp(newData,scale = scale)
  vars <- apply(aa$x, 2, var)
  props <- vars / sum(vars)
  xx=as.vector(cumsum(props))
  num=which(xx>PC_percent)[1]
  coeff=aa$ro
  score=(newData%*%coeff)[,1:num]
  result=t(score)
  result
}

Any scripts or data that you put into this service are public.

CancerSubtypes documentation built on Nov. 8, 2020, 8:24 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

CancerSubtypes
Cancer subtypes identification, validation and visualization based on multiple genomic data sets

R/featureSelection.R
In CancerSubtypes: Cancer subtypes identification, validation and visualization based on multiple genomic data sets

Defines functions FSbyPCA FSbyVar FSbyMAD FSbyCox

Documented in FSbyCox FSbyMAD FSbyPCA FSbyVar

Try the CancerSubtypes package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

CancerSubtypes Cancer subtypes identification, validation and visualization based on multiple genomic data sets

R/featureSelection.R In CancerSubtypes: Cancer subtypes identification, validation and visualization based on multiple genomic data sets

Defines functions FSbyPCA FSbyVar FSbyMAD FSbyCox

Documented in FSbyCox FSbyMAD FSbyPCA FSbyVar

Try the CancerSubtypes package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

CancerSubtypes
Cancer subtypes identification, validation and visualization based on multiple genomic data sets

R/featureSelection.R
In CancerSubtypes: Cancer subtypes identification, validation and visualization based on multiple genomic data sets