Introduction

This package provides a feature selection method for single-cell RNA-seq data. It encodes the category number and calculates the Spearman correlation coefficient between each gene and the category number.

The encoding the category number (CAEN) method was developed to select feature genes that are differentially expressed between classes. The method is implemented as a set of R functions collected in an R package named CAEN, and this tutorial shows how to use the package. The method consists of three steps.

Step 1: Pre-process the data.

Step 2: Encode the category number and calculate the Spearman correlation coefficient between each gene and the category number.

Step 3: Calculate the classification error using the selected genes.
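
The idea behind Step 2 can be illustrated with base R alone. The sketch below is not the CAEN implementation and does not perform the within-class encoding; it only shows, on a hypothetical toy matrix (the names counts, gene_up and gene_flat are made up for illustration), how a Spearman correlation between each gene and a numeric category number can be used to rank genes.

```r
# Toy illustration only: base R, not the CAEN implementation.
# Category numbers for 12 observations in 3 classes.
y <- rep(1:3, each = 4)

# Hypothetical counts: genes in rows, observations in columns.
counts <- rbind(
  gene_up   = c(1, 2, 1, 3,  5, 6, 4, 7,  9, 12, 10, 11),  # rises with the class
  gene_flat = c(5, 3, 6, 4,  4, 6, 3, 5,  5, 4, 6, 3)      # same values in every class
)

# Spearman correlation of each gene (row) with the category number.
r <- apply(counts, 1, function(g) cor(g, y, method = "spearman"))

# Rank genes by absolute correlation, strongest first.
ranked <- names(sort(abs(r), decreasing = TRUE))

print(r)
print(ranked)
```

A gene whose expression changes monotonically with the category number obtains a correlation close to 1 or -1, while a gene with the same distribution in every class obtains a correlation near 0.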

We use a simulated dataset to illustrate the usage of the CAEN package. The programs run under Linux and Windows 10. The R version should be higher than 4.0.

Preparations

To install the CAEN package into your R environment, start R and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("CAEN")

Then, the CAEN package is ready to load.

library(CAEN)
library(SummarizedExperiment)

Data format

To make the CAEN workflow reproducible, the package includes example data sets generated by the function newCountDataSet(). The following example shows how to generate a simulated dataset:

dat <- newCountDataSet(n = 100, p = 500, K = 4, param = 10,
                       sdsignal = 2, drate = 0.2)

The output of the function newCountDataSet() includes:
"sim_train_data" represents the training data, a q × n data matrix.
"sim_test_data" represents the test data, a q × n data matrix.
The column names of these two matrices are the class labels for the n observations. q may be smaller than p when genes have 0 total counts in the dataset, so q <= p.
"truesf" denotes the size factors for the training observations.
"isDE" represents the differential gene label.

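The layout described above can be mimicked with a small hand-made matrix. This is a sketch under assumptions: the object toy is hypothetical and is not produced by newCountDataSet(), and dropping genes with zero total counts is our reading of why q <= p.

```r
# Hypothetical toy matrix: p = 3 genes in rows, n = 3 observations in columns.
# Column names hold the class labels, as in the simulated data.
toy <- matrix(c(3, 0, 7,
                0, 0, 0,
                2, 5, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("1", "1", "2")))

# Assumption: genes with 0 total counts are dropped, leaving q <= p genes.
kept <- toy[rowSums(toy) > 0, , drop = FALSE]

# Class labels are recovered from the column names.
labels <- as.numeric(colnames(kept))

# Transpose to observations x genes, the orientation used for classification.
x_toy <- t(kept)

print(dim(kept))
print(labels)
```
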
Calculate the Spearman correlation coefficient for the data

For the category number, we need to consider not only the differences between classes but also the intra-category differences. We therefore propose the CAEN method: by encoding the category number within each class, it obtains the optimal category number and selects the most important genes for classification.

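# Transpose so that observations are in rows and genes in columns
x <- t(assay(dat$sim_train_data))
# Class labels are stored in the column names
y <- as.numeric(colnames(dat$sim_train_data))
xte <- t(assay(dat$sim_test_data))

# Estimate with the training data also passed as xte
prob <- estimatep(x = x, y = y, xte = x, beta = 1,
                  type = c("mle","deseq","quantile"),
                  prior = NULL)
# Estimate with the test data passed as xte
prob0 <- estimatep(x = x, y = y, xte = xte, beta = 1,
                   type = c("mle","deseq","quantile"),
                   prior = NULL)
# Correlate each gene with the encoded category number (K = 4 classes)
myscore <- CAEN(dataTable = assay(dat$sim_train_data),
                y = as.numeric(colnames(dat$sim_train_data)), K = 4,
                gene_no_list = 100)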

The function CAEN returns a list containing the computed correlation coefficients and the top differentially expressed genes, where "r" represents the correlation coefficient between each gene and the category number, and "np" represents the label of the top differential features.

Calculate the classification error rate using genes selected with the CAEN method

After obtaining the important genes, we calculate the classification error rate using the selected genes. The steps are as follows:

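# Restrict the data and the estimates to the selected genes
ddd <- myscore$np
datx <- x[,ddd]
datxte <- xte[,ddd]
probb <- prob[ddd,]
probb0 <- prob0[ddd,]

# Choose rho by cross-validation on the training data
zipldacv.out <- ZIPDA.cv(x = datx, y = y, prob0 = t(probb))
# Classify the test observations with the selected rho
ZIPLDA.out <- ZIPLDA(x = datx, y = y,
                     xte = datxte, transform = FALSE, prob0 = t(probb0),
                     rho = zipldacv.out$bestrho)
classResult <- ZIPLDA.out$ytehat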
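
Once predicted classes are available, the error rate itself is the fraction of test observations whose predicted class differs from the true class. The sketch below uses hypothetical label vectors (predicted and truth are made-up values, not output of the package) so that it runs without CAEN installed.

```r
# Hypothetical labels for illustration only.
predicted <- c(1, 2, 2, 3, 4, 4, 1, 3)   # stands in for the predicted classes
truth     <- c(1, 2, 3, 3, 4, 4, 1, 2)   # stands in for the known test labels

# Classification error rate: share of observations that are misclassified.
error_rate <- mean(predicted != truth)

# Confusion table: rows are true classes, columns are predicted classes.
confusion <- table(truth = truth, predicted = predicted)

print(error_rate)
print(confusion)
```

In the workflow above, predicted would presumably be classResult, and truth could be taken as as.numeric(colnames(dat$sim_test_data)), assuming the column names of the test matrix hold the class labels as described in the data format section.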


zhangli1109/ENTC documentation built on Nov. 10, 2020, 11:16 p.m.