CAM3Prep: Data preprocessing for CAM3


CAM3Prep R Documentation

Data preprocessing for CAM3

Description

This function performs preprocessing for CAM, including norm-based filtering, dimension reduction, perspective projection, local outlier removal, and aggregation of gene expression vectors by clustering.

Usage

CAM3Prep(
  data,
  dim.rdc = 10,
  thres.low = 0.05,
  thres.high = 1,
  cluster.method = c("Fixed-Radius", "K-Means"),
  radius.thres = 0.95,
  sim.thres = 0.95,
  cluster.num = 50,
  MG.num.thres = 20,
  sample.weight = NULL
)

Arguments

data

Matrix of mixture expression profiles. Data frame, SummarizedExperiment or ExpressionSet object will be internally coerced into a matrix. Each row is a gene and each column is a sample. Data should be in non-log linear space with non-negative numerical values (i.e. >= 0). Missing values are not supported. All-zero rows will be removed internally.

dim.rdc

Reduced data dimension; should be no less than the maximum candidate K.

thres.low

Lower bound of the fraction of genes to keep for CAM, with genes ranked by norm. The value should be between 0 and 1. The default is 0.05.

thres.high

Upper bound of the fraction of genes to keep for CAM, with genes ranked by norm. The value should be between 0 and 1. The default is 1.

cluster.method

The clustering method. The default, "Fixed-Radius", makes all clusters the same size. The alternative, "K-Means", uses kmeans.

radius.thres

The "cosine" radius of "Fixed-Radius" clustering. The default is 0.95.

sim.thres

The cosine similarity threshold for cluster centers. Clusters whose cosine similarity is higher than the threshold are merged until the number of clusters equals cluster.num. This parameter can be used to control the upper bound of similarity among sources. The default is 0.95.

cluster.num

The lower bound on the number of clusters, which should be much larger than K. The default is 50.

MG.num.thres

Clusters containing fewer genes than MG.num.thres are treated as outliers. The default is 20.

sample.weight

Vector of sample weights. If NULL, all samples have the same weight. The length should equal the number of samples. All values should be positive.
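
To make the roles of thres.low and thres.high concrete, here is a minimal base R sketch of norm-rank filtering on toy data. It does not reproduce the package's internal code: the use of quantile() to turn the two fractions into norm cutoffs is an assumption, and the names X, gene.norm, cutoffs, valid, and X.kept are illustrative only.

```r
set.seed(1)
# Toy mixture matrix: 200 genes (rows) x 6 samples (columns), non-negative
X <- matrix(rexp(200 * 6), nrow = 200)

thres.low  <- 0.05
thres.high <- 0.95

# L2 norm of each gene's expression vector
gene.norm <- sqrt(rowSums(X^2))

# Assumed rule: keep genes whose norm lies between the thres.low and
# thres.high quantiles of all gene norms
cutoffs <- quantile(gene.norm, probs = c(thres.low, thres.high))
valid <- gene.norm >= cutoffs[1] & gene.norm <= cutoffs[2]

# Genes that survive the filter
X.kept <- X[valid, , drop = FALSE]
```

Here valid plays the same role as the Valid component of the return value: a logical vector with one entry per gene.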

Details

This function is used internally by the CAM3Run function to preprocess data. It can also be called directly when you want to perform CAM step by step.

Genes with low or high expression are filtered according to the rank of their L2 norm. Dimension reduction differs slightly from standard PCA: the first loading vector is forced to be c(1,1,...,1), normalized to unit norm, and the remaining loading vectors are PCA eigenvectors computed in the space orthogonal to the first vector. Perspective projection then projects the dimension-reduced gene expression vectors onto the hyperplane orthogonal to c(1,0,...,0), i.e., the first axis in the new coordinate system. Finally, gene expression vectors are aggregated by clustering, which further reduces the impact of noise and outliers and helps improve the efficiency of simplex corner detection.
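
The dimension reduction and perspective projection described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the use of prcomp on the residual data, the orientation of Xprep (reduced dimensions in rows, genes in columns), and the choice to realize the perspective projection by scaling every gene so that its first coordinate equals 1 are all assumptions. Sample weights are ignored.

```r
set.seed(1)
# Toy mixture matrix: 200 genes (rows) x 6 samples (columns), non-negative
X <- matrix(rexp(200 * 6), nrow = 200)
M <- ncol(X)
dim.rdc <- 3

# First loading vector: all ones, scaled to unit norm
w1 <- rep(1, M) / sqrt(M)

# Remove the component along w1 from every gene, so that the PCA below
# operates in the space orthogonal to w1
X.res <- X - (X %*% w1) %*% t(w1)
pca <- prcomp(X.res, center = TRUE, scale. = FALSE)

# Loading matrix: rows are loading vectors (w1 first, then the leading PCs)
W <- rbind(w1, t(pca$rotation[, seq_len(dim.rdc - 1), drop = FALSE]))

# Dimension-reduced data: dim.rdc rows, one column per gene
Xprep <- W %*% t(X)

# Perspective projection (assumed form): divide each gene by its first
# coordinate, which places every gene on the hyperplane where that
# coordinate equals 1
Xproj <- sweep(Xprep, 2, Xprep[1, ], "/")
```

Because the data are non-negative and contain no all-zero rows, the first coordinate of every gene is strictly positive, so the division is well defined.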

Value

An object of class "CAMPrepObj" containing the following components:

Valid

Logical vector indicating the genes left after filtering.

Xprep

Preprocessed data matrix.

Xproj

Preprocessed data matrix after perspective projection.

W

The matrix whose rows are loading vectors.

SW

Sample weights.

cluster

Clustering results, consisting of two vectors. The first gives the cluster to which each gene is allocated. The second gives the number of genes in each cluster.

c.outlier

The clusters containing fewer genes than MG.num.thres.

centers

The centers of candidate corner clusters (candidate clusters containing marker genes).
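
As a rough illustration of how the cluster, c.outlier, and centers components relate to MG.num.thres and sim.thres, the sketch below clusters toy data with stats::kmeans. It is not the package's code: the toy matrix Y, the reduced cluster.num of 10, and the way candidate pairs for merging are listed are assumptions made for the example.

```r
set.seed(1)
# Toy projected data: 300 genes (rows) in 3 dimensions, all positive
Y <- matrix(rexp(300 * 3), nrow = 300)

cluster.num  <- 10
MG.num.thres <- 20
sim.thres    <- 0.95

km <- kmeans(Y, centers = cluster.num, iter.max = 50, nstart = 5)

# Two vectors analogous to the documented cluster component
gene.cluster <- km$cluster   # cluster index of each gene
cluster.size <- km$size      # number of genes in each cluster

# Clusters with fewer genes than MG.num.thres are flagged as outliers
c.outlier <- which(cluster.size < MG.num.thres)

# Centers of the remaining clusters
keep <- setdiff(seq_len(cluster.num), c.outlier)
centers <- km$centers[keep, , drop = FALSE]

# Cosine similarity between the remaining centers
unit <- centers / sqrt(rowSums(centers^2))
cos.sim <- unit %*% t(unit)

# Pairs of centers more similar than sim.thres (candidates for merging)
merge.candidates <- which(cos.sim > sim.thres & upper.tri(cos.sim),
                          arr.ind = TRUE)
```

Each row of merge.candidates indexes a pair of rows in centers whose cosine similarity exceeds sim.thres.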

Examples

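# Obtain data: load the example data set and take its mixture expression matrix
data(ratMix3)
data <- ratMix3$X

# Preprocess data: reduce to 3 dimensions, with norm-rank bounds of 0.30 and 0.95
rPrep3 <- CAM3Prep(data, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95)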

ChiungTingWu/CAM3 documentation built on Feb. 14, 2024, 9:22 a.m.