Linnorm-PCA Clustering pipeline for subpopulation Analysis

Description

This function first applies the Linnorm transformation to the dataset. It then performs principal component analysis (PCA) on the transformed data and uses k-means clustering to identify subpopulations of cells.

Usage

Linnorm.PCA(datamatrix, showinfo = FALSE, input = "Raw",
  perturbation = 10, minZeroPortion = 0, keepAll = TRUE, num_PC = 2,
  num_center = c(1:20), Group = NULL, Coloring = "Group",
  pca.scale = FALSE, kmeans.iter = 2000)

Arguments

datamatrix

The matrix or data frame that contains your dataset. Each row is a feature (or gene) and each column is a sample (or replicate). Raw counts, CPM, RPKM, FPKM and TPM are supported. Undefined values such as NA are not supported, and log-transformed datasets are not compatible. If a Linnorm-transformed dataset is being used, please set the "input" argument to "Linnorm".

showinfo

Logical. Show information about the computing process. Defaults to FALSE.

input

Character. "Raw" or "Linnorm". If you have already transformed your dataset with Linnorm, set input to "Linnorm" so that the Linnorm-transformed dataset can be supplied in the "datamatrix" argument. Defaults to "Raw".

perturbation

Integer >= 2. To search for an optimal minimal deviation parameter (please see the article), Linnorm uses the iterated local search algorithm, which perturbs away from the initial local minimum. The range of the area searched in each perturbation increases exponentially as the area gets farther from the initial local minimum, as determined by its index. This range is calculated as 10 * (perturbation ^ index). Defaults to 10.
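
To make the range formula concrete, here is a small arithmetic sketch. The index values 1 to 3 are chosen only for illustration; which indices Linnorm actually visits is not stated here.

```r
# Illustration of the formula 10 * (perturbation ^ index).
# The index values below are assumptions made for this example only.
perturbation <- 10
index <- 1:3

search_range <- 10 * (perturbation ^ index)
print(search_range)
```

With the default perturbation of 10, each step multiplies the searched range by ten, so a larger perturbation value widens the search much faster.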

minZeroPortion

Double >= 0, <= 1. For example, setting minZeroPortion to 0.5 removes genes in which more than half of the data values are zero when calculating the normalizing parameter. Defaults to 0.
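
The filtering described for minZeroPortion = 0.5 can be sketched in base R as follows. This is only an illustration of the stated rule on a made-up matrix, not Linnorm's internal code; in particular, how Linnorm treats a gene with exactly half zeros is an assumption here.

```r
# Toy count matrix (hypothetical values): rows are genes, columns are samples.
counts <- matrix(c(5, 0, 3, 2,
                   0, 0, 0, 7,
                   1, 4, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 paste0("s", 1:4)))

minZeroPortion <- 0.5

# Fraction of zero values per gene.
zero_fraction <- rowMeans(counts == 0)

# Drop genes whose values are zero in more than half of the samples.
# Assumption: a gene with exactly half zeros is kept.
kept <- counts[zero_fraction <= minZeroPortion, , drop = FALSE]
print(rownames(kept))
```

Here geneB is zero in three of four samples and is removed, while geneA and geneC remain.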

keepAll

Logical. After applying minZeroPortion filtering, should Linnorm keep all genes in the results? Defaults to TRUE.

num_PC

Integer >= 2. Number of principal components to be used in k-means clustering. Defaults to 2.

num_center

Numeric vector. Number of clusters to be tested for k-means clustering. fpc, vegan, mclust and apcluster packages are used to determine the number of clusters needed. If only one number is supplied, it will be used and this test will be skipped. Defaults to c(1:20).

Group

Character vector with length equal to the number of samples. Each element of this vector corresponds to one column (sample) of the datamatrix. This is for plotting purposes only. In the plot, the shape of the point that represents each sample indicates its group assignment. Defaults to NULL.

Coloring

Character. "kmeans" or "Group". If Group is not NULL, coloring in the PCA plot will reflect each sample's group. Otherwise, coloring will reflect the k-means clustering results. Defaults to "Group".

pca.scale

Logical. Sets the "scale." argument of the prcomp function (principal component analysis). When TRUE, the variables are scaled to unit variance before the analysis takes place. Defaults to FALSE.
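
The effect of the "scale." argument can be seen with prcomp alone. The following sketch uses simulated data with made-up variable names and does not involve Linnorm.

```r
set.seed(1)

# Hypothetical data: 20 observations (rows) of 3 variables (columns)
# measured on very different scales.
x <- cbind(a = rnorm(20, sd = 1),
           b = rnorm(20, sd = 10),
           c = rnorm(20, sd = 100))

pca_unscaled <- prcomp(x, scale. = FALSE)
pca_scaled   <- prcomp(x, scale. = TRUE)

# Without scaling, the total variance is the sum of the column variances,
# so the large-scale variable dominates the components.
total_var_unscaled <- sum(pca_unscaled$sdev^2)

# With scaling, every variable has unit variance, so the total variance
# equals the number of variables.
total_var_scaled <- sum(pca_scaled$sdev^2)
print(total_var_scaled)
```

Scaling is mainly useful when variables are on different scales and none of them should dominate the principal components.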

kmeans.iter

Numeric. Number of iterations in k-means clustering. Defaults to 2000.

Details

This function performs PCA clustering using Linnorm transformation.
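
As a rough illustration of the PCA and k-means steps, the sketch below uses only base R on simulated counts. It is not Linnorm's implementation: log1p stands in for the Linnorm transformation, the number of clusters is fixed at 2 rather than chosen automatically, and kmeans.iter is assumed to correspond to the iter.max argument of kmeans.

```r
set.seed(42)

# Hypothetical toy data: 50 genes (rows) by 12 samples (columns),
# simulated as two groups of 6 samples with different expression levels.
genes <- 50
expr <- cbind(matrix(rpois(genes * 6, lambda = 5),  nrow = genes),
              matrix(rpois(genes * 6, lambda = 50), nrow = genes))
colnames(expr) <- paste0("cell", 1:12)

# Stand-in transformation (NOT Linnorm), used only to keep the example
# self-contained.
transformed <- log1p(expr)

# prcomp expects observations in rows, so transpose to samples x genes.
pca <- prcomp(t(transformed), scale. = FALSE)

# Cluster the samples on the first num_PC principal components.
num_PC <- 2
scores <- pca$x[, 1:num_PC]
km <- kmeans(scores, centers = 2, iter.max = 2000, nstart = 10)

# Cluster assignment of each sample.
print(km$cluster)
```

The `cluster` element of the kmeans result holds one cluster label per sample, in the same order as the columns of the input matrix.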

Value

It returns a list with the following objects:

  • k_means: Output of kmeans (k-means clustering) from the stats package. Note: It contains a "cluster" object that indicates each sample's cluster assignment.

  • PCA: Output of prcomp (principal component analysis) from the stats package.

  • plot: Plot of PCA clustering.

  • Linnorm: Linnorm transformed and filtered data matrix.

Examples

# Obtain example matrix:
data(Islam2011)
# Example:
PCA.results <- Linnorm.PCA(Islam2011)
# Cluster assignment of each sample:
PCA.results$k_means$cluster