
RJclust

Unsupervised learning clustering algorithm for data where n << p.

NOTE TO DR. GAYNANOVA: For this package, the README and the vignette both contain code examples that demonstrate the package's functionality.

Link to original paper

The RJclust algorithm is an unsupervised learning algorithm that finds clusters in data where n << p. It first computes the matrix XX^t, which dramatically reduces the size of the data (from n x p to n x n). Next, it divides the data into smaller sections and uses the popular 'mclust' method to find clusters within each section. It then clusters again based on the clusters found in those sections. Finally, it assigns each observation to one of the resulting clusters.
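
The size reduction described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the simulated data, the scaling by p, and the use of hclust as a stand-in for the mclust step are all assumptions made only for illustration.

```r
# Illustrative sketch only: base R stand-ins, not the RJclust implementation.
set.seed(1)
n <- 20      # observations
p <- 1000    # features, so n << p

# Two simulated groups of 10 observations with feature means +1 and -1.
truth <- rep(c(1, 2), each = n / 2)
X <- matrix(rnorm(n * p), nrow = n) + ifelse(truth == 1, 1, -1)

# Gram matrix XX^t (scaled by p here): an n x n summary of the n x p data.
G <- tcrossprod(X) / p
dim(G)

# Stand-in for the model-based step: hierarchical clustering on the rows of G.
found <- cutree(hclust(dist(G)), k = 2)
table(found, truth)
```

Entries of G are large for pairs of observations from the same group and small for pairs from different groups, which is why clustering can work on the n x n matrix instead of the full n x p data.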

Right now, we are working on applying it to cancer data. This package lets users apply the RJ clustering algorithm to a dataset and get back the clusters, along with appropriate graphics and several functions we use for analysis. The cancer data, or some simulated data, will be made available with the package for testing, since the algorithm is intended for certain pre-processed data.

This is the start of the RJclust algorithm implementation. Right now, the implemented functions focus mainly on TCGA data (note that all TCGA-specific functions start with 'TCGA').

The algorithm was developed by Dr. Valen Johnson and Dr. Shahina Rahman; the coding and the TCGA functions were implemented by Rachael Shudde.

Generating test data
Cleaning and pre-processing TCGA data (C++ implementation)
Finding important genes given a TCGA dataset and RJ classification results (C++ implementation)
Finding AMI value given two cluster results
Matrix operations specific to the types of data expected in C++
Finding and plotting important patients (C++ functionality)
Core of RJ algorithm (C++ functionality)
Uploading TCGA examples online so that users can access
Function to recompute the RJ algorithm after removing patients with high gene expressions

Vignette
More comprehensive data checks / warnings
Some type of plotting ability or analysis of RJ clust
Adding function to run RJclust again without important genes

To install, run the code below. You may need to install some libraries to make it work since this package is not from CRAN.

library(devtools)
devtools::install_github("rshudde/RJclust")

To install and build the vignette, run:

library(devtools)
devtools::install_github("rshudde/RJclust", build = TRUE, build_opts = c("--no-resave-data", "--no-manual"))

Here is a quick example of the RJclust algorithm applied to some TCGA ovarian cancer data that comes with the package:

data(OV)
X = TCGA_cleanData(OV)
clust = RJclust(X, 3)
clust$G
table(clust$classification)

You can explore the clust object created to see things like the number of clusters found ($G) or the classification of each observation ($classification).

Here is an example using simulated data with 4 clusters (this will take about a minute to run, depending on your machine):

data = generateSimulationData()
dim(data$X)
scaledX = scaleRJ(data$X)
clust = RJclust(scaledX, 150)
clust$G
table(clust$classification, data$Y)
f_rez(clust$classification, data$Y)

f_rez measures how closely the classification matches the true labels (on a scale from 0 to 1).
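
To show what a 0-to-1 agreement score between two label vectors can look like, here is a small base R sketch of the Rand index. This README does not say which measure f_rez computes, so rand_index below is a hypothetical stand-in and not the package's function.

```r
# Hypothetical helper (not part of RJclust): Rand index between two labelings.
# It is the fraction of observation pairs on which both labelings agree,
# i.e. both put the pair in the same cluster or both put it in different ones.
rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)   # count each unordered pair once
  mean(same_a[pairs] == same_b[pairs])
}

truth      <- c(1, 1, 1, 2, 2, 2)
relabelled <- c(2, 2, 2, 1, 1, 1)   # same partition, different label names
one_error  <- c(1, 1, 2, 2, 2, 2)   # third observation misassigned

rand_index(truth, relabelled)   # 1: label names do not matter
rand_index(truth, one_error)    # below 1
```

A score of this kind does not depend on how the clusters are numbered, which matters because a clustering algorithm has no way to know the original label names.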

TCGA_cleanData - preprocess TCGA data (transpose and remove columns with too many zeroes)
TCGA_getImportantPatients - get the IDs of patients who have high gene expressions
TCGA_plotImportantPatients - plot the results from TCGA_getImportantPatients
TCGA_getImportantGenes - after clustering TCGA data and obtaining the classification results, see which genes have the highest expression in a particular cluster
TCGA_getImportantGenes - plot the results of TCGA_getImportantGenes
TCGA_RJclust_removePatients - rerun the clustering algorithm after removing the patients found by TCGA_getImportantPatients

f_rez - compare how close two vectors of classification match
generateSimulationData - generate a large simulation dataset to use for testing
RJclust - function to do RJ clustering
scaleRJ - scale a matrix

If your num_cut variable is too large, you might run into issues. The current recommendation for num_cut is sqrt(p).
This code is specifically meant for TCGA data, but can be used on other datasets. Non-TCGA datasets might not perform as expected in this release.
Currently, running TCGA_cleanData twice will result in a dataset with no columns (this is by design).
There is an Rdata set attached ('OV') that is larger than expected, but I left this in on purpose since the algorithm in its current form is meant for TCGA data.
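
As a small illustration of the sqrt(p) recommendation in the notes above, the sketch below computes a candidate num_cut from the number of columns of a matrix. The helper name suggest_num_cut is hypothetical, rounding down to an integer is an assumption, and the sketch assumes observations are in rows and features in columns.

```r
# Hypothetical helper (not part of RJclust): candidate num_cut for a matrix
# with observations in rows and p features in columns.
suggest_num_cut <- function(X) {
  p <- ncol(X)
  floor(sqrt(p))   # rounding down is an assumption, not a package rule
}

X <- matrix(0, nrow = 5, ncol = 10000)
suggest_num_cut(X)   # 100
```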

rshudde/RJclust documentation built on Dec. 8, 2019, 4:06 p.m.
