README.md

RJclust

An unsupervised clustering algorithm for data with far fewer observations than features (n << p).

NOTE TO DR. GAYNANOVA: For this package, the README and vignette both include code examples that demonstrate the package's functionality.

Description

Link to original paper

The RJclust algorithm is an unsupervised learning algorithm that finds clusters in data where n << p (n observations, p features). It first computes the XX^t matrix, which dramatically reduces the size of the data (from n x p to n x n). Next, it divides the data into smaller sections and uses the popular 'mclust' method to find clusters within each section. It then clusters again based on the clusters found in those sections. Finally, it assigns each observation to one of the resulting clusters.
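
The size reduction in the first step can be illustrated with a minimal base R sketch. This is not the package's internal code: the simulated matrix, the variable names, and the use of `scale()` as a stand-in for the package's own scaling are assumptions made only for illustration.

```r
# Illustrative sketch only; not the RJclust package's implementation.
set.seed(1)
n <- 20     # number of observations
p <- 1000   # number of features, so n << p
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Center and scale each column (assumed stand-in; the package's scaleRJ may differ)
Z <- scale(X)

# n x n matrix of inner products between observations, i.e. Z Z^T
G <- tcrossprod(Z)

dim(X)          # n x p
dim(G)          # n x n
isSymmetric(G)  # inner-product matrices are symmetric
```

Any clustering done on `G` works with an n x n object regardless of how large p is, which is where the savings come from when n << p.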

We are currently working on applying the algorithm to cancer data. This package lets users apply the RJ clustering algorithm to a dataset and get back the clusters, along with relevant graphics and several functions we use for analysis. The cancer data, or some simulated data, will be made available with the package for testing, since the algorithm is intended for certain pre-processed data.

This is an early implementation of the RJclust algorithm. The functions implemented so far focus mainly on TCGA data (all TCGA-specific functions start with 'TCGA').

The algorithm was developed by Dr. Valen Johnson and Dr. Shahina Rahman; the code and TCGA functions were implemented by Rachael Shudde.

Current functionality implemented

Coming in the future

Installation

This package is not on CRAN, so install it from GitHub with the code below. You may need to install devtools and other dependencies first.

library(devtools)
devtools::install_github("rshudde/RJclust")

To install and build the vignette, run:

library(devtools)
devtools::install_github("rshudde/RJclust", build = TRUE, build_opts = c("--no-resave-data", "--no-manual"))

Sample run on TCGA data

Here is a quick example of the RJclust algorithm applied to some TCGA ovarian cancer data that comes with the package:

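# Load the TCGA ovarian cancer data bundled with the package
data(OV)
# Pre-process the raw TCGA data
X = TCGA_cleanData(OV)
clust = RJclust(X, 3)
# Number of clusters found
clust$G
# Number of observations assigned to each cluster
table(clust$classification)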

You can explore the resulting clust object to see things like the number of clusters found ($G) or the classification of each observation ($classification).

Sample run on simulation data

Here is an example using simulated data with 4 clusters (this will take about a minute to run, depending on your machine):

data = generateSimulationData()
dim(data$X)
scaledX = scaleRJ(data$X)
clust = RJclust(scaledX, 150)
clust$G
table(clust$classification, data$Y)
f_rez(clust$classification, data$Y)

The f_rez function measures how closely the classification matches the true labels in data$Y, on a scale from 0 to 1.
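
This README does not say which metric f_rez computes. As an illustration of a 0 to 1 agreement score that does not depend on how clusters are numbered, here is a sketch of a pairwise F-measure in base R. The helper `pairwise_f` is hypothetical and not part of RJclust; it may differ from what f_rez does.

```r
# Hypothetical helper, not part of RJclust: pairwise F-measure between two labelings.
pairwise_f <- function(pred, truth) {
  stopifnot(length(pred) == length(truth))
  idx <- utils::combn(length(pred), 2)  # every pair of observations, one per column
  same_pred  <- pred[idx[1, ]]  == pred[idx[2, ]]   # pair shares a predicted cluster
  same_truth <- truth[idx[1, ]] == truth[idx[2, ]]  # pair shares a true class
  tp <- sum(same_pred & same_truth)
  fp <- sum(same_pred & !same_truth)
  fn <- sum(!same_pred & same_truth)
  if (tp == 0) return(0)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

truth <- c(1, 1, 1, 2, 2, 2)
score_same  <- pairwise_f(c(2, 2, 2, 1, 1, 1), truth)  # same partition, labels swapped
score_split <- pairwise_f(c(1, 1, 2, 2, 3, 3), truth)  # partition that splits both classes
score_same
score_split
```

Because the score is built from pairs of observations, it is unaffected by relabeling: the first call returns 1 even though the cluster numbers are swapped, while the second returns a value below 1.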

Short description of available functions

TCGA functions

Other functions

Quick notes on some current behavior of the package



rshudde/RJclust documentation built on Dec. 8, 2019, 4:06 p.m.