# knitr::opts_chunk$set(
#   collapse = TRUE
# )
library(RJclust)

RJ CLUST

The purpose of this package is to implement the scaled RJ clust algorithm. This algorithm has not been completely finihsed, but is functional for TCGA datasets and datasets with no missing values. The README file has more information.

The purpose of this vignette is to walk through an example with a small TCGA cancer dataset to see how the algorithm should be used.

Step 1 - exploratory data analysis and preprocessing

This package is meant to be used on TCGA data. These datasets are too large to upload with the package (the BRCA datset is 1000 x 20350), so I have provided a small dataset of 304 observations of ovarian cancer with 1000 randomely selected genes.

First, let's look at the data. You can see that the data is originally set up as a 1000x306 matrix.

data(OV) # load data
print(dim(OV))
OV[1:5, 3:5]

Now, we can use the TCGA_cleanData function. This function does several things, including transposing the data and removing columns where the percent of zero entries are above some threshold. Note this dataset doesn't have many columns with a high percentage of zeroes. Note 110 columns were removed, some because they had too many zeroes, some for various other reasons (no gene name, for example).

X = TCGA_cleanData(OV, 50)
print(dim(X))

Step 2 - run RJ algorithm

Let's run the RJ algorithm and look at the results.

clust = RJclust(X, 3)

We can see there are 3 clusters, one larger cluster and two smaller clusters.

clust$G
table(clust$classification)

There are many more things we can display and learn about the clust object - including the model name, loglikelihood, degrees of freedom, etc.

clust

Step 3 - examine patients and genes

We can look at the $XX^t$ matrix and see which columns have the largest means - these patients might be really influential in clustering. It may be that there was a mistake in collecting their data or they have some really high gene expressions (I am not a biologist and can give no better explanation). We can look at the head of the 'patients' dataframe and look at waht patients might be the most important (note they all come frmo the same lab).

patients = TCGA_getImportantPatients(X)

head(patients$patientSums)

We can plot these results to get some indication of what to expect - we see that the top 25 patients we plotted were well above the mean

TCGA_plotImportantPatients(patients$patientSums, 25, patients$mean)

Step 3 - rerun RJ clust without top 25 patients

We can rerun the algorithm without these patients with high gene expressions and see if we get different results.

clustRemoved = TCGA_RJclust_removePatients(X, 3, 25)

Step 4 - compare classification results

Let's see how similar the two different RJclust runs are by using the 'f_rez' function - this calculates the AMI of the two calculations. First we need to get the overlap of patients, then we can compare. We can see that removing those 25 patients did cause the classification results to change pretty drastically.

clustClass = as.data.frame(clust$classification)
clustRemovedClass = as.data.frame(clustRemoved$classification)
mergedRJresults = merge(clustClass, clustRemovedClass, by = "row.names")
colnames(mergedRJresults) = c("Patients", "Original", "Removed")

f_rez(mergedRJresults$Original, mergedRJresults$Removed)

Step 5 - look at imporant genes

We can now look at the original RJ results and see which genes had the largest impact on classification and plot the results. Note that many of the 'most important genes' are the same accross the three groups. Note here we tell it to plot the top 25 genes, but the algorithm is finding fewer genes were above the 2nd standard deviation, so it's only plotting the significant ones. This produces a warning, but we can ignore it.

group1 = TCGA_getImportantGenes(OV, clust$classification, 50, cluster = 1)
group2 = TCGA_getImportantGenes(OV, clust$classification, 50, cluster = 2)
group3 = TCGA_getImportantGenes(OV, clust$classification, 50, cluster = 3)

Now we can plot the results (note once again the warning is expected - just said there are not 25 genes with expression levels over 2*standard deviation found)

par(mfrow = c(2,2))
TCGA_plotImportantGenes(group1$importanceList, 25, group1$mean)
TCGA_plotImportantGenes(group2$importanceList, 25, group2$mean)
TCGA_plotImportantGenes(group2$importanceList, 25, group2$mean)

Let's run the same code, but on the RJclust results with the 25 patients with high gene expressions removed - you can see there is a big difference!

group1_R = TCGA_getImportantGenes(OV, clustRemoved$classification, 50, cluster = 1)
group2_R = TCGA_getImportantGenes(OV, clustRemoved$classification, 50, cluster = 2)
group3_R = TCGA_getImportantGenes(OV, clustRemoved$classification, 50, cluster = 3)

Now we can plot the results (note once again the warning is expected - just said there are not 25 genes with expression levels over 2*standard deviation found)

par(mfrow = c(2,2))
TCGA_plotImportantGenes(group1_R$importanceList, 25, group1_R$mean)
TCGA_plotImportantGenes(group2_R$importanceList, 25, group2_R$mean)
TCGA_plotImportantGenes(group3_R$importanceList, 25, group3_R$mean)

Conclusion

Overall, these results are not meaningful since the data is run on a subsetted dataset. The purpose of this vignette is to show the usefulness of this algorithm. In the future, more functionality will be provided, like comparing variables like race, sex, ER positive status, etc across different clusters.



rshudde/RJclust documentation built on Dec. 8, 2019, 4:06 p.m.