knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
# library(idepGolem) devtools::load_all()
# Make data for pre-processing functions idep_data <- get_idep_data() # YOUR_DATA_PATH <- "E:/idep_9_24/data/data103/data_go/BcellGSE71176_p53.csv" # YOUR_EXPERIMENT_PATH <- "E:/idep_9_24/data/data103/data_go/BcellGSE71176_p53_sampleInfo.csv" DATABASE <- Sys.getenv("GE_DATABASE")[1] YOUR_DATA_PATH <- paste0(DATABASE, "data_go/BcellGSE71176_p53.csv") YOUR_EXPERIMENT_PATH <- paste0(DATABASE, "data_go/BcellGSE71176_p53_sampleInfo.csv") expression_file <- data.frame( datapath = YOUR_DATA_PATH ) experiment_file <- data.frame( datapath = YOUR_EXPERIMENT_PATH ) load_data <- input_data( expression_file = expression_file, experiment_file = experiment_file, go_button = FALSE, demo_data_file = idep_data$demo_data_file, demo_metadata_file = idep_data$demo_metadata_file ) converted <- convert_id( query = rownames(load_data$data), idep_data = idep_data, select_org = "BestMatch" ) all_gene_info <- gene_info( converted = converted, select_org = "BestMatch", idep_data = idep_data ) converted_data <- convert_data( converted = converted, no_id_conversion = FALSE, data = load_data$data ) gene_names <- get_all_gene_names( mapped_ids = converted_data$mapped_ids, all_gene_info = all_gene_info ) processed_data <- pre_process( data = converted_data$data, missing_value = "geneMedian", data_file_format = 1, low_filter_fpkm = NULL, n_min_samples_fpkm = NULL, log_transform_fpkm = NULL, log_start_fpkm = NULL, min_counts = .5, n_min_samples_count = 1, counts_transform = 1, counts_log_start = 4, no_fdr = NULL )
Clustering is another visualization topic that builds off the first two tabs of the iDEP application. The package provides methods to cluster and visualize the processed data. There are other visualization tools in this package that go along with the clustering that is performed. There is the option to perform hierarchical or k-means clustering for the main heatmap which will be covered in the next section. Using the functions below require to first read and use the operations in the Load_Data and Pre_Process instructions.
The sections below help pick parameters for the clustered heatmap.
The main heatmap for the clustering has inputs for the number of clusters in the
case of k-means and the number of genes to plot in the heatmap. There are two
plots that can help determine good inputs for these parameters. The first is the
function sd_density
and shows the standard deviation distribution of the
processed data. There are a max and min genes parameters that will provide lines
to show the standard deviation of the desired range of genes. This will help
give an idea of the range of genes to plot in the heatmap.
sd_density( data = processed_data$data, n_genes_max = 200 )
To select the number of clusters to use in the case fo k-means, we can use the
iDEP function to create an elbow plot. To use this function, we are first going
to have to subset the data with the process_heatmap_data
function. The input
parameters for this function can be viewed in the help documentation for the
function. We will store the data matrix with the transformations performed to it
in the object heatmap_data
. This data frame will be used in the function to
create the scree plot. These two functions are demonstrated below.
For information on how to analyze and interpret the elbow plot, please visit this link https://bl.ocks.org/rpgove/0060ff3b656618e9136b.
heatmap_data <- process_heatmap_data( data = processed_data$data, n_genes_max = 200, # n_genes_min = 50, gene_centering = TRUE, gene_normalize = TRUE, sample_centering = TRUE, sample_normalize = TRUE, all_gene_names = gene_names, select_gene_id = "symbol" ) k_means_elbow(heatmap_data = heatmap_data)
Now that we have chosen the number of genes to plot and the heatmap data has
been centered and standardized appropriately. cluster_meth
provides two
different methods that can be used for the heatmap. The input options can be
seen in the table below.
Input Description ------------ ----------------- "1" Hierarchical clustering (Visit https://www.displayr.com/what-is-hierarchical-clustering/) "2" k-Means clustering (Visit https://stanford.edu/~cpiech/cs221/handouts/kmeans.html)
The next input heatmap_cutoff
, will filter the data to ensure that there are
no outliers that would throw off the color scaling of the heatmap. sample_info
is the same experiment file that has been used in the previous steps. This will
control the color bar legend on the heatmap and how the samples are grouped. The
options for this input can be found with the call
print(c("Sample_Name", colnames(pre_process$sample_info())))
. There are
3 choices for the distance method to be used. The function call dist_functions
give the three methods, with a numeric string input to specify which one to use.
Using the distance method, there are 6 possible inputs for the clustering
method. Details on these methods are in the table below.
Distance Input Method Descrpition
"1" 1 - (Pearson's Correlation Coefficient) (Pearson's Correlation Visit https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/)
"2" 1 - abs(Pearson's Correlation Coefficient)
"3" Euclidean distance with the dist
function in R (Visit https://www.cuemath.com/euclidean-distance-formula/)
Clustering Input Method Description
"average" The distance between two clusters is the average distance between an observation in one cluster and an observation in the other cluster. (Visit for all methods https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/multivariate/how-to/cluster-observations/methods-and-formulas/linkage-methods/#median)
"complete" The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster.
"single" The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster.
"median" The distance between two clusters is the median distance between an observation in one cluster and an observation in the other cluster.
"centroid" The distance between two clusters is the distance between the cluster centroids or means.
"mcquitty" The distance is calculated with a distance matrix. See the link above for formula and details.
The inputno_sample_clustering
controls whether the columns of the heatmap are
clustered or not (TRUE/FALSE). heatmap_color_select
sets the color scale that
is used in the heatmap. A vector of three colors should be inputted as
demonstrated below. row_dend
is TRUE/FALSE for whether to plot the row
dendogram with the heatmap or not. For k-means clusterings, use k_clusters
to
specify the number of clusters to use. re-run
is a shiny input and should be
set to FALSE. An example heatmap can be seen below, and is stored in the object
ht
.
ht <- heatmap_main( data = heatmap_data, cluster_meth = "1", heatmap_cutoff = 4, sample_info = load_data$sample_info, select_factors_heatmap = "Sample_Name", dist_funs = dist_functions(), dist_function = "1", hclust_function = "average", no_sample_clustering = FALSE, heatmap_color_select = c("red", "black", "green"), row_dend = TRUE, k_clusters = NULL, re_run = FALSE )
Due to the interactivity used in the shiny app, the heatmap function shows very little information. To create an interactive app with the heatmap from above, use the code below:
InteractiveComplexHeatmap::ht_shiny(ht)
Another useful visualization from the iDEP package is a correlation matrix to
show the correaltion between the different samples in the expression data. It
shows the relationship between the expression values for the different samples.
The values in the matrix are primarily strong positive correlations. This is
because for the vast majority of the genes the expression values are very
similar. The color scale in the correlation plot helps illustrate the slight
differences in correlation and what samples have the strongest relationship. The
function cor_plot
will create this plot. The data
input is the entire
processed datat matrix. label_pcc
as TRUE will paste the correlation values in
each cell of the correlation heatmap. heat_cols
is a vector of three colors
to use for the heatmap color scale. text_col
controls the color of the
correlation labels. The code block below demonstrates this function.
cor_plot( data = processed_data$data, label_pcc = TRUE, heat_cols = c("red", "black", "green"), text_col = "white" )
Another clustering feature is the function to draw a dendogram to show the
clustering of the samples. This will give an idea of how closely the expression
data of each sample is linked to another sample. The clustering is hierarchical
and uses the same functions that were described for the heatmap above. We can
perform centering and normalizing across genes and samples with the
corresponding input parameters. hcluster_functions
stores the different
linkage methods that can be used with the dendogram. The method is specified
differently than the heatmap above. hcluster_functions()
is filled into
hclust_funs
, and the desired method is inputted in hcluster_function
. The
input for the distance method is the same as the heatmap and is explained above.
The chunk below shows an example call.
draw_sample_tree( tree_data = processed_data$data, gene_centering = TRUE, gene_normalize = FALSE, sample_centering = FALSE, sample_normalize = FALSE, hclust_funs = hcluster_functions(), hclust_function = "average", dist_funs = dist_functions(), dist_function = "1" )
It is possible with iDEP to run enrichment analysis with the genes that are
selected in the process_heatmap_data
function. This process is explained in
the "Enrichment" instruction. If you wish to perform enrichment with the
heatmap_data
, save that object in your environment and use the functions
described in that document.
This document covered the function used in the Clustering tab of the iDEP webpage. It demonstrated many visualization techniques, but is in no way exhaustive of the methods that can be used with the loaded data. For troubleshooting, all functions have documentation and the code is available on Github.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.