knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# library(idepGolem)
devtools::load_all()
# Make data for pre-processing functions
idep_data <- get_idep_data()

# YOUR_DATA_PATH <- "E:/idep_9_24/data/data103/data_go/BcellGSE71176_p53.csv"
# YOUR_EXPERIMENT_PATH <- "E:/idep_9_24/data/data103/data_go/BcellGSE71176_p53_sampleInfo.csv"

DATABASE <- Sys.getenv("GE_DATABASE")[1]
YOUR_DATA_PATH <- paste0(DATABASE, "data_go/BcellGSE71176_p53.csv")
YOUR_EXPERIMENT_PATH <- paste0(DATABASE, "data_go/BcellGSE71176_p53_sampleInfo.csv")


expression_file <- data.frame(
  datapath = YOUR_DATA_PATH
)
experiment_file <- data.frame(
  datapath = YOUR_EXPERIMENT_PATH
)

load_data <- input_data(
  expression_file = expression_file,
  experiment_file = experiment_file,
  go_button = FALSE,
  demo_data_file = idep_data$demo_data_file,
  demo_metadata_file = idep_data$demo_metadata_file
)

converted <- convert_id(
  query = rownames(load_data$data),
  idep_data = idep_data,
  select_org = "BestMatch"
)

all_gene_info <- gene_info(
  converted = converted,
  select_org = "BestMatch",
  idep_data = idep_data
)

converted_data <- convert_data(
  converted = converted,
  no_id_conversion = FALSE,
  data = load_data$data
)

gene_names <- get_all_gene_names(
  mapped_ids = converted_data$mapped_ids,
  all_gene_info = all_gene_info
)

processed_data <- pre_process(
  data = converted_data$data,
  missing_value = "geneMedian",
  data_file_format = 1,
  low_filter_fpkm = NULL,
  n_min_samples_fpkm = NULL,
  log_transform_fpkm = NULL,
  log_start_fpkm = NULL,
  min_counts = .5,
  n_min_samples_count = 1,
  counts_transform = 1,
  counts_log_start = 4,
  no_fdr = NULL
)

Clustering is another visualization topic that builds off the first two tabs of the iDEP application. The package provides methods to cluster and visualize the processed data. There are other visualization tools in this package that go along with the clustering that is performed. There is the option to perform hierarchical or k-means clustering for the main heatmap which will be covered in the next section. Using the functions below require to first read and use the operations in the Load_Data and Pre_Process instructions.

Determine Cluster Parameters

The sections below help pick parameters for the clustered heatmap.

Gene Range

The main heatmap for the clustering has inputs for the number of clusters in the case of k-means and the number of genes to plot in the heatmap. There are two plots that can help determine good inputs for these parameters. The first is the function sd_density and shows the standard deviation distribution of the processed data. There are a max and min genes parameters that will provide lines to show the standard deviation of the desired range of genes. This will help give an idea of the range of genes to plot in the heatmap.

sd_density(
  data = processed_data$data,
  n_genes_max = 200
)


k-Clusters

To select the number of clusters to use in the case fo k-means, we can use the iDEP function to create an elbow plot. To use this function, we are first going to have to subset the data with the process_heatmap_data function. The input parameters for this function can be viewed in the help documentation for the function. We will store the data matrix with the transformations performed to it in the object heatmap_data. This data frame will be used in the function to create the scree plot. These two functions are demonstrated below.

For information on how to analyze and interpret the elbow plot, please visit this link https://bl.ocks.org/rpgove/0060ff3b656618e9136b.

heatmap_data <- process_heatmap_data(
  data = processed_data$data,
  n_genes_max = 200,
  # n_genes_min = 50,
  gene_centering = TRUE,
  gene_normalize = TRUE,
  sample_centering = TRUE,
  sample_normalize = TRUE,
  all_gene_names = gene_names,
  select_gene_id = "symbol"
)

k_means_elbow(heatmap_data = heatmap_data)


Main Heatmap

Now that we have chosen the number of genes to plot and the heatmap data has been centered and standardized appropriately. cluster_meth provides two different methods that can be used for the heatmap. The input options can be seen in the table below.

Input Description ------------ ----------------- "1" Hierarchical clustering (Visit https://www.displayr.com/what-is-hierarchical-clustering/) "2" k-Means clustering (Visit https://stanford.edu/~cpiech/cs221/handouts/kmeans.html)

The next input heatmap_cutoff, will filter the data to ensure that there are no outliers that would throw off the color scaling of the heatmap. sample_info is the same experiment file that has been used in the previous steps. This will control the color bar legend on the heatmap and how the samples are grouped. The options for this input can be found with the call print(c("Sample_Name", colnames(pre_process$sample_info()))). There are 3 choices for the distance method to be used. The function call dist_functions give the three methods, with a numeric string input to specify which one to use. Using the distance method, there are 6 possible inputs for the clustering method. Details on these methods are in the table below.

Distance Input Method Descrpition


"1" 1 - (Pearson's Correlation Coefficient) (Pearson's Correlation Visit https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/) "2" 1 - abs(Pearson's Correlation Coefficient) "3" Euclidean distance with the dist function in R (Visit https://www.cuemath.com/euclidean-distance-formula/)

Clustering Input Method Description


"average" The distance between two clusters is the average distance between an observation in one cluster and an observation in the other cluster. (Visit for all methods https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/multivariate/how-to/cluster-observations/methods-and-formulas/linkage-methods/#median) "complete" The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. "single" The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster.
"median" The distance between two clusters is the median distance between an observation in one cluster and an observation in the other cluster. "centroid" The distance between two clusters is the distance between the cluster centroids or means. "mcquitty" The distance is calculated with a distance matrix. See the link above for formula and details.

The inputno_sample_clustering controls whether the columns of the heatmap are clustered or not (TRUE/FALSE). heatmap_color_select sets the color scale that is used in the heatmap. A vector of three colors should be inputted as demonstrated below. row_dend is TRUE/FALSE for whether to plot the row dendogram with the heatmap or not. For k-means clusterings, use k_clusters to specify the number of clusters to use. re-run is a shiny input and should be set to FALSE. An example heatmap can be seen below, and is stored in the object ht.

ht <- heatmap_main(
  data = heatmap_data,
  cluster_meth = "1",
  heatmap_cutoff = 4,
  sample_info = load_data$sample_info,
  select_factors_heatmap = "Sample_Name",
  dist_funs = dist_functions(),
  dist_function = "1",
  hclust_function = "average",
  no_sample_clustering = FALSE,
  heatmap_color_select = c("red", "black", "green"),
  row_dend = TRUE,
  k_clusters = NULL,
  re_run = FALSE
)

Due to the interactivity used in the shiny app, the heatmap function shows very little information. To create an interactive app with the heatmap from above, use the code below:

InteractiveComplexHeatmap::ht_shiny(ht)


Correlation Matrix

Another useful visualization from the iDEP package is a correlation matrix to show the correaltion between the different samples in the expression data. It shows the relationship between the expression values for the different samples. The values in the matrix are primarily strong positive correlations. This is because for the vast majority of the genes the expression values are very similar. The color scale in the correlation plot helps illustrate the slight differences in correlation and what samples have the strongest relationship. The function cor_plot will create this plot. The data input is the entire processed datat matrix. label_pcc as TRUE will paste the correlation values in each cell of the correlation heatmap. heat_cols is a vector of three colors to use for the heatmap color scale. text_col controls the color of the correlation labels. The code block below demonstrates this function.

cor_plot(
  data = processed_data$data,
  label_pcc = TRUE,
  heat_cols = c("red", "black", "green"),
  text_col = "white"
)


Sample Tree

Another clustering feature is the function to draw a dendogram to show the clustering of the samples. This will give an idea of how closely the expression data of each sample is linked to another sample. The clustering is hierarchical and uses the same functions that were described for the heatmap above. We can perform centering and normalizing across genes and samples with the corresponding input parameters. hcluster_functions stores the different linkage methods that can be used with the dendogram. The method is specified differently than the heatmap above. hcluster_functions() is filled into hclust_funs, and the desired method is inputted in hcluster_function. The input for the distance method is the same as the heatmap and is explained above. The chunk below shows an example call.

draw_sample_tree(
  tree_data = processed_data$data,
  gene_centering = TRUE,
  gene_normalize = FALSE,
  sample_centering = FALSE,
  sample_normalize = FALSE,
  hclust_funs = hcluster_functions(),
  hclust_function = "average",
  dist_funs = dist_functions(),
  dist_function = "1"
)


Running Enrichment

It is possible with iDEP to run enrichment analysis with the genes that are selected in the process_heatmap_data function. This process is explained in the "Enrichment" instruction. If you wish to perform enrichment with the heatmap_data, save that object in your environment and use the functions described in that document.

Conclusion

This document covered the function used in the Clustering tab of the iDEP webpage. It demonstrated many visualization techniques, but is in no way exhaustive of the methods that can be used with the loaded data. For troubleshooting, all functions have documentation and the code is available on Github.



espors/idepGolem documentation built on Oct. 27, 2024, 4:56 a.m.