knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The typical use of the screp package assumes the use of the Seurat package to cluster cells and identify markers.This document provides a walkthorugh of the package using the pbmc data set from 10X Genomics. To begin the tutorial first load the screp and other packages needed. Then load the markers data which is the results data frame of the FindAllMarkers from the Seurat package using the pbmc data. After loading pbmc, obtain the marker table. Of course, any methods that can cluster the cells and provide genes associated with the clusters can be utilized.
## Load libraries library(screp) ## Load data set for tutorial data(markers) head(markers)
Now use the enrich_test function to perform the multiple enrichment tests for genes associated with each cluster, including enrichments in GO categories and in REACTOME Pathways.To use this function you must provide marker genes for each cluster either as a list of gene vectors or a data frame as the same format of the output from FindAllMarkers function from the Seurat package. The markers_df argument accepts the data frame of markers. when using the markers_df argument you can set the adjusted p-value you want to use to select for genes significantly associated with each cluster. We will select an adjusted p-value cutoff of 0.1 to select the markers for each cluster, which is the default of p_val argument in the enrich_test function. The function is defaulted to Human, but by adjusting the species_x and genome_genes arguments you can change species. To allow for the use of other methods to cluster the cells and identify associated genes, the user can provide a list containing vectors for significant genes for each cluster to the clust_list argument. Finaly, instead of allowing the function to generate the categories to test every time, the user can provide the categories generated by the msigdb_gsets function in the hypeR package in a named list in the cats_list argument. The list requires categories from BIOCARTA, KEGG, REACTOME, GOBP, GOMF, and GOCC and to be named as such.
### First example providing only the marker table enrich_results<-enrich_test(markers_df=markers) ### Second example generating and providing a clust_list ### Generate a list of gene vectors from the marker table sig.marks<-markers[markers$p_val_adj<0.1,] marks<-foreach(i=1:length(unique(sig.marks$cluster))) %do% { sig.marks[sig.marks$cluster==unique(sig.marks$cluster)[i],]$gene } names(marks)<-paste("cluster",unique(sig.marks$cluster),sep="_") ## check size of gene lists associated with clusters foreach(i=1:length(marks),.combine='c') %do% {length(marks[[i]])} ## example of running the function with the cluster gene list ## enrich_results<-enrich_test(clust_list=marks)
The enrich_results object is a list with two objects. The first is a list of the hypeR objects generated for each cluster that include all the results of the hypeR enrichment test including plots (for more info refer to the hypeR documentation). The second object in the enrich_results object is a list of data frames generated by the hypeR enrichment tests for each cluster.
## accessing the enrich_results object head(enrich_results$Enriched_lists$'0'$GOBP$data)[,2:5] head(enrich_results$Enriched_lists$'3'$REACTOME$data)[,2:5] head(enrich_results$Enriched_df$'0'$Pathways)[,2:5] head(enrich_results$Enriched_df$'2'$GOBP)[,2:5]
The enrich_results$Enriched_df object is used in the GO_visualization function. This function performs a semantic similairty analysis, then using the parent terms identified it generates new genesets to test enrichment by each cluster's genes for Gene Ontology-'Biological Processes' terms. The function then generates a list object containing a data frame and a ggplot object for plotting based on the parent terms from the semnatic similairty analysis. This function also utilises the marker data frame,markers_df, or cluster gene list, clust_genelist arguments. The user can choose the number of GO parent terms to display in the graph using the numcats argument, which is defaulted to 15. The default species is human, but can be changed using the species_x argument and choosing the appropriate organism database annotation package (Check Bioconductor for different organism annotation packages) using the org_db argument. This function will automatically select parent terms to graph by taking the top reoccuring parent terms across clusters. The object returned is a list with three components. The first component is the results of the semantic similarity analysis in a data frame, the second component is the data frame used for generating the ggplot object, and the final component is a ggplot2 object used for plotting. To use this function the user must provide the 'Enriched_df' from the results of the enrich_test function, vector of GOids named by GO terms to the GOterms argument, and markers via either the markers_df or clust_genelist arguments.
## Make the vector for the GO terms variable goterms<-AnnotationDbi::Term(GO.db::GOTERM) goterms<-prepGOTERMS(goterms) ## Run the GO_visualization function go_vis_res1<-GO_visualization(enrich_results$Enriched_df,markers_df=markers,goterms=goterms,numcats=10)
go_vis_res1$plot
In addition to showing the enrichment in each category across the clusters, this plot also has the percentage of each categorical geneset overlapping with the cluster associated genes plotted onto the dot size, i.e. bigger dots mean that a more of the genes in a category are found in the cluster marker genes. Now taking the results generated the user can then choose different Parent term categories to plot using the GO_viz_choose function. To use this function the user provides the above results, either the marker data frame or a cluster gene list, and the categories they want ploted using the chosen_cats argument.
## Take the results of the GO_visualization function and choose the categories to plot. newcats<-names(sort(table(go_vis_res1$GO_sem$parentTerm)))[1:10] newcats go_vis_res2<-GO_viz_choose(go_vis_res1,chosen_cats=newcats,markers_df=markers, goterms=goterms)
go_vis_res2$plot
Next we will generate a graph for the REACTOME pathway enrichment analysis. The REACTOME pathways are organized in a hierarchial structure. To construct this graph requires two tab-delimited text files from the REACTOME database website (https://reactome.org/download-data). The first file is is titled "Pathways hierarchy relationship", the second is the "Complete List of Pathways". The package includes both of these files downloaded in November 2020. However, the user may download the files and use the RPprep function to prepare the "Complete List of Pathways" for use with this package. The RPprep function only has two arguments, the first is the "Complete List of Pathways" object and the second is species argument. One thing to keep in mind is that even after using the RPprep function some NAs may be within the data frame and need to be handled before continuing. For this example, I have downloaded the text file from the REACTOME database.
## read in and prep the "Complete List of Pathways" file after downloading ## RP<-read.delim("ReactomePathways.txt",as.is=T,header=F) ## for this tutorial we will use the "Complete List of Pathways" file loaded as part of the package data(RP) RP<-RPprep(RP,"Homo sapiens") ## check for NAs before preceeding which(is.na(RP)) ## read in the "Pathways hierarchy relationship" file after downloading ## RPR<-read.delim("ReactomePathwaysRelation.txt",as.is=T,header=F) ## for this tutorial we will use the "Pathways hierarchy relationship" file loaded as part of the package data(RPR)
Now that we have the two reacomte related objects we can preceede to the generating a graph. The reactome_visualization function will generate a graph with the REACTOME pathways grouped under the top level pathways in each tree and provides and overivew or summary of the enrichment results. You can remove the highest level pathways for a specific tree to arrange the pathways under the 2nd tier pathways. In the exaample below we will remove the top level pathway "Immune system" and group the pathways under the three 2nd tier terms; "Adaptive Immune System", "Innate Immune System", and "Cytokine Signaling in the Immune System". This allows for some fine-tuning of the graph to reflect the cells in the experiment, in this case immune cells. To use the function the user must process the results object from the enrich_test function using the reactome_prep function.
## this function requires the RP and RPR objects generated previously. We will first modify the RPR object by removing the Immune System nodes from the adjacency table. To do so you must obtain the pathway ID from the REACTOME database and remove rows where its in the first column mRPR<-RPR[-(grep("R-HSA-168256",RPR[,1])),] ## Now we can use the reactome prep function RP_ready<-reactome_prep(enrich_results$Enriched_df,RP=RP,RP_adj=mRPR) head(RP_ready)
The reactome_visualization function must be provided two objects, the RP_ready object and the mRPR objects previously generated. An additional argument path_ordercan also be supplied to set the order of pathways along the x-axis. This argument accepts a vector of pathway names from the rootName column of the RP_ready object and all names must be included. This function returns a list with two ojects, the plotting data frame and a ggplot2 object.
react_vis1<-reactome_visualization(RP_ready,mRPR,RP=RP) head(react_vis1$Plotting_dataframe)
react_vis1$plot_object
## now we will generate the plot with the Pathways in alphabetical order z<-sort(unique(RP_ready$rootName)) z react_vis2<-reactome_visualization(RP_ready,mRPR,RP=RP,path_order=z)
react_vis2$plot_object
The plot generated provides several pieces of information. The dot size gives the percentage of pathways from each pathway tree enriched in the cluster, i.e the bigger the dot the the larger percentage that group of pathways makes of the total enriched pathways. For example, cluster 0 has a total of 23 enriched pathways and roughly 20% of the enriched pathways fall under the 'metabolism of proteins' pathway parent tree. The text above each dot provides the actual number of enriched pathways associated with that dot. Finally, the transparency of the dot provides insight into the percentage of pathways under a parent term that are enriched. For example Cluster 5 has 13 pathways enriched under the 'Adaptaive Immune System' tree wich is ~46% of the pathways under that tree, and the dot is solid; whereas under the "Signal Transduction" tree more pathways are enriched (29), but it is a smaller percentage of the parent tree (~7%) and the dot has a high transparency.
Both the GO_visualization and reactome_visualization functions generate ggplot2 objects which can be easily modified by users familiar with the ggpolt2 package. For users who are less experienced with ggplot2, the resultant plots can be tweaked using a program such as Adobe Illustrator. Each function also provides the data frame used to generate the plot which can easily be exported as a supplemental table for publication.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.