**SMITE** Vignette

\section*{\centering{Abstract}} \begin{quote} This tutorial exemplifies how \textbf{SMITE} can inegrate the results from gene expression and epigenome wide studies to identify functional modules, subnetworks within a gene interaction network. Here we will take gene expression, DNA methylation and publicly available ChIP-seq results simulated from a large multi-level unpublished study, and we will generate significance-based functional modules. The main aim of our example is to find functional modules that characterize processes related to the infection of HFF cells with \textit{Toxoplasma Gondii} when compared to uninfected controls. In the process, we detail how to set-up \textbf{SMITE}, curate data for input, use \textbf{SMITE} commands, and annotate/visualize modules. \end{quote}

\section{1 Setup Environment} First, we will set our environment parameters and install the SMITE package. The package can be found on Github

options(stringsAsFactors=FALSE)
library(SMITE)

\section{2 Curate Data} Before inputting epigenetic and expression data into the SMITE algorithms, expression, DNA methylation, and annotation tracks need to be in the proper format. Because genomics data can come in many different formats, pre-processing the data is important to establish uniformity in downstream results. Within the package, there is curated, pre-processed data available for testing as well.

Start by loading in a DNA methylation dataset(Chr, Start, Stop, effect size, p-value for each CpG). This data was generated using the HELP-tagging assay and shows the difference of mean 5'-cytosine methylation between triplicate \textit{Toxoplasma} infected and uninfected Human Foreskin Fibroblasts (HFF). The p-value was generated using t-tests (DNA methylation ~ Infection).

To curate the DNA methylation dataset, it is necessary to remove uninformative loci (NAs) and p-values=0 since logarithms of p-values are employed within the code. We replace any p-values=0 with the minimum p-value in the dataset. We also remove CpGs for which the p-value is NA. The curated datasets available through Github have already been processed but were not included in the package because of size restrictions.

## Load methylation data ##
data(methylationdata)
head(methylation)
## Replace zeros with minimum non-zero p-value ##
methylation[,5] <- replace(methylation[,5], methylation[,5]==0, 
                           min(subset(methylation[,5], methylation[,5]!=0), na.rm=TRUE))

## Remove NAs from p-value column ##
methylation <- methylation[-which(is.na(methylation[, 5])), ] 

Before integrating data from multiple datasets, it is \textbf{EXTREMELY} important that you choose a single gene ID platform that you will use for the entire analysis. Any form of gene ID will work as long as every dataset uses it or efforts are made to convert gene IDs. We prefer to use gene symbols to avoid downstream networks cluttered by multiple transcripts of the same gene, and we assign gene symbols to our expression dataset instead of RefSeq IDs using a convenient function. If starting from a different gene ID system, one can also convert ensemble, ensembleprot, and uniprot gene IDs to their respective gene symbols (See Manual). The convertGeneIds function enumerates the combinations of to and from, so if a particular annotation is desired, we can add the functionality, if requested.

## Load fake genes to show expression conversion ##
data(genes_for_conversiontest)
genes[,1] <- convertGeneIds(gene_IDs=genes[,1], ID_type="refseq", ID_convert_to="symbol")

Next, we load in the example expression dataset (genes, effect, p-value) generated via RNA-seq on the same HFF samples. Because of size restrictions, we could not include an unprocessed dataset. We have, however, provided an unprocessed dataset on Github for users to apply the following lines of codes to in order to better see how to pre-process their data. In the curated data, the rownames are gene symbols and the other columns include effect size (log fold change) and p-value from a negative binomial test in DESeq.

After gene id conversion (example shown above), we remove expression data for genes where a gene symbol was not found (NAs). We then take the lowest (most significant) pvalue for genes that had more than 1 entry. Finally, we replace p-values of zeros with the lowest p-value in the dataset.

## This is just an example of how to pre-process data ##
expression <- expression[-which(is.na(expression[, 1])), ]
expression <- split(expression, expression[, 1])
expression <- lapply(expression, function(i){
    if(nrow(as.data.frame(i)) > 1){
        i <- i[which(i[, 3] == min(i[, 3], na.rm=TRUE))[1], ]
        }
    return(i)}) 
expression <- do.call(rbind, expression)
expression <- expression[,-1]
expression[, 2] <- replace(expression[, 2],expression[, 2] == 0, 
                          min(subset(expression[, 2], expression[, 2] != 0), na.rm=TRUE))
## Load expression data ##
data(curated_expressiondata)

## View data ##
head(expression_curated)

Then, we load in gene sequences as well as other data that we wish to include in our analysis. These files must be in bed format with the first three columns as (chr,start,stop). A gene strand and name are also required, but the downstream functions allow users to specify a column for strand and gene name.

## Load hg19 gene annotation BED file ##
data(hg19_genes_bed)

## Load histone modification file ##
data(histone_h3k4me1)

## View files ##
head(hg19_genes)
head(h3k4me1)

\section{3 Integrate Datasets}

In makePvalueAnnotation, we create a GRangesList object where each gene is associated with a promoter region (+/- 1000bp from TSS), gene body region (TSS+1000bp to TES), and putative "enhancers" using H3K4me1 ChIP-seq peaks (+/- 5000bp from TSS). Note: These are the sizes of regions of the genome for which we are interested in calculating scores, but the user can easily alter the parameters. Additionally, the argument otherdata can be an unlimited list of bed files that will be associated with genes and then scored in functional modules. The argument other_d is the distance from the TSS that will be used to associate each otherdata dataset with a gene. If different distances are required for otherdata, then you must provide a distance for each dataset.

test_annotation <- makePvalueAnnotation(data=hg19_genes, other_data=list(h3k4me1=h3k4me1), 
                                        gene_name_col=5, other_tss_distance=5000, 
                                        promoter_upstream_distance=1000, promoter_downstream_distance=1000) 

Having created a PvalueAnnotation, we can now load in the expression and methylation dataset. First we load the expression data. If effect_col and pval_col are not provided, the program will attempt to find them using the column names, but its probably safer to just provide the arguments.

test_annotation <- annotateExpression(pvalue_annotation=test_annotation, expr_data=expression_curated, 
                                      effect_col=1, pval_col=2)

To load in modification data, we have provided a function that can be used multiple times, once for each modifcation data type (e.g. DNA methylation, DNA hydroxymethylation). This is accomplished by using the mod_type character string, which can be any word. Please do not use an underscore or names that are nested in one another (e.g. methylation and methyl) as this will cause erratic behavior when string splitting column names downstream. For each loaded modification, when mod_corr=TRUE (DEFAULT);set as FALSE to reduce computation time) the function will determine a correlation structure, adjust the p-values and combine the p-values using the method specified (here, Stouffer's method (DEFAULT)). In addition, we use a Monte Carlo Method (MCM) of random sampling of the combined scores to determine a FDR like p-value which will used as the p-value/score in downstream analysis. When combining p-values using any method (see companion paper or ?annotateModification for more details), p-values will be combined over the gene promoters (DEFAULT), gene bodies (DEFAULT), and over any provided other datasets. Stouffer's method allows optional weights to be given, which we define here has "distance" (weighting the effect and p-value by distance from the TSS), "pval" (weighting the effect by the signifcance of the effect), or anything other text (unweighted). Note: Depending on the amount of data, this step can take roughly 10 minutes per mod_type.

test_annotation <- annotateModification(test_annotation, methylation, weight_by_method="Stouffer", 
                                      weight_by=c(promoter="distance", body="distance", h3k4me1="distance"),
                                      verbose=TRUE, mod_corr=FALSE, mod_type="methylation")

We can view the loaded data using the following functions:

## See expression data ##
head(extractExpression(test_annotation))

## See the uncombined p-values for a specific or all modType(s) ##
extractModification(test_annotation, mod_type="methylation")

## See the combined p-value data.frame ##
head(extractModSummary(test_annotation))

\section{4 Adjusting Values and Scoring}

Having loaded gene expression and DNA methylation data into the PvalueAnnotation, we now bring all of the data together into an object class called a PvalueObject. The PvalueObject we will create has a slot within the PvalueAnnotation called score_data accessed through slot(PvalueAnnotation, "scoredata") or one of the accessor functions that we have set up like SMITEextractScores. At this step we specify an \textit{a priori} argument that will be used in scoring downstream effect_directions. We must specify an effect_directions element for each modification-context pairing that we wish to score. It should reflect whether you want your modification-context (e.g. DNA methylation in gene promoters) to have a pre-specified relationship with expression so that if the opposite relationship is observed the score will be penalized. Users can specify either "increase", "decrease" and "bidirectional" (for no prespecifed relationship). This information will be stored and will not be used until the SMITEscorePval function is used.

test_annotation <- makePvalueObject(test_annotation, effect_directions=c(methylation_promoter="decrease", 
                                                                         methylation_body="increase", methylation_h3k4me1="bidirectional"))

P-values/scores that are combined and then randomized will have a different range than gene expression, or one another, which can skew the identified modules toward a particular modification-context pairing. Using the plotDensityPval function we view the distribution and then normalize using the normalizePval function if necessary. Using the (DEFAULT) rescale procedure, p-values are logit transformed before rescaling resulting in a shifting of an approximately normal distribution and no effect on the order of p-values within a modification-context pairing.

## Because runSpinglass and scorePval are computationally intensive we simply load already ran data instead of evaluating the code.
data(test_annotation_score_data)
## Plot density of p-values ##
plotDensityPval(pvalue_annotation=test_annotation, ref="expression")
## Normalize the p-values to the range of expression ##
test_annotation <- normalizePval(test_annotation, ref="expression", 
                                 method="rescale")

We can compare p-values for different features using the plotCompareScores function, with the hope that it may reveal interesting patterns within the data.

plotCompareScores(test_annotation, x_name="expression", y_name="methylation_promoter")

Scoring the genes is the final step before finding modules, and after scoring, the high scoring genes can be extracted using the SMITEhighScores function. When using SMITEscorePval, users can provide an optional weighting vector that will allow prioritization of certain modfication-context pairings. We included this feature because we find that a researcher often has a specific goal (e.g. finding DNA methylation that is significantly different at enhancer), and rather than allowing researchers to selectively choose results that support their hypothesis, it may be beneficial to define a numeric quantity that can be reported at the time of publication and reproduced. We provide the SMITEreport function for text dump of all defined parameters.

#score with all four features contributing
test_annotation <- scorePval(test_annotation, weights=c(methylation_promoter=.3, 
                                                        methylation_body=.1,
                                                        expression=.3,
                                                        methylation_h3k4me1=.3))

\section{5 Visualize Modules} Now that we have generated weighted significance values from different modifications that relate to each gene, we can visualize modules using interactome data and module-building packages, such as BioNet. Specifically, following the example of Epimods we use igraph's spinglass algorithm to determine the best modules.

First, we load the desired interactome, in our case from REACTOME and run the Spin-glass algorithm. We can then run a goseq like approach on our modules to annotate them using pathways, like KEGG.

## load REACTOME ## 
 load(system.file("data","Reactome.Symbol.Igraph.rda", package="SMITE"))
## Run Spin-glass ##
test_annotation <- runSpinglass(test_annotation, network=REACTOME, maxsize=50, num_iterations=1000)

## Run goseq on individual modules to determine bias an annotate functions ## 
test_annotation <- runGOseq(test_annotation, supply_cov=TRUE, coverage=read.table(system.file(
    "extdata", "hg19_symbol_hpaii.sites.inbodyand2kbupstream.bed.gz", package="SMITE")),
    type="kegg")

We can use keywords such as "cell cycle" to find within the goseq output which modules we are most interested. We can then use SMITEplotModule to visualize the module.

## search goseq output for keywords ##
searchGOseq(test_annotation, search_string="cell cycle")
## Draw a network ##
plotModule(test_annotation, which_network=6, layout="fr", label_scale=TRUE, compare_plot = FALSE)
## Compare two networks ##
plotModule(test_annotation, which_network=c(4,6), layout="fr", label_scale=TRUE, compare_plot = TRUE)
## Draw a network with goseq analysis ##
plotModule(test_annotation, which_network=1, layout="fr", goseq=TRUE, label_scale=FALSE)


Try the SMITE package in your browser

Any scripts or data that you put into this service are public.

SMITE documentation built on Nov. 8, 2020, 5:14 p.m.