Before starting, the user should choose a working directory, preferably a directory devoted exclusively for this tutorial. After starting an R session, change working directory, load the requisite packages and set standard options:
# If necessary, change the path below to the directory where the data files are stored. # "." means current directory. On Windows use a forward slash / instead of the usual \. workingDir <- "." setwd(workingDir) # Load packages librarian::shelf(magrittr, tidyverse, data.table, glue, WGCNA, BiocParallel) library(WGCNA) # The following setting is important, do not omit. options(stringsAsFactors = FALSE) knitr::opts_chunk$set(fig.align = "center")
In this section we illustrate simulation of expression data with a simple module structure.
We choose the basic parameters of the simulated data set: we will simulate 3000 genes in 50 samples. The genes will fall into five proper modules (labeled turquoise, blue, brown, green, and yellow) and a relatively large number of genes will be simulated outside of the proper modules (“grey” genes).
# Here are input parameters of the simulation model # number of samples or microarrays in the training data no.obs <- 50 # now we specify the true measures of eigengene significance # recall that ESturquoise=cor(y,MEturquoise) ESturquoise <- 0 ESbrown <- -.6 ESgreen <- .6 ESyellow <- 0 # Note that we don’t specify the eigengene significance of the blue module # since it is highly correlated with the turquoise module. ESvector <- c(ESturquoise, ESbrown, ESgreen, ESyellow) # number of genes nGenes1 <- 3000 # proportion of genes in the turquoise, blue, brown, green, and yellow module #respectively. simulateProportions1 <- c(0.2, 0.15, 0.08, 0.06, 0.04) # Note that the proportions don’t add up to 1. The remaining genes will be colored grey, # ie the grey genes are non-module genes. # set the seed of the random number generator. As a homework exercise change this seed. set.seed(1) # Step 1: simulate a module eigengene network. # Training Data Set I MEgreen <- rnorm(no.obs) scaledy <- MEgreen * ESgreen + sqrt(1 - ESgreen^2) * rnorm(no.obs) y <- ifelse(scaledy > median(scaledy), 2, 1) MEturquoise <- ESturquoise * scaledy + sqrt(1 - ESturquoise^2) * rnorm(no.obs) # we simulate a strong dependence between MEblue and MEturquoise MEblue <- .6 * MEturquoise + sqrt(1 - .6^2) * rnorm(no.obs) MEbrown <- ESbrown * scaledy + sqrt(1 - ESbrown^2) * rnorm(no.obs) MEyellow <- ESyellow * scaledy + sqrt(1 - ESyellow^2) * rnorm(no.obs) ModuleEigengeneNetwork1 <- data.frame(y, MEturquoise, MEblue, MEbrown, MEgreen, MEyellow)
The variable ModuleEigengeneNetwork1
contains the "seed" eigengenes and a simulated clinical trait y
. The eigengene network can be simply inspected by:
signif(cor(ModuleEigengeneNetwork1, use="p"),2)
The package contains a convenient function to simulate five modules which we call below:
dat1 <- simulateDatExpr5Modules( MEturquoise = ModuleEigengeneNetwork1$MEturquoise, MEblue = ModuleEigengeneNetwork1$MEblue, MEbrown = ModuleEigengeneNetwork1$MEbrown, MEyellow = ModuleEigengeneNetwork1$MEyellow, MEgreen = ModuleEigengeneNetwork1$MEgreen, nGenes = nGenes1, simulateProportions = simulateProportions1 )
The simulated data (dat1
) is a list with the following components:
names(dat1)
We attach the data “into the main search path” so we can use the component names directly, without referring to dat1
:
datExpr <- dat1$datExpr truemodule <- dat1$truemodule datME <- dat1$datME attach(ModuleEigengeneNetwork1)
To see what is in the data, we can simply type
table(truemodule) dim(datExpr)
with the result
dim(datExpr)
table(truemodule)
The output indicated we simulated 3000 genes in 50 samples, and a large number of the genes are “grey”, i.e., genes that are not part of any proper module. the next piece of code assigns gene and sample names to columns and rows of the expression data
datExpr <- data.frame(datExpr) ArrayName <- paste("Sample", 1:dim(datExpr)[[1]], sep = "") # The following code is useful for outputting the simulated data GeneName <- paste("Gene", 1:dim(datExpr)[[2]], sep = "") dimnames(datExpr)[[1]] <- ArrayName dimnames(datExpr)[[2]] <- GeneName
In this section we illustrate loading of expression data and clinical trait. The files holding the information are provided on the main tutorial page.
datGeneSummary <- read.csv("../data/GeneSummaryTutorial.csv") datTraits <- read.csv("../data/TraitsTutorial.csv") datMicroarrays <- read.csv("../data/MicroarrayDataTutorial.csv")
A quick look at the content of datMicroarrays
:
datMicroarrays[1:5,1:5]
We now reformat the data and set appropriate gene and sample names:
# This vector contains the microarray sample names ArrayName <- names(data.frame(datMicroarrays[, -1])) # This vector contains the gene names GeneName <- datMicroarrays$GeneName # We transpose the data so that the rows correspond to samples and the columns correspond to genes # Since the first column contains the gene names, we exclude it. datExpr <- data.frame(t(datMicroarrays[, -1])) names(datExpr) <- datMicroarrays[, 1] dimnames(datExpr)[[1]] <- names(data.frame(datMicroarrays[, -1])) # Also, since we simulated the data, we know the true module color: truemodule <- datGeneSummary$truemodule rm(datMicroarrays)
The first few entries in datExpr
are now
datExpr[1:5,1:5]
The input gene expression data should have the above format where column names correspond to gene (or probe) names, row names correspond to array names. We now isolate the microarray trait y
from the read data
# First, make sure that the array names in the file datTraits line up with those in the microarray data table(dimnames(datExpr)[[1]] == datTraits$ArrayName) # Next, define the microarray sample trait y <- datTraits$Trait.y
Because the loaded data are identical to the simulated ones, we do not need to save the results here. Subsequent sections of the tutorial use the results of the data simulation section (Section 1).
In this section we illustrate basic data cleaning and pre-processing steps for expression data.
We start by determining the mean expression per array and the number of missing values per array:
meanExpressionByArray <- apply(datExpr, 1, mean, na.rm = T) NumberMissingByArray <- apply(is.na(data.frame(datExpr)), 1, sum)
A simple way to examine the mean expression per array is to use ```r"} tibble(sample = 1:length(meanExpressionByArray), meanexpression = meanExpressionByArray) %>% ggplot(aes(x = as.factor(sample), y = meanexpression)) + geom_col() + labs(x = "Sample", y = "Mean expression", title = "Mean expression across samples") + theme(panel.background = element_rect(fill = NA, color = 'black'), axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
whose output is shown in Figure 3.1\ref{fig3.1}. No arrays in the plot seem to have an outlying mean expression value. The numbers of missing entries in each array are ```r NumberMissingByArray
We note that arrays with excessive numbers of missing data should be removed, for example as
# Keep only arrays containing less than 500 missing entries KeepArray <- NumberMissingByArray < 500 table(KeepArray) datExpr <- datExpr[KeepArray, ] y <- y[KeepArray] ArrayName[KeepArray]
Here we count the number of missing samples in each probe profile, and remove probes with extensive numbers of missing samples. In addition, we remove probes that do not vary at all.
NumberMissingByGene <- apply(is.na(data.frame(datExpr)), 2, sum) # One could do a barplot(NumberMissingByGene), but the barplot is empty in this case. # It may be better to look at the numbers of missing samples using the summary method: summary(NumberMissingByGene) # Calculate the variances of the probes and the number of present entries variancedatExpr <- as.vector(apply(as.matrix(datExpr), 2, var, na.rm = T)) no.presentdatExpr <- as.vector(apply(!is.na(as.matrix(datExpr)), 2, sum)) # Another way of summarizing the number of pressent entries table(no.presentdatExpr) # Keep only genes whose variance is non-zero and have at least 4 present entries KeepGenes <- variancedatExpr > 0 & no.presentdatExpr >= 4 table(KeepGenes) datExpr <- datExpr[, KeepGenes] GeneName <- GeneName[KeepGenes]
In this case, since the data is simulated without missing data or zero-variance probes, all probes are retained.
We use hierarchical clustering with the Euclidean distance to determine whether there are array (sample) outliers:
r"}
sizeGrWindow(9, 5)
plotClusterTreeSamples(datExpr = datExpr, y = y)
The output is shown in Figure 3.2\ref{fig3.2}. There are no no obvious array outliers. If there are suscpicious samples, they should be removed from the analysis. In the figure, the microarray samples are colored by the outcome y
(1=black, 2=red). Since the colors dont line up with the branches, we find no evidence that samples with y
= 1 are "globally distinct" from those with y
= 2. When clustering microarray samples, we recommend to use the Euclidean distance. In contrast, we recommend to use the topological overlap based dissimilarity for clustering genes. Below, we investigate different methods for clustering genes (gene expression profiles).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.