knitr::opts_chunk$set( collapse = TRUE, comment = "#>", library(kableExtra), library(geometric2) )
Before reading on, I recommend to go through the tutorial designed by the authors of the limma package. It's hosted on F1000Research.
This process is difficult to stream line. In contrast to microarrays the series matrix files don't contain any experimental data. Ideally normalized data is also submitted to GEO and this can be used instead of a series matrix file. There don't appear to be any conventions on file format or anything else. Generally, the relevant files are listed towards the bottom of the GSE entry website.
In this vignette, the GSE81889_Normalized_Ctrl.vs.IFNa.txt file is processed. Note the file needs to be unzipped first.
In contrast to the microarray experience no sample_review.txt file is created here. The relevant meta data is extracted manually. The get_DSinfo() command can still be run, see the annotation vignette for details.
The format of these normalized files is not standardized. Hence, it's necessary to have a look at the file first before proceeding:
countdata_lines <- read_lines("GSE81889_Normalized_Ctrl.vs.IFNa.txt") kable(head(countdata_lines))
This quick look reveals that the first two lines are headers and the data starts in line 3. R does not support multi-line headers. In the next step the first two lines are concatonated to yield a single header line, The data is read into a table starting from the third line. Lastly the headers are added to the table.
kable(header_lines <- read_lines("GSE81889_Normalized_Ctrl.vs.IFNa.txt", n_max = 2)) split_header_lines <- str_split(header_lines, pattern = "\t") head1 <- unlist(split_header_lines[1]) kable(head1)
This yields the following two vectors:
head2 <- unlist(split_header_lines[2]) kable(head2)
Combining those makes suitable column names:
kable(headers <- c(head2[2:3], head1[4:9])) kable(headers)
countdata <- read.table("GSE81889_Normalized_Ctrl.vs.IFNa.txt", skip = 2, header = FALSE, row.names = 1) colnames(countdata) <- headers[1:8] kable(head(countdata))
In the previous steps we created a r class(countdata)
. For use is limma it
needs to be converted into a matrix. A crucial difference between the two classes
is that data frames accept different types of data whereas matrices do not.
In order to maintain the numeric nature of the counts these columns have to be separated.
genes <- data.frame(row.names(countdata), countdata$Symbol) colnames(genes) <- c("ID", "Symbol") kable(head(genes))
countdata <- as.matrix(countdata[, -c(1,2)]) kable(head(countdata))
The digital gene expression class, DGE, is a list based S4 class of the edgeR package for storing read counts and associated information.
x <- DGEList(counts = countdata, remove.zeros = TRUE)
To reduce the volume of genes that are analyzed all genes that are not expressed in either of the experiments are removed. Often when doing this analysis genes that are under a certain threshold are removed too. In this analysis we obstain from doing so.
class(x)
dim(x)
The genes from the genes data frame that correspond are expressed need to be added the DGEList object
setDT(genes) genes <- genes[ID %in% rownames(x)] x$genes <- genes$Symbol
More than half of the nucleotides are not expressed and can be removed.
This experiment has 3 controls and 3 experiments. The colums are labelled
r colnames(x)
for the groups the numerical identifiers need to be removed.
This can be done manually or by using regular expressions
group <- as.factor(str_remove_all(colnames(x), "-?\\d")) kable(group)
Having determined the groups we can now create the design:
design <- model.matrix(~0+group) colnames(design) <- str_remove(colnames(design), "group") kable(design)
For the quality controls we need to compute the counts per million values. The code for the next two plots is from the tutorial mentioned above. Please refer to this tutorial for further explanations.
cpm <- cpm(x) lcpm <- cpm(x, log = TRUE)
nsamples <- ncol(x) col <- brewer.pal(nsamples, "Paired") par(mfrow=c(1,2)) plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.21), las=2, main="", xlab="") title(main="A. Raw data", xlab="Log-cpm") abline(v=0, lty=3) for (i in 2:nsamples){ den <- density(lcpm[,i]) lines(den$x, den$y, col=col[i], lwd=2) } legend("topright", colnames(x), text.col=col, bty="n") lcpm <- cpm(x, log=TRUE) plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.21), las=2, main="", xlab="") title(main="B. Filtered data", xlab="Log-cpm") abline(v=0, lty=3) for (i in 2:nsamples){ den <- density(lcpm[,i]) lines(den$x, den$y, col=col[i], lwd=2) } legend("topright", colnames(x), text.col=col, bty="n")
x2 <- x x2$samples$norm.factors <- 1 x2$counts[,1] <- ceiling(x2$counts[,1]*0.05) x2$counts[,2] <- x2$counts[,2]*5 par(mfrow=c(1,2)) lcpm <- cpm(x2, log=TRUE) boxplot(lcpm, las=2, col=col, main="") title(main="A. Example: Unnormalised data",ylab="Log-cpm") x2 <- calcNormFactors(x2) x2$samples$norm.factors lcpm <- cpm(x2, log=TRUE) boxplot(lcpm, las=2, col=col, main="") title(main="B. Example: Normalised data",ylab="Log-cpm") lcpm <- cpm(x, log=TRUE) par(mfrow=c(1,2)) col.group <- group levels(col.group) <- brewer.pal(nlevels(col.group), "Set1") col.group <- as.character(col.group) #col.lane <- lane #levels(col.lane) <- brewer.pal(nlevels(col.lane), "Set2") #col.lane <- as.character(col.lane) plotMDS(lcpm, labels=group, col=col.group) title(main="A. Sample groups")
The data is already normalized. It appears that there is some higher variance in the controls. But since the authors included all the data we will also keep the data as is.
In this example there is only a single contrast, control vs ifnA. The information regarding the length of exposure was obtained from the GSE entry. Sometimes, however, it is necessary to consult the associated publication.
contr.matrix <- limma::makeContrasts( "ifnA vs | 24h" = "Control-ifnA", levels = colnames(design) ) kable(contr.matrix)
v <- voom(x, design, plot=TRUE)
vfit <- lmFit(v, design) vfit <- contrasts.fit(vfit, contrasts = contr.matrix) efit <- eBayes(vfit)
write.fit(efit, results=NULL, file="RNAseq_efit.txt", digits=10, adjust="fdr", method="separate",F.adjust="none", sep="\t")
The resulting file is ready for submission to NURSA.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.