knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  library(kableExtra),
  library(geometric2)
)

Before reading on, I recommend to go through the tutorial designed by the authors of the limma package. It's hosted on F1000Research.

Getting the data

This process is difficult to stream line. In contrast to microarrays the series matrix files don't contain any experimental data. Ideally normalized data is also submitted to GEO and this can be used instead of a series matrix file. There don't appear to be any conventions on file format or anything else. Generally, the relevant files are listed towards the bottom of the GSE entry website.

Supplementary data files for a GSE entry, the relevant files are highlighted

In this vignette, the GSE81889_Normalized_Ctrl.vs.IFNa.txt file is processed. Note the file needs to be unzipped first.

In contrast to the microarray experience no sample_review.txt file is created here. The relevant meta data is extracted manually. The get_DSinfo() command can still be run, see the annotation vignette for details.

Reading the data

The format of these normalized files is not standardized. Hence, it's necessary to have a look at the file first before proceeding:

countdata_lines <- read_lines("GSE81889_Normalized_Ctrl.vs.IFNa.txt")
kable(head(countdata_lines))

This quick look reveals that the first two lines are headers and the data starts in line 3. R does not support multi-line headers. In the next step the first two lines are concatonated to yield a single header line, The data is read into a table starting from the third line. Lastly the headers are added to the table.

kable(header_lines <- read_lines("GSE81889_Normalized_Ctrl.vs.IFNa.txt", n_max = 2))
split_header_lines <- str_split(header_lines, pattern = "\t")
head1 <- unlist(split_header_lines[1])
kable(head1)

This yields the following two vectors:

head2 <- unlist(split_header_lines[2])
kable(head2)

Combining those makes suitable column names:

kable(headers <- c(head2[2:3], head1[4:9]))
kable(headers)
countdata <- read.table("GSE81889_Normalized_Ctrl.vs.IFNa.txt", 
                        skip = 2, header = FALSE, row.names = 1)
colnames(countdata) <- headers[1:8]
kable(head(countdata))

Creating a matrix

In the previous steps we created a r class(countdata). For use is limma it needs to be converted into a matrix. A crucial difference between the two classes is that data frames accept different types of data whereas matrices do not. In order to maintain the numeric nature of the counts these columns have to be separated.

genes <- data.frame(row.names(countdata), countdata$Symbol)
colnames(genes) <- c("ID", "Symbol")
kable(head(genes))
countdata <- as.matrix(countdata[, -c(1,2)])
kable(head(countdata))

Transforming the data into a DGE List

The digital gene expression class, DGE, is a list based S4 class of the edgeR package for storing read counts and associated information.

x <- DGEList(counts = countdata, remove.zeros = TRUE)

To reduce the volume of genes that are analyzed all genes that are not expressed in either of the experiments are removed. Often when doing this analysis genes that are under a certain threshold are removed too. In this analysis we obstain from doing so.

class(x)
dim(x)

Adding the gene symbols to the DGEList

The genes from the genes data frame that correspond are expressed need to be added the DGEList object

setDT(genes)
genes <- genes[ID %in% rownames(x)]
x$genes <- genes$Symbol

More than half of the nucleotides are not expressed and can be removed.

Creating the design

This experiment has 3 controls and 3 experiments. The colums are labelled r colnames(x) for the groups the numerical identifiers need to be removed. This can be done manually or by using regular expressions

group <- as.factor(str_remove_all(colnames(x), "-?\\d"))
kable(group)

Having determined the groups we can now create the design:

design <- model.matrix(~0+group)
colnames(design) <- str_remove(colnames(design), "group")

kable(design)

Quality Control

For the quality controls we need to compute the counts per million values. The code for the next two plots is from the tutorial mentioned above. Please refer to this tutorial for further explanations.

cpm <- cpm(x)
lcpm <- cpm(x, log = TRUE)
nsamples <- ncol(x)
col <- brewer.pal(nsamples, "Paired")
par(mfrow=c(1,2))
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.21), las=2,
     main="", xlab="")
title(main="A. Raw data", xlab="Log-cpm")
abline(v=0, lty=3)
for (i in 2:nsamples){
  den <- density(lcpm[,i])
  lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", colnames(x), text.col=col, bty="n")
lcpm <- cpm(x, log=TRUE)
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.21), las=2,
     main="", xlab="")
title(main="B. Filtered data", xlab="Log-cpm")
abline(v=0, lty=3)
for (i in 2:nsamples){
  den <- density(lcpm[,i])
  lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", colnames(x), text.col=col, bty="n")
x2 <- x
x2$samples$norm.factors <- 1
x2$counts[,1] <- ceiling(x2$counts[,1]*0.05)
x2$counts[,2] <- x2$counts[,2]*5

par(mfrow=c(1,2))
lcpm <- cpm(x2, log=TRUE)
boxplot(lcpm, las=2, col=col, main="")
title(main="A. Example: Unnormalised data",ylab="Log-cpm")
x2 <- calcNormFactors(x2)
x2$samples$norm.factors

lcpm <- cpm(x2, log=TRUE)
boxplot(lcpm, las=2, col=col, main="")
title(main="B. Example: Normalised data",ylab="Log-cpm")

lcpm <- cpm(x, log=TRUE)
par(mfrow=c(1,2))
col.group <- group
levels(col.group) <- brewer.pal(nlevels(col.group), "Set1")
col.group <- as.character(col.group)
#col.lane <- lane
#levels(col.lane) <- brewer.pal(nlevels(col.lane), "Set2")
#col.lane <- as.character(col.lane)
plotMDS(lcpm, labels=group, col=col.group)
title(main="A. Sample groups")

The data is already normalized. It appears that there is some higher variance in the controls. But since the authors included all the data we will also keep the data as is.

Contrasts

In this example there is only a single contrast, control vs ifnA. The information regarding the length of exposure was obtained from the GSE entry. Sometimes, however, it is necessary to consult the associated publication.

contr.matrix <- limma::makeContrasts(
  "ifnA vs | 24h" = "Control-ifnA",
  levels = colnames(design)
)
kable(contr.matrix)

Fitting the data

v <- voom(x, design, plot=TRUE)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts = contr.matrix)
efit <- eBayes(vfit)

Writing the fitfile

write.fit(efit, results=NULL, file="RNAseq_efit.txt", digits=10, adjust="fdr", method="separate",F.adjust="none", sep="\t")

The resulting file is ready for submission to NURSA.



axelmuller/geometric2 documentation built on May 25, 2019, 6:26 p.m.