In axelmuller/geometric2: NURSA annotation aid

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  library(geometric2),
  library(kableExtra)
)

Microarray spots are usually labelled in platform/manufacturer specific ways. In order to be useful to NURSA the custom-name for a nucleotide must me matched to a gene symbol. In an ideal scenario a reannotation matrix for a given platform exists and the process is straight forward using the process_fitoutput() function. If no reannotation matrix is available the gpl-soft file initially downloaded with get_input() has to be used. Unfortunately, it appears there is no standard for gpl-soft files and it is necessary to inspect the file and develop a strategy.

This vignette provides a couple of examples that should provide sufficient material to kickstart and inspire scripts in order to obtain the necessary information from any gpl-soft file.

First step

The following command reads in the fitted data:

fitfiles <- get_fitfiles(path2fitfiles)

Annotating fit output file with the help of an reannotation matrix

Next the package GEOmetadb is used to obtain metadata associated with the GSE entry. This step requires a local installation of a recent GEOmetadb.sqlite file.

gse <- "valid_GSE_entry" path2GEOmetadb <- "path/to/GEOmetadb.sqlite" DSinfo <- get_DSinfo(gse, path2GEOmetadb)

The DSinfo dataframe has the following columns:

"GSE", "DatasetType", "Title", "DatasetContrib", "PMID", "SubmissionDate", "LastUpdateDate", "GPL", "Species", "Reannotation"

If the value for Reannotation is not NA, the reannotation matrix can be obtained from Bioconductor

source("https://bioconductor.org/biocLite.R") biocLite(DSinfo$Reannotation)

Once this library is loaded the process_fitoutput() function creates the desired output file that can be submitted to NURSA

library(name_of_reannotion_matrix) process_fitoutput(gse, fitfile = fitfiles[1], name_of_reannotation_matrix)

Annotating fit output file without an reannotation matrix

If no reannotation matrix is available the gpl soft file has to be used to complete this step.

The dataframe DSinfo also specifies the GPL entry. For some GPLs the following function works:

process_fitoutput_with_gpl(gse, fitfile = fitfiles[1], "gpl_entry")

In order for this function to work there has to be a SYMBOL column in the gpl file. Often that is not the case.

In those cases the gpl file needs to be inspected and a strategy to extract the relevant information has to be devised. This is often a customized process.

Exploring a gpl file

GPL files contain a header which contains a lot of meta data.

path2gpl <-list.files(pattern = "GPL1310.soft", recursive = TRUE)[1]
read_lines(path2gpl, skip = 0, n_max = 20)

The last line of the header is:

!platform_table_begin

This is followed by a tab-separated table that contains the information needed to complete the task.

read_lines(path2gpl, skip = 130, n_max = 20)

The files concludes with the line:

!platform_table_end

To transform this file into a tsv-file (tab separated file) linux's grep command can be leveraged. This can be either performed in a terminal or within R using the system() command. Please note, this might not be an option on Windows computers.

The grep command requires a pattern and a path to a file. The output can be redirected from the standard output to a file. Here we modify the grep command with -v which yields everything that does not match. The use of extended regular expressions is enabled with the -E flag.

The path to the soft-file is r path2gpl. The pattern given to the grep command needs to be in quotes, so does the construct given to the system() command. It is important to use single quotes for one and double for the other. This is not necessary when running the command from a terminal. Also, when using the system command the escape character () needs to be escaped. Hence ! turns into \!. In the terminal a single escape character is required.

system('grep -Ev "#|\\!|\\^"  vignettes/GEOtemp/GPL1310.soft > vignettes/GEOtemp/GPL1310.tsv')

The resulting tsv file can now be read in using data.table's fread() command:

gpl <- fread("GEOtemp/GPL1310.tsv")
kable(head(gpl))

This entry links the spot ID to genbank accession codes only. The other columns contain information that is not needed. We select the two relevant columns.

gpl_sel <- gpl[, .(ID, GB_ACC)]
setkey(gpl_sel, GB_ACC)

The Bioconductor package Mus.musculus helps to identify the corresponding gene symbols. Likewise, the packages Homo.sapiens and Rattus.norvegicus can be used if the humans or rats were the target species.

library(Mus.musculus)
GB_SYMBOL <- select(Mus.musculus, keys = gpl$GB_ACC,
                    columns = c("ACCNUM", "SYMBOL"), keytype = "ACCNUM")
GB_SYMBOL <- na.omit(GB_SYMBOL)
setDT(GB_SYMBOL)
setkey(GB_SYMBOL, ACCNUM)

kable(head(GB_SYMBOL))

This data.table can now be joined with the gpl table:

setkey(GB_SYMBOL, ACCNUM)

With keys in both data.tables set to the genbank entry ID the tables can be joined:

gpl_sel <- gpl_sel[GB_SYMBOL]
kable(head(gpl_sel))

Once the key for this data.table has been set to ID it be joined to the fitted data.

The name of the column with the spot ID in the fit file is V1. The necassary commands are hence:

fit <- setDT(fitfiles[1]) setkey(fit, V1) setkey(gpl_sel, ID) annotated_fit <- fit[gpl_sel]

To be coherent with the limma output the first column needs to be renamed to A.

names(anno_fit)[1] <- "A"

Now the file is ready to be submitted to NURSA.

fwrite(anno_fit, "Results/GSE6346_annotated_fit.txt", sep = "\t")

Other GPL formats

In some instances the information for the gene symbol is available in the GPL file but the column name differs from SYMBOL. In these cases the ID and the column holding the SYMBOL data can be selected into a new data.table and joined with the fitted data. The process is analoguous to the one described above.

Sometimes all gene annotation is burried in a single column of the GPL file. This can take the form of: gene bank entry; gene symbol; some description; other information If this is the case the corresponding column has to be split on its separator. In the example given here that would be on the semicolon. data.table provides the tstrsplit for this purpose.

Compile meta data for NURSA

The meta data NURSA requires can be found in the contrast matrix and the sample-review.txt file. The remaining information can be obtained by running the command