Introduction

This is a tutorial for using mgrastr to preprocess MG-RAST data for downstream analyses.

First we need to download the data in a usable format. Taxonomy and functional assignments are stored differently in MG-RAST: taxonomy assignments are summarized and easily accessible as downloadable files for each annotated metagenome, while functional assignment summaries are calculated on the fly and output in different formats. We therefore need slightly different approaches for accessing and reformatting the data for downstream analyses, depending on whether we are interested in taxonomic or functional data. To do this, I have made a little package I call mgrastr.

Install mgrastr

install.packages("devtools")
library(devtools)
install_github("djeppschmidt/mgrastr") # requires R 3.4.4 or later

Load mgrastr into the R environment

library(mgrastr)

MG-RAST taxonomy import to R

As I mentioned previously, the taxonomy data are managed differently than the functional data. There are a number of summary tables that are already compiled and can be called directly from the REST server. But before we do this, we need a list of accession numbers. I made my own list by accessing each metagenome and copying the metagenome ID into an Excel file. The accession ID for each sample should start with "mgm" and end with ".3"; the same ID is used for every level of taxonomy. When we download the data, each accession number will return a list of tables, one for each taxonomic level. But I'm getting ahead of myself. I compiled my list of accession numbers in Excel and imported it as a .csv; you could write yours directly in R to avoid the formatting issues we'll deal with in the next step. As an example, we will download the first ten samples from my GLUSEEN study. The accession numbers are already loaded in the package. Let's look at their format:

#needs to be done yet...

You can replicate this for your own data; you will need a spreadsheet (.csv file) with three columns: the unique sample ID; the unique sequence ID (this differentiates two sequencing runs that were annotated separately for the same sample); and the MG-RAST accession number (e.g. mgm4819267.3).
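
For illustration, a hypothetical DownloadKey.csv might look like this (the column names are my assumption based on the code in this tutorial; all accession numbers except the first are made-up placeholders):

Sample_ID,Seq_ID,MG_ID
GLU001,GLU001_R1,mgm4819267.3
GLU001,GLU001_R2,mgm4819268.3
GLU002,GLU002_R1,mgm4819269.3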

# make list of samples ####
O.key<-read.csv("~/Desktop/PhD/Metagenome/Organism/DownloadKey.csv")
O.key<-data.frame(O.key)
head(O.key)

OK, now we need to make sure that R is reading the key properly. It sometimes imports the data as factors. This doesn't work for us, so we tell it to read them as characters:

l.sID<-as.character(O.key$MG_ID) # make list of accession numbers
names(l.sID)<-O.key$Seq_ID # name list according to unique sample IDs

Now, we prepare the list of IDs for our download function by telling R that they are a list, and naming each of the members (this will be important later on to ensure that we don't mislabel data files):

l.sID<-as.list(l.sID) # convert the character vector from the previous step into a list
names(l.sID)<-O.key$Seq_ID

Now, we download the taxonomy data:

l.taxa<-download(l.sID, auth) # auth is your MG-RAST authentication key

Now we have a large list of lists of tables; e.g. l.taxa[["sample_1"]][["genus.table"]] would be the genus-level table for sample_1.
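
A quick way to inspect this structure (GLU001_R1 is a hypothetical Seq_ID from my example key; use names() to see what your own list contains):

names(l.taxa) # one element per metagenome, named by Seq_ID
names(l.taxa[["GLU001_R1"]]) # one count table per taxonomic level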

Before we aggregate everything together, we have to prepare the metadata (this metadata will also be used later in the functional download workflow):

Note: sample_key should have at least twice as many entries as sam, because it includes an entry for the forward and reverse reads of each run for a particular sample. sam should only include metadata that applies at the sample level or higher. An easy mistake is to include "sample data" for replicate runs or forward/reverse reads in the sam file; this will cause later functions to fail.
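
To make the distinction concrete, here is a hypothetical pair of files (only the Seq_ID and Sample_ID column names come from the code below; everything else is made up):

sample_key, one row per sequencing run:

Seq_ID,Sample_ID
GLU001_R1,GLU001
GLU001_R2,GLU001
GLU002_R1,GLU002

sam, one row per sample:

Sample_ID,Site,pH
GLU001,site_A,6.8
GLU002,site_B,7.1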

sample_key<-read.csv("~/Desktop/PhD/Metagenome/SampleKey.csv")
sample_key<-data.frame(sample_key)
rownames(sample_key)<-sample_key$Seq_ID

sam<-as.data.frame(read.csv("~/Desktop/PhD/Metagenome/GLUSEEN_HM2_metadata.csv"))
rownames(sam)<-sam$Sample_ID

The next step is to aggregate the samples together into a table of taxa, and then condense the replicates (just as we will do in the functional workflow). I do this in two steps:

t.domain<-ag.tax(l.taxa, ag.Domain) # builds the count table
ct.domain<-condense(t.domain, samkey=sample_key, sam=sam) # combines replicate sequencing runs

The condense step is not necessary if you combined your forward/reverse and resequencing fasta files before the annotation run. I did not, so I have to merge the annotations now. I'm not concerned about double counting, because all my runs have both a forward and a reverse read; I'm "double counting" equally across all my samples, and as long as I'm consistent I will get the same pattern as if I had averaged the reads.
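
If you want to sanity-check the merge, you can compare total counts across samples. Since the condensed objects are phyloseq objects (as noted below), a one-liner does it (a minimal sketch, assuming the phyloseq package is installed):

library(phyloseq)
sample_sums(ct.domain) # total counts per sample after condensing replicates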

And, of course, we can run this process at many different taxonomic levels:

t.phylum<-ag.tax(l.taxa, ag.Phylum)
t.class<-ag.tax(l.taxa, ag.Class)
t.order<-ag.tax(l.taxa, ag.Order)
t.family<-ag.tax(l.taxa, ag.Family)
t.genus<-ag.tax(l.taxa, ag.Genus)
t.species<-ag.tax(l.taxa, ag.Species)

ct.phylum<-condense(t.phylum, samkey=sample_key, sam=sam)
ct.class<-condense(t.class, samkey=sample_key, sam=sam)
ct.order<-condense(t.order, samkey=sample_key, sam=sam)
ct.family<-condense(t.family, samkey=sample_key, sam=sam)
ct.genus<-condense(t.genus, samkey=sample_key, sam=sam)
ct.species<-condense(t.species, samkey=sample_key, sam=sam)

The outputs are phyloseq objects with OTU tables and metadata merged together for easy downstream processing. YAY!!
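
Since these are standard phyloseq objects, the usual accessors apply. A quick check (with phyloseq loaded, as above):

nsamples(ct.genus) # should equal the number of rows in sam
ntaxa(ct.genus) # number of genera retained
head(otu_table(ct.genus)) # the merged count table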

R-based functional annotation download

Right now this function only works for the Subsystems ontology. Hopefully I will get to expanding its capabilities soon!

For starters, a central challenge for downloading summarized annotations from MG-RAST is that the counts are aggregated on the fly. This means there is some lag between the request and when the results are posted to the output url. For simplicity of execution (and to reduce computational demand) I've divided the workflow into two basic steps: the first submits the request to MG-RAST, and the second downloads the results from the url supplied by MG-RAST. It can take MG-RAST some time to fill the request, but the url remains available for quite a long time after the request is submitted. So I recommend setting the requests to run overnight, then running the download-and-parse script the next morning. This is important because if you run the download script too early, the data for some samples will not be aggregated yet, which can cause errors.

The first step is to define the input variables. In this case:

x = an accession number or a list of accession numbers (preferably in the mgm1111111.3 format)
auth = your authentication key
ont = the source of annotation; can be one of "Subsystems", "KO", "NOG", or "COG"
parallel = logical

A note on formatting the input: it should be a list, with the unique sample ID as the key (or name) for every member of the list. Here's an example of how I make my input list:

O.key<-read.csv("~/Desktop/PhD/Metagenome/SampleKey.csv")
O.key<-data.frame(O.key)
O.key$Seq_ID<-as.character(O.key$Seq_ID) # unique sample IDs
O.key$Private_ID<-as.character(O.key$MG_ID) # unique accession codes in mgm... .3 format

l.sID<-as.list(as.character(O.key$MG_ID)) # make a named list of accession numbers
names(l.sID)<-O.key$Seq_ID # name the list according to unique sample IDs

OK, now we call the function that starts MG-RAST compiling the data. The remaining arguments are:

level = the level of ontology to query; acceptable values are 1 through 4
E = the e-value cutoff, interpreted as 10^-E, so higher values mean smaller e-values and less random matching (e.g. E=15 corresponds to a cutoff of 1e-15); the MG-RAST default is 5, and I usually use at least 10
length = the minimum length of match; the MG-RAST default is 15
id = the minimum percent of bases matching; the MG-RAST default is 60

THIS FUNCTION HAS NO DEFAULT VALUES; YOU MUST PICK YOUR OWN

url<-loadFunc(l.sID, auth="<your MG-RAST authentication key>", ont="Subsystems", level=3, E=15, length=15, id=60, parallel=FALSE)
head(url)

This script will produce a list of urls that will eventually be populated with the annotation data. It should submit the requests relatively quickly; if you want, you can wait 15-20 minutes to make sure the script completes without errors and then shut things down for the evening. The long wait is on the server end as it quietly compiles the results. They may not be ready for 1-20 hours, depending on how large your files are and how many requests you submit at once (i.e., how long your list of accession numbers is). Most files will be ready in under an hour, but it's best to be patient and let the server finish.
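
If you'd rather not wait blindly, you can peek at one of the urls from R before running the full download (a minimal sketch in base R; GLU001_R1 is the hypothetical sample name also used in the troubleshooting section below):

chk<-paste(readLines(url$GLU001_R1, warn=FALSE), collapse="") # fetch the raw response for one request
grepl("ERROR", chk) # TRUE means MG-RAST returned an error message instead of data (see Troubleshooting below)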

To download the data, we simply execute the following:

x = the list of urls
ont = the ontology that was requested; this tells the script what format to expect the data in (each ontology has its own output format)
level = the ontology level of interest; must be one of 1, 2, 3, or 4. Level 4 is the highest granularity and my preferred level for interpretation: you can always decrease resolution later, but if you start low it's hard to increase.

dt<-downloadFunc(url, ont="Subsystems", level=4)

The output of this function is a table with samples as columns and functions as rows, populated with counts. From here, we condense replicate sequencing runs and forward/reverse reads, just like in the taxonomy workflow.

dt<-condense(dt, samkey=sample_key, sam=sam) # same metadata objects as in the taxonomy workflow

Now we've got our functional count table, but the annotations aren't very interpretable. You can check to see:

head(tax_table(dt))

The accession numbers don't mean much on their own. To make a table with the annotations at each of the four Subsystems ontology levels, we can do:

f.tax<-DFTax(url)

Now we've got a table with the annotations at each ontological resolution. So let's combine it with the count data:

tax_table(dt)<-f.tax
head(tax_table(dt))

OK, we're finished and ready for analysis. YAY!!

Troubleshooting:

If the url list is NULL: Often the authentication key fails even if it is technically 100% correct. I don't know why this happens. I get around it by going into MG-RAST and copy/pasting my authentication key back into R. This usually works.

If the condense() function gives an empty table or an error: Make sure that url contains a url for each of the files you submitted to MG-RAST. All requests must load for this pipeline to work. If you accidentally have duplicate accession numbers, or have a typo in one, the downstream steps will fail. Step one is to check that your input files are formatted properly and that the file names and accession numbers are all unique.

If you are getting an output table that doesn't include all your samples, pick one that is missing and copy the url for its request into a web browser. For example, access the url for a request from the R script by:

url$GLU001_R1

If the webpage returns a still-processing response rather than data, then MG-RAST is not done compiling the data; wait and try again later.

If the webpage contains the following, then you need to lower the QC parameters for matching annotations:

{"ERROR":"no data returned for the given parameters, try again with lower cutoffs (evalue, identity, length)"}
