README.md

ENAbrowseR is an R package to search and retrieve reports from the European Nucleotide Archive Browser REST URL.

Installation

library(devtools)
install_github("cstubben/ENAbrowseR")

Taxonomy statistics

ena_taxonomy accepts either a taxonomy ID or name as input and returns the number of records and size of each sequence database (total bases) on the ENA taxonomy page. The report includes direct hits to the taxa (Yersinia pestis) and hits to all descendants (all strains of Yersinia pestis).

ena_taxonomy("Yersinia pestis")
Yersinia pestis, Taxid:632

                  direct  bases subtree subsize
analysis               0      -       0       -
analysis_study         0      -       0       -
assembly              17  80 Mb     282    1 Gb
coding_release     56749  52 Mb  637112  549 Mb
coding_update       2161   1 Mb   40060   36 Mb
noncoding_release   1267 448 Kb   12798    3 Mb
noncoding_update      69  10 Kb     900  343 Kb
read_experiment      254 295 Gb     406  428 Gb
read_run             254 295 Gb     427  428 Gb
read_study            20 295 Gb     104  428 Gb
read_trace             0      -  579465       -
sample               219      -     359       -
sequence_release    2001  94 Mb  189998    1 Gb
sequence_update       13  13 Kb      44   42 Mb
study                 50      -     192       -
tsa_set                0      -       0       -
wgs_set                8      -     244       -

Sequence/result databases

The usage dataset lists further details about the result databases above, columns available for filtering and returnable fields. These tables are also found on the ENA usage page.

data(usage)
usage$results
              Result                                                Description
1           analysis                    Nucleotide sequence analyses from reads
2     analysis_study Nucleotide sequence analyses from reads (grouped by study)
3           assembly                                          Genome assemblies
4     coding_release                         Protein-coding sequences (Release)
5      coding_update                          Protein-coding sequences (Update)
...

subset(usage$columns, grepl("assembly", Result), 1:3)
                   Column                  Type                                Description
1               accession                  text                           accession number
10         assembly_level controlled vocabulary                             assembly level
11          assembly_name                  text genome assembly name for all live versions
12         assembly_title                  text          brief genome assembly description
13             base_count                number                       number of base pairs
51  genome_representation controlled vocabulary   whether this is a full or partial genome
107      sample_accession                  text                    sample accession number
124                strain                  text      strain from which sample was obtained
125       study_accession                  text                     study accession number
127     study_description                  text      detailed sequencing study description
128            study_name                  text                      sequencing study name
129           study_title                  text         brief sequencing study description

usage$fields$assembly
 [1] "accession"             "study_accession"       "sample_accession"      "assembly_name"        
 [5] "assembly_title"        "study_name"            "study_title"           "study_description"    
 [9] "tax_id"                "scientific_name"       "strain"                "base_count"           
[13] "col_tax_id"            "col_scientific_name"   "assembly_level"        "genome_representation"

Search

ena_search returns a tab-delimited report from one of the 18 result databases. A valid search query is required, for example use tax_tree(632) to list all 282 Y. pestis assemblies.

yp <- ena_search("tax_tree(632)", result= "assembly") 
282 rows

table(yp$assembly_level)
     chromosome complete genome          contig        scaffold 
              6              32             205              39 

ena_search will return all 14 fields listed in usage$fields$assembly and drop any empty columns by default. You can also specify columns or change the limit of 10000 records.

ena_search("tax_tree(632)", result= "assembly", fields=c("study_accession", "scientific_name"), limit=10, showURL=TRUE) 
http://www.ebi.ac.uk/ena/data/warehouse/search?query=tax_tree(632)&result=assembly&fields=study_accession,scientific_name&limit=10&display=report
10 rows

       accession study_accession                            scientific_name
1  GCA_000006645        PRJNA288                     Yersinia pestis KIM10+
2  GCA_000007885      PRJNA10638 Yersinia pestis biovar Microtus str. 91001
3  GCA_000009065         PRJNA34                       Yersinia pestis CO92
4  GCA_000013805      PRJNA16646                   Yersinia pestis Nepal516
5  GCA_000013825      PRJNA16645                    Yersinia pestis Antiqua
6  GCA_000016445      PRJNA16700                Yersinia pestis Pestoides F
7  GCA_000018805      PRJNA16067                     Yersinia pestis Angola
8  GCA_000022805      PRJNA36507                    Yersinia pestis D106004
9  GCA_000022825      PRJNA36545                    Yersinia pestis D182038
10 GCA_000022845      PRJNA36547                    Yersinia pestis Z176003

If you need to download more than the maximum 100,000 records, use offset to get the next 100,000 records (and set drop=FALSE to keep all 68 columns for merging). ena_search also has a resultcount option to count the number of results if needed. In this example, the 316,311 metagenome samples are combined into a single data.frame.

ena_taxonomy("metagenomes")
metagenomes, Taxid:408169
                  direct size   subtree subsize
...
sample              2963    -    316311       -

ena_search("tax_tree(408169)", resultcount=TRUE)
[1] "Number of results: 316,311"

m1 <- ena_search("tax_tree(408169)", limit=100000, drop=FALSE)
m2 <- ena_search("tax_tree(408169)", limit=100000, drop=FALSE, offset=100001 )
m3 <- ena_search("tax_tree(408169)", limit=100000, drop=FALSE, offset=200001 )
m4 <- ena_search("tax_tree(408169)", limit=100000, drop=FALSE, offset=300001 )

mg <- rbind(m1, m2, m3, m4)

 table(substr(mg$first_public, 1,4))
   2010   2011   2012   2013   2014   2015   2016 
  63483  12409  11726  28390  64606 114130  21567 

Metagenomes

This query returns metagenome samples published on March 1, 2016. Seventy columns are returned by default (see usage$field$sample) and 40 empty columns are removed.

m1 <- ena_search("tax_tree(408169) AND first_public=2016-03-01", result="sample")
Dropping 40 empty columns
1014 rows

summary_ena returns the number of unique and total annotations in the table.

summary_ena(m1)
                           unique total
accession                    1014  1014
secondary_sample_accession   1014  1014
sample_alias                 1010  1014
description                   260  1014
location                      152   409
collection_date                58   636
isolate                        45    44
depth                          29   104
environment_feature            27   434
country                        20   636
center_name                    19  1014
...

table2 is a wrapper to table and displays the sorted results as a data.frame

table2(m1$country)
                              n
El Salvador                 153
China                       119
Peru: Lima                   55
Pacific Ocean                52
Svalbard: Foxfonna Ice Cap   37
China:Hunan                  35
Czech Republic               35
China: Heilongjiang, Suihua  30
Canada:Kuujjuarapick         26
France:Neuves-Maisons        23

Environmenal biomes, features or material often display the ENVO ID number only like ENVO:01000005, 01000005 or 1000005. Find matching names in the envo dataset using the 8 digit ID.

table2(m1$environment_feature)
                                  n
human gastric mucosa            119
1000005                          52
composting latrine chamber       35
pond                             34
household washing place          29
latrine urine exit               24
PAH-contaminated soil            23
adjacent to latrine compartment  22
animal feces                     18
street-access sewage             16


data(envo)
subset(envo, id=="01000005")
           id      name
1297 01000005 upwelling

Countries are often missing location data to plot on a map, so the geocode function in ggmap may be used to geocode up to 2500 locations per day.

library(ggmap)
z <- table2(m1$country[m1$location==""])
                              n
El Salvador                 153
Peru: Lima                   55
China:Qinghai_tibet_Plateau  16
USA: South Burlington         3

y <- geocode(rownames(z), output="more")
y[,1:5]
        lon       lat            type     loctype                                 address
1 -73.17096  44.46699        locality approximate               south burlington, vt, usa
2 -88.89653  13.79419         country approximate                             el salvador
3 -77.04279 -12.04637        locality approximate                              lima, peru
4  92.00000  33.00000 natural_feature approximate qing zang gaoyuan, golmud, haixi, china

The add_lat_lon function will split location into new lat and lon columns for easier plotting in ggmap, leaflet and other packages and will also add geocoded locations for any places with missing coordinates.

table2(m1$location)
                        n
33.55 N 118.4 W        52
78.08 N 16.07 E        36
26.45 N 111.52 E       35
50.01636 N 14.462833 E 35
46.65 N 126.79 E       30
49.214 N 5.996 E       23


m1 <- add_lat_lon(m1)
Geocoding missing locations for
 China:Qinghai_tibet_Plateau
 El Salvador
 Peru: Lima
 USA: South Burlington

Count the number of samples at each location and plot using ggmap.

x <- aggregate(list(n=m1$lat), m1[, c("lon", "lat")], length)
#x <- group_by(m1, lon, lat) %>% summarize( n=n())     # using dplyr

map <- get_googlemap(c(20,20), zoom=1, size=c(510, 320)  )
ggmap(map) + geom_point(data = x, alpha = .7, aes(x=lon, y=lat, size = n ),color='red') 
  + ggtitle("Metagenome samples on March 1, 2016")

or use leaflet for interactive maps

library(leaflet)
x <- subset(m1, !is.na(lat))
popups <- paste0("acc: ", x$secondary_sample_accession, "<br> name: ", x$scientific_name) 
leaflet(x)  %>% addTiles() %>%  
   addCircleMarkers( x$lon, x$lat,  clusterOptions =
   markerClusterOptions(maxClusterRadius=20), radius=2,  popup= popups)

This final query returns sequencing runs from metagenomes published on March 1, 2016. In most cases, the runs from a sample are released on the same day, but not always.

r1 <- ena_search("tax_tree(408169) AND first_public=2016-03-01", result="read_run")
Dropping 5 empty columns
911 rows

table(r1$sample_accession %in% m1$accession )

FALSE  TRUE 
   37   874 

table2(r1$study_title)
                                                                                                                          n
Changes in the eye microbiota associated with contact lens wearing                                                      325
Secretions from patients infected with bacteria and viruses raw sequence reads                                          176
Forest soil from a mature stand in Sala, Middle Sweden cultivar:uncultured fungus Metagenomic assembly                  143
Rumen Bacteria Raw sequence reads                                                                                        87
Comparison of samples and mock communities amplified with single-base variations in 515-Forward universal primer         60
Bacterial biogeography of a High Arctic ice cap                                                                          37
Changes in rhizosphere bacterial communities induced by the invasive plant Pennisetum setaceum in semiarid environments  30
suan cai in Northeast China Raw sequence reads                                                                           30
salt lake microbial community diversity                                                                                  16
Lejia Lake soil r16S amplicons, raw sequence reads                                                                        4

You can sum the sra or fastq_bytes with sum_bytes (and paired libraries have two files that should be split). The locations of FASTQ files are also included in the table.

sum_bytes(r1$sra_bytes)
[1] "86 GB"

sum_bytes(as.numeric(unlist(strsplit(r1$fastq_bytes, ";"))))
[1] "108 GB"

r1$sra_ftp[1:3]
[1] "ftp.sra.ebi.ac.uk/vol1/fastq/ERR119/000/ERR1198770/ERR1198770.fastq.gz"
[2] "ftp.sra.ebi.ac.uk/vol1/fastq/ERR119/001/ERR1198771/ERR1198771.fastq.gz"
[3] "ftp.sra.ebi.ac.uk/vol1/fastq/ERR119/002/ERR1198772/ERR1198772.fastq.gz"

And finally to plot the number of bases or reads.

hist( log10(r1$base_count ), br=40, col="blue", main="", xlab="Bases (log 10)" )



cstubben/ENAbrowseR documentation built on May 14, 2019, 12:25 p.m.