OmicsMetaData: an R-package for interoperable and re-usable biodiversity 'omics data

General Overview

The OmicsMetaData package was developed as a set of tools to download, standardize and handle 'omics datasets, and more particulary the non-nucleotide sequence metadata components. The aim is to help improve data interoperability and re-use in biodiversity 'omics fields, such as metabarcoding (e.g. eDNA, amplicon sequencing of microbes) and metagenomics (e.g. shotgun meta genome sequencing). More specifically, this package offers the users tools to help them integrate online open access data and metadata into their analyses, and standardize their de novo data and metadata following the Minimum Information about any (x) sequence (MIxS) or the DarwinCore (DwC) standards to use it in integrated analyses with other data, or archive it on online nucleotide sequence data repositories (e.g. ENA, SRA, GenBank,..) in accordance to the FAIR principles.

Downloading data

The main objective and focus of the OmicsMetaData package is re-use of 'omics data, as part of the phylosophy of the FAIR principles for data (FAIR standing for Findable Accessible Interoperable and Re-usable data). The International Nucleotide Sequence Database Consortium (INSDC), which unites NCBI, ENA-EMBL and DDBJ, houses the vast majority of the world's publicly available DNA sequence data, including marker gene, genome, amplicon/metabarcode, and shotgun metagenome data from past scientific projects all over the world. The OmicsMetaData packages contains several functions that can help with downloading data (sequences and metadata) from INSDC. See a detailed example in the Retrieving_online_data vignette. Note, for some functions, an NCBI API-key will be necessary. This can be created after registering an account at NCBI. More information on the API client of NCBI can be found here: https://www.ncbi.nlm.nih.gov/home/develop/api/

metadata_of_seq <- download.sequences.INSDC(BioPrj, destination.path = "_path_", apiKey, unzip = FALSE, keep.metadata = TRUE, download.sequences = TRUE)
metadata_of_seq <- get.BioProject.metadata.INSDC(BioPrjct, just.names=FALSE)
metadata_of_seq <- get.sample.attributes.INSDC(BioPrjct, apiKey)

Standardizing data

Another important objective that is embedded in the FAIR principles is data interoperability, which can only be achieved by standardizing datasets over many users. Therfore, this package offers a suite of tools to standardize datasets using the MIxS (Yilmaz et al. 2011) and DarwinCore (Darwin Core Task Group. 2009) data standards. For details on these standards and their implementation, see the Background vignette. More information on MIxS, see https://gensc.org/mixs/ and https://github.com/GenomicsStandardsConsortium/mixs. For more information on DarwinCore, see https://dwc.tdwg.org and https://github.com/tdwg/dwc. For detailed examples on standardizing metadata, see the Metadata_standardization vignette.

term.definition("_MIxS_term_")
dataQC.TermsCheck(observed=c("_list_of_terms_"), exp.standard="MIxS", exp.section=NA, fuzzy.match=TRUE, out.type="full")
combine.data(d1=MIxS.metadata_1, d2=MIxS.metadata_2, fill=NA, variables.as.cols=TRUE)
combine.data.frame(df1, df2, fill=NA, merge.cols=TRUE, original_rowName.col=TRUE, merge.rows="df1")
formated_dataset <- eMoF.to.wideTable(dataset)
dataset <- wideTable.to.eMoF(dataset=formated_dataset)

There are functions to ...

... identify a date column in a dataset and format it as ISO 8601 (YYYY-MM-DD)

dataset_QC <- dataQC.dateCheck(dataset, date.colnames=c("collectionDate"))  

... identify one or more columns that have the geographic coordinates (longitude and/or latitude), and format them to decimal degrees (with automatic detection of original format, including NSWE notations and degree-minute-second symbols). Conversion can also be done for a single value with the coordinate.to.decimal (which is the work-horse of the dataQC.LatitudeLongitudeCheck function).

dataset_QC <- dataQC.LatitudeLongitudeCheck(dataset, latlon.colnames=c("lat_lon"))
coordinate.to.decimal("66° 34′") # will return 66.56667

in case of standardized coordinates, it can be converted to a WKT notation with the dataQC.generate.footprintWKT function.

dataset_QC <- dataQC.generate.footprintWKT(dataset, NA.val=NA)

... correct trailing spaces or common typos in a list taxonomic names with the dataQC.taxaNames function.

dataQC.taxaNames(taxaNames)

... complete a list of taxonomic names by looking-up missing information on the World Registry of Marine Species (WoRMS, Horton et al. 2021), using the WoRMS API and the worrms R-package (Chamberlain, 2020). More information here: https://www.marinespecies.org/aphia.php?p=webservice&type=r

dataQC.completeTaxaNamesFromRegistery("Aulacoseira", taxBackbone="worms")

# will return:
#ScientificName | scientificNameID  
#Aulacoseira    | urn:lsid:marinespecies.org:taxname:148959 
#
#aphID   | kingdom    | phylum      | class
#148959  | Chromista  | Ochrophyta  | Bacillariophyceae
#
#order           | family           | genus 
#Aulacoseirales  | Aulacoseiraceae  | Aulacoseira            
#
#specificEpithet  | scientificNameAuthorship  | namePublishedInYear
#<NA>             | G.H.K. Thwaites           | 1848

data that can serve as input for the dataQC.completeTaxaNamesFromRegistery function can also be extacted from a dataset using the dataQC.TaxonListFromData function.

taxList <- dataQC.TaxonListFromData(dataset)
dataset.MIxS <- dataQC.MIxS(dataset, ask.input=TRUE, sample.names = "_sample_name_column_")

in case of DarwinCore, core (event or occurrence) and extension (eMoF) files can be handled separately using the dataQC.DwC_general function, with DwC.type parameter specifying the type of the input. These data.frames can also be compared at once, returning a darwincore archive style object, using the DataQC.DwC. At present, the core and extension files are limited to event and occurrence core, and eMoF as extension.

dataset.DwC <- dataQC.DwC_general(dataset, DwC.type = "event", ask.input = TRUE, complete.data=TRUE)

dataset.DwC.archive <- DataQC.DwC(Event=dataset_event, eMoF=dataset_emof, EML.url="_url_to_EML_XML_file_", out.type="event", ask.input=TRUE))

Data objects

With 'omics datasets, it is no longer possible to display all data in a sample x variable table. There are additional dimensions to the data that also need to be taken into account, and the most obvious of these extra dimensions are the sequence files. But there is also other information, often referred to as metadata that are needed to contextualize the dataset. These include units of variables, project information, applied lab and sequencing protocols, links to the location of the sequence data, associated publications,...

To be able to work with these dimensions of data, the OmicsMetaData package created new S4 class objects. Slots of these objects can be accessed using an "@" instead of "$". The MIxS.metadata type object has slots for a.o. data (values), units, the MIxS environmental package name and the section of MIxS where each variable comes from. A DwC.event type object has slots for an event core and extensions, while a DwC.occurrence has an occurrence file at it's core. The validity of these objects can be checked with respectively check.valid.metadata.MIxS and check.valid.metadata.DwC. A MIxS.metadata object can also be written to a CSV file with the write.MIxS method

Archiving data

Archiving data, and getting in the right format and standard can be very challenging to say the least. Depending on the platform or database where the sequence data and metadata is archived, different requirements can be set and subtle variations to data standards implemented.

Therefore the OmicsMetaData package has a suite of function to format (meta-)data that can be uploaded directly to the European Nucleotide Archive (ENA, part of INSDC). Details on archiving data are set out in the Perpare_data_for_archiving vignette.

seq_files <- FileNames.to.Table("_path_to_seqFiles_")

These sequence files can also be compared to the sample names in a metadata file to ensure there are no discrepancies.

sync.metadata.sequenceFiles(row.names(main_metadata.MIxS@data), file.dir="_path_to_seqFiles_")

If necessary, sequence files can be renamed in bulk first, based on a dataframe with a column of old names and a column of corresponding new names.

names_dic <- data.frame(seq_facility_name = c("AA1234", "BB2345"), original_name = c("sampleLocation_year_1", "sampleLocation_year_2"))
renameSequenceFiles(names_dic, colNameOld="seq_facility_name", colNameNew="original_name", file.dir= "_path_to_seqFiles_")
prep.metadata.ENA(metadata=a_metadata.MIxS_object, dest.dir=get.wd(), file.name="_a_file_name_")

Citation

Yilmaz, P., Gilbert, J., Knight, R. et al. The genomic standards consortium: bringing standards to life for microbial ecology. ISME J 5, 1565–1567 (2011). https://doi.org/10.1038/ismej.2011.39

Darwin Core Task Group. 2009. Darwin Core. Biodiversity Information Standards (TDWG) http://www.tdwg.org/standards/450

Chamberlain, S. (2020). worrms: World Register of Marine Species (WoRMS) Client. R package version 0.4.2. https://CRAN.R-project.org/package=worrms

Horton, T., Kroh, A., et al. (2021). World Register of Marine Species. Available from https://www.marinespecies.org at VLIZ. doi:10.14284/170



biodiversity-aq/OmicsMetaData documentation built on Dec. 19, 2021, 9:44 a.m.