In biodiversity-aq/OmicsMetaData: OmicsMetaData: an R-package for interoperable and re-usable biodiversity 'omics (meta)data

OmicsMetaData: an R-package for interoperable and re-usable biodiversity 'omics data

General Overview

The OmicsMetaData package was developed as a set of tools to download, standardize and handle 'omics datasets, and more particularly the non-nucleotide sequence metadata components. The aim is to help improve data interoperability and re-use in biodiversity 'omics fields, such as metabarcoding (e.g. eDNA, amplicon sequencing of microbes) and metagenomics (e.g. shotgun metagenome sequencing). More specifically, this package offers users tools to integrate online open access data and metadata into their analyses, and to standardize their de novo data and metadata following the Minimum Information about any (x) Sequence (MIxS) or the DarwinCore (DwC) standards, so that they can be used in integrated analyses with other data or archived on online nucleotide sequence data repositories (e.g. ENA, SRA, GenBank) in accordance with the FAIR principles.

Downloading data

The main objective and focus of the OmicsMetaData package is re-use of 'omics data, as part of the philosophy of the FAIR principles for data (FAIR standing for Findable, Accessible, Interoperable and Re-usable data). The International Nucleotide Sequence Database Consortium (INSDC), which unites NCBI, ENA-EMBL and DDBJ, houses the vast majority of the world's publicly available DNA sequence data, including marker gene, genome, amplicon/metabarcode, and shotgun metagenome data from past scientific projects all over the world. The OmicsMetaData package contains several functions that help with downloading data (sequences and metadata) from INSDC. See a detailed example in the Retrieving_online_data vignette. Note that some functions require an NCBI API-key, which can be created after registering an account at NCBI. More information on the API client of NCBI can be found here: https://www.ncbi.nlm.nih.gov/home/develop/api/

To download individual sequence files from INSDC, or all sequence files from a BioProject, use the download.sequences.INSDC function. This writes the sequence files to the destination.path and, if keep.metadata = TRUE, returns the metadata, which is assigned here to the metadata_of_seq object in the R environment.

metadata_of_seq <- download.sequences.INSDC(BioPrj, destination.path = "_path_", apiKey, unzip = FALSE, keep.metadata = TRUE, download.sequences = TRUE)

It is also possible to download only the sequence metadata. This can either be the minimal set of technical metadata, using the get.BioProject.metadata.INSDC function, or the full set of all available metadata, using the get.sample.attributes.INSDC function (NCBI API-key required).

metadata_of_seq <- get.BioProject.metadata.INSDC(BioPrjct, just.names=FALSE)
metadata_of_seq <- get.sample.attributes.INSDC(BioPrjct, apiKey)
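Since several of these functions take an API key as an argument, it can help to keep the key out of the script itself. The following is a minimal sketch in base R, not part of OmicsMetaData: the helper name get_ncbi_key and the environment variable name NCBI_API_KEY are assumptions made for illustration.

```r
# Hypothetical helper (not part of OmicsMetaData): read the NCBI API key
# from an environment variable so that it never appears in a script.
# The variable name "NCBI_API_KEY" is an assumption, not a package convention.
get_ncbi_key <- function(var = "NCBI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No NCBI API key found; set the ", var, " environment variable.")
  }
  key
}

# Possible usage, with the variable set beforehand (e.g. in ~/.Renviron):
# apiKey <- get_ncbi_key()
# metadata_of_seq <- get.sample.attributes.INSDC(BioPrjct, apiKey)
```

Failing early with a clear message avoids sending requests with an empty key.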

Standardizing data

Another important objective that is embedded in the FAIR principles is data interoperability, which can only be achieved by standardizing datasets across many users. Therefore, this package offers a suite of tools to standardize datasets using the MIxS (Yilmaz et al. 2011) and DarwinCore (Darwin Core Task Group. 2009) data standards. For details on these standards and their implementation, see the Background vignette. For more information on MIxS, see https://gensc.org/mixs/ and https://github.com/GenomicsStandardsConsortium/mixs. For more information on DarwinCore, see https://dwc.tdwg.org and https://github.com/tdwg/dwc. For detailed examples on standardizing metadata, see the Metadata_standardization vignette.

A first step in standardizing the metadata part of an 'omics dataset is implementing a data standard. This means (1) variable names must be those of the standard's terminology, including the use of capitals and type of word spacing, (2) values must adhere to the rules attached to specific terms, like the use of a certain unit, the notation of multiple values with a separator or the use of a vocabulary like ENVO, and (3) the minimal set of information to describe a dataset that is specified by the standard must be given. Note that any additional information that is not covered by a term in the standard is better added using a non-standard term than left out (= loss of data). This job can't be fully automated, and must be done manually. The full list of MIxS or DarwinCore terms can be accessed by calling the TermsLib data object. The definition of a term can be called with the term.definition function. Running dataQC.TermsCheck can also help to find a match for non-standard variable names, but human interpretation and supervision is required.

term.definition("_MIxS_term_")
dataQC.TermsCheck(observed=c("_list_of_terms_"), exp.standard="MIxS", exp.section=NA, fuzzy.match=TRUE, out.type="full")
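To illustrate what fuzzy matching of variable names against a standard involves, here is a minimal base R sketch using edit distance. It is not the implementation of dataQC.TermsCheck: the function suggest_term, its max_dist threshold and the short term list are assumptions made for this example.

```r
# Hypothetical sketch: suggest the closest standard term for each observed
# variable name, based on case-insensitive edit distance (utils::adist).
# Assumes the observed names contain no missing values.
suggest_term <- function(observed, standard_terms, max_dist = 3) {
  d <- utils::adist(tolower(observed), tolower(standard_terms))
  best <- apply(d, 1, which.min)                      # closest term per name
  best_dist <- d[cbind(seq_along(observed), best)]    # its distance
  data.frame(
    observed   = observed,
    suggestion = ifelse(best_dist <= max_dist, standard_terms[best], NA_character_),
    distance   = best_dist,
    stringsAsFactors = FALSE
  )
}

# A small subset of MIxS term names, for illustration only
std <- c("lat_lon", "collection_date", "env_broad_scale", "depth", "temp")
suggest_term(c("collectiondate", "Lat_Lon", "sequencing_centre_contact"), std)
```

Names that are too far from every term get no suggestion, which leaves the decision to the user, in line with the need for human supervision.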

To help with data standardization, there are also a number of data formatting tools. combine.data and combine.data.frame can combine MIxS.metadata objects or data.frame objects, respectively, aligning variable or sample names.

combine.data(d1=MIxS.metadata_1, d2=MIxS.metadata_2, fill=NA, variables.as.cols=TRUE)
combine.data.frame(df1, df2, fill=NA, merge.cols=TRUE, original_rowName.col=TRUE, merge.rows="df1")
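The idea of aligning variables while combining tables can be sketched in base R as follows. This is an illustration only, not the code behind combine.data.frame; the function name combine_by_columns and the example data are assumptions.

```r
# Hypothetical sketch: stack two data.frames with partly different columns,
# filling the cells that one of them lacks with a chosen value.
combine_by_columns <- function(df1, df2, fill = NA) {
  all_cols <- union(names(df1), names(df2))
  pad <- function(df) {
    for (m in setdiff(all_cols, names(df))) {
      df[[m]] <- rep(fill, nrow(df))        # add the missing variable
    }
    df[, all_cols, drop = FALSE]            # same column order in both
  }
  rbind(pad(df1), pad(df2))
}

df1 <- data.frame(depth = c(5, 10), temp = c(1.2, 0.8), row.names = c("s1", "s2"))
df2 <- data.frame(depth = 20, salinity = 34.5, row.names = "s3")
combined <- combine_by_columns(df1, df2)
combined
```

Sample names are kept as row names, so no sample loses its identity when variables are missing on one side.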

When using DarwinCore, 'omics metadata can be formatted as an "extended Measurement or Fact" (eMoF) extension, which is in a long format (one variable-sampleName-value record per row) instead of a wide table (sample x variable). The eMoF.to.wideTable and wideTable.to.eMoF functions can help convert between these formats.

formatted_dataset <- eMoF.to.wideTable(dataset)
dataset <- wideTable.to.eMoF(dataset=formatted_dataset)
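The difference between the two layouts can be shown with a small base R sketch. It is not the package's implementation; the function names emof_to_wide and wide_to_emof are hypothetical, and the example uses only the DarwinCore terms eventID, measurementType and measurementValue.

```r
# Hypothetical sketch: one measurement per row -> sample x variable table
emof_to_wide <- function(emof) {
  samples <- unique(emof$eventID)
  vars <- unique(emof$measurementType)
  wide <- matrix(NA_character_, nrow = length(samples), ncol = length(vars),
                 dimnames = list(samples, vars))
  # index the matrix by (sample name, variable name) pairs
  wide[cbind(emof$eventID, emof$measurementType)] <- emof$measurementValue
  as.data.frame(wide, stringsAsFactors = FALSE)
}

# Hypothetical sketch: sample x variable table -> one measurement per row
wide_to_emof <- function(wide) {
  out <- data.frame(
    eventID          = rep(rownames(wide), times = ncol(wide)),
    measurementType  = rep(colnames(wide), each = nrow(wide)),
    measurementValue = unlist(wide, use.names = FALSE),
    stringsAsFactors = FALSE
  )
  out[!is.na(out$measurementValue), ]       # drop empty cells
}

emof <- data.frame(
  eventID          = c("s1", "s1", "s2"),
  measurementType  = c("depth", "temp", "depth"),
  measurementValue = c("5", "1.2", "10"),
  stringsAsFactors = FALSE
)
wide <- emof_to_wide(emof)
back <- wide_to_emof(wide)
```

Values are kept as character strings here because an eMoF column mixes variables of different types.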

When the variable names in the dataset are standardized following DwC or MIxS, the values of each variable still need to comply with the rules of that variable. For example, latitudes and longitudes must be in decimal degrees and not in degrees, minutes and seconds, and dates must be in the ISO 8601 (YYYY-MM-DD) format. For each variable, the associated value rules can usually be found in the variable's definition. In the OmicsMetaData package, checking and re-formatting values can be done for each variable individually, or in an automated way using the dataQC.MIxS or the dataQC.DwC_general and dataQC.DwC functions (this will be shown later).

There are functions to ...

... identify a date column in a dataset and format it as ISO 8601 (YYYY-MM-DD)

dataset_QC <- dataQC.dateCheck(dataset, date.colnames=c("collectionDate"))
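As an illustration of what date standardization involves, the base R sketch below tries a few input formats in turn and writes the result as ISO 8601. It is not the logic of dataQC.dateCheck; the function name to_iso_date and the list of accepted formats are assumptions.

```r
# Hypothetical sketch: convert dates written in a few known formats to
# ISO 8601 (YYYY-MM-DD); values that match no format become NA.
to_iso_date <- function(x, formats = c("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")) {
  out <- rep(as.Date(NA), length(x))
  for (f in formats) {
    todo <- is.na(out)                       # only retry unparsed values
    out[todo] <- as.Date(x[todo], format = f)
  }
  format(out, "%Y-%m-%d")
}

to_iso_date(c("2017-01-15", "15/01/2017", "20170115", "not a date"))
```

The order of the formats matters: a value such as 03/04/2017 is ambiguous between day-month and month-day notation, so the convention used in a dataset has to be known before conversion.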

... identify one or more columns that have the geographic coordinates (longitude and/or latitude), and format them to decimal degrees (with automatic detection of the original format, including NSWE notations and degree-minute-second symbols). Conversion can also be done for a single value with the coordinate.to.decimal function (which is the work-horse of the dataQC.LatitudeLongitudeCheck function).

dataset_QC <- dataQC.LatitudeLongitudeCheck(dataset, latlon.colnames=c("lat_lon"))
coordinate.to.decimal("66° 34′") # will return 66.56667
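The arithmetic behind such a conversion is degrees + minutes/60 + seconds/3600, with a negative sign for southern and western hemispheres. A minimal base R sketch for a single value follows; it is not the code of coordinate.to.decimal, and the function name dms_to_decimal is hypothetical.

```r
# Hypothetical sketch: convert one coordinate string to decimal degrees.
# Up to three numbers are read as degrees, minutes and seconds; an S or W
# letter, or a leading minus sign, makes the result negative.
dms_to_decimal <- function(x) {
  sign <- if (grepl("[SWsw]", x) || grepl("^\\s*-", x)) -1 else 1
  nums <- as.numeric(regmatches(x, gregexpr("[0-9]+\\.?[0-9]*", x))[[1]])
  nums <- c(nums, 0, 0)[1:3]                 # missing minutes/seconds = 0
  sign * (nums[1] + nums[2] / 60 + nums[3] / 3600)
}

dms_to_decimal("66\u00b0 34\u2032")   # 66 + 34/60, prints as 66.56667
dms_to_decimal("12 30 36 S")          # -(12 + 30/60 + 36/3600)
dms_to_decimal("-45.5")               # already decimal, kept as is
```

This simple version would misread free text that happens to contain the letters s or w, so a real check needs stricter parsing.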

Once the coordinates are standardized, they can be converted to WKT notation with the dataQC.generate.footprintWKT function.

dataset_QC <- dataQC.generate.footprintWKT(dataset, NA.val=NA)
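For point locations, a WKT string simply lists the x (longitude) and y (latitude) values inside POINT (...). The base R sketch below shows this for vectors of decimal coordinates; it is an illustration, not the implementation of dataQC.generate.footprintWKT, and the function name point_wkt is hypothetical.

```r
# Hypothetical sketch: build WKT points from decimal longitude and latitude.
# Rows with a missing coordinate get NA instead of an incomplete string.
point_wkt <- function(lon, lat) {
  ifelse(is.na(lon) | is.na(lat),
         NA_character_,
         paste0("POINT (", lon, " ", lat, ")"))
}

point_wkt(lon = c(-62.5, NA), lat = c(-64.25, 10))
```

Note the order: WKT puts longitude first, the reverse of the common "latitude, longitude" notation.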

... correct trailing spaces or common typos in a list of taxonomic names with the dataQC.taxaNames function.

dataQC.taxaNames(taxaNames)
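The whitespace part of such a clean-up can be sketched in a line of base R. This is only an illustration of the idea and does not reproduce the typo corrections of dataQC.taxaNames; the function name clean_names is hypothetical.

```r
# Hypothetical sketch: remove leading/trailing spaces and collapse
# repeated spaces inside taxonomic names.
clean_names <- function(x) {
  gsub("\\s+", " ", trimws(x))
}

clean_names(c(" Aulacoseira ", "Fragilariopsis  kerguelensis"))
```

Stray spaces are worth removing first, because they make otherwise identical names fail to match in a registry lookup.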

... complete a list of taxonomic names by looking up missing information on the World Register of Marine Species (WoRMS, Horton et al. 2021), using the WoRMS API and the worrms R-package (Chamberlain, 2020). More information here: https://www.marinespecies.org/aphia.php?p=webservice&type=r

dataQC.completeTaxaNamesFromRegistery("Aulacoseira", taxBackbone="worms")

# will return:
#ScientificName | scientificNameID  
#Aulacoseira    | urn:lsid:marinespecies.org:taxname:148959 
#
#aphID   | kingdom    | phylum      | class
#148959  | Chromista  | Ochrophyta  | Bacillariophyceae
#
#order           | family           | genus 
#Aulacoseirales  | Aulacoseiraceae  | Aulacoseira            
#
#specificEpithet  | scientificNameAuthorship  | namePublishedInYear
#<NA>             | G.H.K. Thwaites           | 1848

Data that can serve as input for the dataQC.completeTaxaNamesFromRegistery function can also be extracted from a dataset using the dataQC.TaxonListFromData function.

taxList <- dataQC.TaxonListFromData(dataset)

Standardizing metadata can also be semi-automated for the MIxS (dataQC.MIxS function) and DarwinCore standards (dataQC.DwC_general and dataQC.DwC). These functions look at the variable names, and in some cases at the values as well (see function documentation for details). The output of these functions are S4 class objects that take different aspects of metadata into account (see further).

dataset.MIxS <- dataQC.MIxS(dataset, ask.input=TRUE, sample.names = "_sample_name_column_")

In the case of DarwinCore, core (event or occurrence) and extension (eMoF) files can be handled separately using the dataQC.DwC_general function, with the DwC.type parameter specifying the type of the input. These data.frames can also be compared in a single call, returning a DarwinCore archive style object, using the dataQC.DwC function. At present, the core and extension files are limited to event and occurrence cores, and eMoF as extension.

dataset.DwC <- dataQC.DwC_general(dataset, DwC.type = "event", ask.input = TRUE, complete.data=TRUE)

dataset.DwC.archive <- dataQC.DwC(Event=dataset_event, eMoF=dataset_emof, EML.url="_url_to_EML_XML_file_", out.type="event", ask.input=TRUE)

Data objects

With 'omics datasets, it is no longer possible to display all data in a single sample x variable table. There are additional dimensions to the data that need to be taken into account, the most obvious being the sequence files. But there is also other information, often referred to as metadata, that is needed to contextualize the dataset. This includes units of variables, project information, applied lab and sequencing protocols, links to the location of the sequence data, associated publications, and so on.

To be able to work with these dimensions of data, the OmicsMetaData package defines new S4 class objects. Slots of these objects are accessed using "@" instead of "$". The MIxS.metadata object has slots for, among others, the data (values), units, the MIxS environmental package name and the section of MIxS where each variable comes from. A DwC.event object has slots for an event core and extensions, while a DwC.occurrence object has an occurrence file at its core. The validity of these objects can be checked with check.valid.metadata.MIxS and check.valid.metadata.DwC, respectively. A MIxS.metadata object can also be written to a CSV file with the write.MIxS method.
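For readers unfamiliar with S4, the sketch below defines a toy class with a data slot and a units slot, plus a validity rule. The class name MetaSketch and its slots are hypothetical and much simpler than the MIxS.metadata class; the example only shows how slots and validity checks work in general.

```r
library(methods)

# Hypothetical toy class: a table of values plus the unit of each variable.
setClass(
  "MetaSketch",
  slots = c(data = "data.frame", units = "character"),
  validity = function(object) {
    # every unit must belong to a variable that exists in the data
    unknown <- setdiff(names(object@units), colnames(object@data))
    if (length(unknown) > 0) {
      paste("units given for unknown variables:", paste(unknown, collapse = ", "))
    } else {
      TRUE
    }
  }
)

meta <- new("MetaSketch",
            data  = data.frame(depth = c(5, 10), row.names = c("s1", "s2")),
            units = c(depth = "m"))

meta@data               # slots are reached with "@"
meta@units[["depth"]]   # "m"
validObject(meta)       # throws an error if the object is inconsistent
```

Keeping units in their own slot means the values stay numeric while the unit information travels with the object.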

Archiving data

Archiving data, and getting it into the right format and standard, can be very challenging to say the least. Depending on the platform or database where the sequence data and metadata are archived, different requirements can be set and subtle variations to data standards implemented.

Therefore, the OmicsMetaData package has a suite of functions to format (meta-)data so that it can be uploaded directly to the European Nucleotide Archive (ENA, part of INSDC). Details on archiving data are set out in the Perpare_data_for_archiving vignette.

Synchronizing metadata and sequence files

The names of sequence files can be extracted from a folder with the FileNames.to.Table function.

seq_files <- FileNames.to.Table("_path_to_seqFiles_")

These sequence files can also be compared to the sample names in a metadata file to ensure there are no discrepancies.

sync.metadata.sequenceFiles(row.names(main_metadata.MIxS@data), file.dir="_path_to_seqFiles_")
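What such a comparison checks can be sketched in base R: which samples have no sequence file, and which files belong to no sample. This is an illustration under simple assumptions (one file per sample, named after the sample, with a .fastq or .fq extension, optionally gzipped); the function name find_unmatched is hypothetical and is not the package's sync.metadata.sequenceFiles.

```r
# Hypothetical sketch: compare sample names with the file names in a folder.
find_unmatched <- function(sample_names, file_dir) {
  files <- list.files(file_dir)
  stems <- sub("\\.(fastq|fq)(\\.gz)?$", "", files)   # strip the extension
  list(
    samples_without_file = setdiff(sample_names, stems),
    files_without_sample = files[!stems %in% sample_names]
  )
}

# Small demonstration in a temporary folder
seq_dir <- file.path(tempdir(), "seq_sketch")
dir.create(seq_dir, showWarnings = FALSE)
file.create(file.path(seq_dir, c("s1.fastq.gz", "s3.fastq")))
find_unmatched(c("s1", "s2"), seq_dir)
```

Reporting both directions of the mismatch helps to spot misspelled sample names as well as missing files.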

If necessary, sequence files can be renamed in bulk first, based on a dataframe with a column of old names and a column of corresponding new names.

names_dic <- data.frame(seq_facility_name = c("AA1234", "BB2345"), original_name = c("sampleLocation_year_1", "sampleLocation_year_2"))
renameSequenceFiles(names_dic, colNameOld="seq_facility_name", colNameNew="original_name", file.dir= "_path_to_seqFiles_")
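The mechanics of a table-driven rename can be sketched with base R's file.rename. The function rename_by_table below is hypothetical and is not the package's renameSequenceFiles; it assumes that every file name starts with the old name and that no old name is a prefix of another.

```r
# Hypothetical sketch: rename files whose names start with an old name,
# replacing that prefix with the new name and keeping the rest (extension).
rename_by_table <- function(dic, old_col, new_col, file_dir) {
  files <- list.files(file_dir)
  for (i in seq_len(nrow(dic))) {
    old <- as.character(dic[[old_col]][i])
    new <- as.character(dic[[new_col]][i])
    hits <- files[startsWith(files, old)]
    if (length(hits) == 0) next
    targets <- paste0(new, substring(hits, nchar(old) + 1))
    file.rename(file.path(file_dir, hits), file.path(file_dir, targets))
  }
  invisible(list.files(file_dir))
}

# Small demonstration in a temporary folder
ren_dir <- file.path(tempdir(), "rename_sketch")
dir.create(ren_dir, showWarnings = FALSE)
file.create(file.path(ren_dir, c("AA1234.fastq", "BB2345.fastq")))
dic <- data.frame(old = c("AA1234", "BB2345"),
                  new = c("sampleLocation_year_1", "sampleLocation_year_2"))
rename_by_table(dic, old_col = "old", new_col = "new", file_dir = ren_dir)
list.files(ren_dir)
```

Working on a copy of the sequence folder is a sensible precaution, since renaming is not easily undone.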

To help with uploading 'omics dataset metadata to ENA, it can be formatted as a TSV file that is accepted by ENA. See a detailed description on archiving data on ENA here: https://ena-docs.readthedocs.io/en/latest/submit/fileprep/upload.html

prep.metadata.ENA(metadata=a_metadata.MIxS_object, dest.dir=getwd(), file.name="_a_file_name_")

Citation

Yilmaz, P., Gilbert, J., Knight, R. et al. The genomic standards consortium: bringing standards to life for microbial ecology. ISME J 5, 1565–1567 (2011). https://doi.org/10.1038/ismej.2011.39

Darwin Core Task Group. 2009. Darwin Core. Biodiversity Information Standards (TDWG) http://www.tdwg.org/standards/450

Chamberlain, S. (2020). worrms: World Register of Marine Species (WoRMS) Client. R package version 0.4.2. https://CRAN.R-project.org/package=worrms

Horton, T., Kroh, A., et al. (2021). World Register of Marine Species. Available from https://www.marinespecies.org at VLIZ. doi:10.14284/170

biodiversity-aq/OmicsMetaData documentation built on Dec. 19, 2021, 9:44 a.m.
