
The ultimate source of the data hosted at OomyceteDB is a single spreadsheet file stored on Google Drive called oomycetedb_source. Each time a release is made, the contents of this spreadsheet are copied to the oomycetedbdata package as CSV files and uploaded to GitHub. Code running on the oomycetedb.org website then references the GitHub releases to host the data publicly. Because all of this happens automatically, the spreadsheet on Google Drive must be formatted correctly. One of the main functions of this package is to check that the spreadsheet is formatted correctly before a release is made. This document describes the format of the database spreadsheet.

Spreadsheet file format

The source file for the database stored on Google Drive is a Google Sheets file called oomycetedb_source. This format is specific to Google's cloud services and must be converted to a different format before it can be downloaded. It is therefore probably best to edit the document online in Google Sheets rather than downloading it as an Excel or ODF file and uploading it again, since the conversion might not be perfect. The file should have the following sheets, each containing a single table:

  1. sequence_data: A table with one row per sequence and associated information
  2. taxon_data: A table with one row per unique taxonomic ID and associated information
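As a minimal sketch of how the sheet requirement above could be checked, the following Python function reports which of the two required sheets are absent from a list of sheet names. The function name `missing_sheets` and the idea of passing sheet names as a plain list are assumptions made for illustration; this is not the check the package itself performs.

```python
# Sheet names taken from the list above.
REQUIRED_SHEETS = ("sequence_data", "taxon_data")


def missing_sheets(sheet_names):
    """Return the required sheet names that are absent from sheet_names.

    Hypothetical helper: sheet_names is any iterable of the sheet titles
    found in the spreadsheet file.
    """
    present = set(sheet_names)
    return [name for name in REQUIRED_SHEETS if name not in present]


# Example with a made-up extra sheet called "notes".
print(missing_sheets(["sequence_data", "notes"]))
```

An empty result means both required sheets were found; extra sheets are ignored by this sketch.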

General table/sheet formatting guidelines

Since the spreadsheet file on Google Drive will generally be edited manually, it is understandable that the person editing it might want to make aesthetic formatting changes to make it more readable and easier to work with. However, it is important that aesthetic formatting does not interfere with the ability of computers to read the spreadsheet. Some changes, such as fonts, colors, and cell widths/heights, will not cause problems, but others, such as extra columns/rows and merged cells, will.
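To illustrate why extra columns and rows matter to a program, here is a rough sketch that scans one sheet exported as CSV text and reports unnamed columns, blank rows, and rows whose cell count differs from the header. The function name `find_layout_problems` and the column names in the example are hypothetical, and the checks are assumptions about what "extra columns/rows" might look like after export. Merged cells usually show up only as empty cells in a CSV export, so this sketch does not try to detect them.

```python
import csv
import io


def find_layout_problems(csv_text):
    """Return a list of human-readable layout problems in one CSV table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["sheet is empty"]

    problems = []
    header = rows[0]

    # A column without a header is often a stray column added while editing.
    for index, name in enumerate(header, start=1):
        if not name.strip():
            problems.append(f"column {index} has no header")

    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(rows[1:], start=2):
        if all(not cell.strip() for cell in row):
            problems.append(f"row {row_number} is blank")
        elif len(row) != len(header):
            problems.append(
                f"row {row_number} has {len(row)} cells, expected {len(header)}"
            )
    return problems


# Made-up table with an unnamed third column and a blank third row.
example = "id,name,\nA1,Phytophthora infestans,\n,,\n"
for problem in find_layout_problems(example):
    print(problem)
```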

Here are some examples of things that can be changed without causing problems:

These changes will probably cause problems and should be avoided:

Sequence data column format

The sequence_data sheet contains a table with one row per sequence and associated information. The following columns are required:

Taxon data column format

The taxon_data sheet contains a table with one row per unique taxonomic ID and associated information. All taxon IDs used in the sequence_data sheet must be present in this sheet (once for each unique ID), but the information for valid NCBI taxon IDs can be added automatically during validation. The following columns are required:

FASTA file format

The FASTA file is generated automatically from the releases of the database, so the user does not need to worry about the details of its format. However, it is recorded here for reference. Each sequence header has the following form:

>oodb_id=...|name=...|strain=...|genbank_id=...|taxon_id=...|classification=...

where each ... is replaced with the appropriate content.
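For readers who want to consume the FASTA file programmatically, the header layout can be split into its fields with a few lines of Python. This is a minimal sketch: the function name `parse_header` is hypothetical, the example values are made up, and the code assumes that field values never contain the `|` character, which the source does not state.

```python
def parse_header(line):
    """Split one FASTA header line into a dict mapping field name to value."""
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")

    fields = {}
    # Drop the leading '>' and any trailing newline, then split on '|'.
    for part in line[1:].rstrip("\n").split("|"):
        # partition splits on the first '=' only, so values may contain '='.
        key, separator, value = part.partition("=")
        if not separator:
            raise ValueError(f"field without '=': {part!r}")
        fields[key] = value
    return fields


# All values below are placeholders, not real database content.
header = (
    ">oodb_id=1|name=Example species|strain=X1|"
    "genbank_id=AB000000|taxon_id=12345|classification=A;B;C"
)
record = parse_header(header)
print(record["oodb_id"], record["taxon_id"])
```

Keeping every value as a string avoids guessing at field types that the format description does not specify.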



grunwaldlab/oomycetedbtools documentation built on March 23, 2022, 6:54 a.m.