parse_annotations: Accessory function to help build GENESPACE input files

parse_annotations    R Documentation

Accessory function to help build GENESPACE input files

Description

parse_annotations matches peptide and gff3 gene annotations and converts them to .bed format. This speeds up downstream computation and catches problems with annotation files. It is NOT required for GENESPACE, but it does help get the input files in order. There are many other ways to convert gff3 to bed and to match the names with the fasta headers.

parse_annotations parses the gff into bed format with one entry per primary transcript and ensures that the peptide fasta headers match the name column.

match_fasta2gff engine for reading, parsing and writing annotation files

parse_ncbi a shortcut for parse_annotations(presets = "ncbi") to maintain backwards compatibility with < v1.0.0.

parse_phytozome a shortcut for parse_annotations(presets = "phytozome") to maintain backwards compatibility with < v1.0.0.

parse_faHeader function to maintain backwards compatibility

Usage

parse_annotations(
  rawGenomeRepo,
  genomeDirs,
  genomeIDs = genomeDirs,
  gffString = "gff$|gff3$|gff3\\.gz$|gff\\.gz",
  faString = "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz",
  genespaceWd,
  minPepLen = 0,
  dropDuplicates = FALSE,
  removeNonAAs = TRUE,
  presets = "none",
  gffIdColumn = "Name",
  headerEntryIndex = 4,
  headerSep = " ",
  gffStripText = "",
  headerStripText = "locus=",
  convertSpecialCharacters = "_",
  chrIdDictionary = NULL,
  troubleShoot = FALSE,
  overwrite = FALSE
)

match_fasta2gff(
  path2fasta,
  path2gff,
  genespaceWd,
  genomeID,
  presets,
  gffIdColumn,
  headerEntryIndex,
  headerSep,
  minPepLen,
  dropDuplicates,
  removeNonAAs,
  gffStripText,
  headerStripText,
  chrIdDictionary,
  convertSpecialCharacters,
  troubleShoot
)

parse_ncbi(
  rawGenomeRepo,
  genomeDirs,
  genomeIDs = genomeDirs,
  gffString = "gff$|gff3$|gff3\\.gz$|gff\\.gz",
  faString = "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz",
  genespaceWd,
  troubleShoot = FALSE,
  ...
)

parse_phytozome(
  rawGenomeRepo,
  genomeDirs,
  genomeIDs = genomeDirs,
  gffString = "gff$|gff3$|gff3\\.gz$|gff\\.gz",
  faString = "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz",
  genespaceWd,
  troubleShoot = FALSE,
  ...
)

parse_faHeader(...)

Arguments

rawGenomeRepo

file path to the location of gff3 and fasta annotations

genomeDirs

character vector giving exact matches to subdirectories in rawGenomeRepo.

genomeIDs

character vector of length equal to genomeDirs. By default, takes values from genomeDirs, but, if specified, re-names the files accordingly. Useful if you want to shorten the names of genomes in your GENESPACE run.

gffString

regular expression of length 1 specifying the string to search for in rawGenomeRepo/genomeDirs that will exactly match the gff3-formatted annotation. Default is any text or .gz file ending in gff or gff3.

faString

same as gffString but for the fasta-formatted peptide annotation. Default is any text or .gz file ending in fa, fasta or faa.

genespaceWd

file.path of length 1 specifying the GENESPACE working directory. Will make two subdirectories: /bed and /peptide for the parsed annotations

minPepLen

numeric, specifying the shortest peptide (in daltons) to be kept

dropDuplicates

logical, should only one of a set of duplicated peptide sequences be kept?

removeNonAAs

logical, should "." and "-" characters be stripped from the amino acids?

presets

character string: "none", "phytozome" or "ncbi" which sets the below parameters to parse phytozome or ncbi-formatted annotations correctly. See details.

gffIdColumn

character, specifying the field name in the gff3 attributes column.

headerEntryIndex

integer specifying the field index in the fasta header which contains the gene ID information to match with the gff.

headerSep

character used as a field delimiter in the fasta header.

gffStripText

regular expression of length 1 specifying a gsub command to remove text from the gff ID.

headerStripText

like gffStripText, but for the fasta header

convertSpecialCharacters

Character string with a non-special character of length 1. Replaces special characters (punctuation other than ".", "-", and "_") if they are present in the gene IDs.

chrIdDictionary

a named vector where the names are the values in the first ("seqnames") gff3 column and the elements are the values that replace them.

troubleShoot

logical, should the raw and parsed files be printed?

overwrite

logical, should existing files be overwritten?

path2fasta

deprecated, kept to maintain backwards compatibility

path2gff

deprecated, kept to maintain backwards compatibility

genomeID

single genomeID to consider

...

additional arguments passed on
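
The two text-cleaning arguments above (convertSpecialCharacters and removeNonAAs) can be illustrated with a small base R sketch. The gene IDs and peptide strings are made up, and the regular expressions are an assumption about the behavior described here, not the package's internal code.

```r
# Hypothetical gene IDs and peptide sequences, for illustration only
ids  <- c("AT1G01010.1", "gene|1:a", "Potri-001_G0004")
peps <- c("MKT.LL-A", "MAAGK")

# convertSpecialCharacters = "_": replace punctuation other than ".", "-", "_"
# (assumed regex; the negative lookahead skips the three permitted characters)
cleanIds <- gsub("(?![._-])[[:punct:]]", "_", ids, perl = TRUE)

# removeNonAAs = TRUE: strip "." and "-" from the amino acid strings
cleanPeps <- gsub("[.-]", "", peps)

print(cleanIds)
print(cleanPeps)
```

IDs that contain only letters, digits, ".", "-" and "_" pass through unchanged under this sketch.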

Details

parse_annotations assumes that you have a 'rawGenomeRepo' directory that contains a subdirectory for each genome to parse. These subdirectory names are given in "genomeDirs". So, if rawGenomeRepo = "~/Desktop/genomeRepo" and genomeDirs = c("human", "mouse"), then parse_annotations assumes there are two directories: ~/Desktop/genomeRepo/human and ~/Desktop/genomeRepo/mouse. Each of these directories must contain a gff3-formatted gene annotation and a fasta-formatted peptide annotation. These annotation files can be further nested in the subdirectories, but each must be named so that "gffString" and "faString" each match exactly one file. If multiple (or no) files match these strings, an error will be returned.
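
The layout and the one-match-per-pattern rule can be tried out in a temporary directory. This is a minimal sketch: the file names and the helper find_one are hypothetical, and the search logic is an assumption based on the description above rather than the package's own implementation.

```r
# Build a throwaway repo with the layout described above (hypothetical file names)
rawGenomeRepo <- file.path(tempdir(), "genomeRepo")
genomeDirs <- c("human", "mouse")
for (g in genomeDirs) {
  # annotation files may be nested below the genome directory
  d <- file.path(rawGenomeRepo, g, "annotation")
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(d, paste0(g, ".gff3.gz")))
  file.create(file.path(d, paste0(g, ".protein.faa")))
}

# Hypothetical helper: return the single file under `dir` whose name
# matches `pattern`, or stop if there are zero or several matches
find_one <- function(dir, pattern) {
  hits <- list.files(dir, pattern = pattern, recursive = TRUE, full.names = TRUE)
  if (length(hits) != 1) {
    stop("expected exactly 1 match in ", dir, ", found ", length(hits))
  }
  hits
}

# Default search strings from the Usage section
gffString <- "gff$|gff3$|gff3\\.gz$|gff\\.gz"
faString <- "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz"

genomePaths <- file.path(rawGenomeRepo, genomeDirs)
gffFiles <- vapply(genomePaths, find_one, character(1),
                   pattern = gffString, USE.NAMES = FALSE)
faFiles <- vapply(genomePaths, find_one, character(1),
                  pattern = faString, USE.NAMES = FALSE)

print(basename(gffFiles))
print(basename(faFiles))
```

If a genome directory held two gff files, find_one would stop with an error, which mirrors the requirement that each search string be uniquely findable.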

Given differences in how gff3 and fasta headers are constructed, there are a number of parameters to choose how the files should be matched. Unless the files come from phytozome or NCBI, these need to be chosen manually. If you have differently named or formatted files, you can run parse_annotations several times with different parameters and paths to the files.
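
To see how headerSep, headerEntryIndex and headerStripText interact, the following sketch pulls a gene ID out of a fasta header. The helper parse_header_id and both example headers are hypothetical (written in the style of Phytozome and NCBI files, not copied from real ones), and treating headerSep as a fixed string is an assumption.

```r
# Hypothetical helper: split the header on headerSep, keep one field,
# then remove any text matching headerStripText
parse_header_id <- function(header, headerSep, headerEntryIndex, headerStripText) {
  header <- sub("^>", "", header)
  fields <- strsplit(header, headerSep, fixed = TRUE)
  entry <- vapply(fields, function(x) x[headerEntryIndex], character(1))
  if (nzchar(headerStripText)) {
    entry <- gsub(headerStripText, "", entry)
  }
  entry
}

# Illustrative headers only
phytozomeHeader <- ">Gene0001.1.p pacid=1 transcript=Gene0001.1 locus=Gene0001 ID=Gene0001.1.v1"
ncbiHeader <- ">lcl|chr1_prot_XP_1.1_1 [gene=GENE1] [protein=hypothetical protein]"

# Phytozome-style values: 4th space-separated field, strip "locus="
phytozomeGene <- parse_header_id(phytozomeHeader, " ", 4, "locus=")

# NCBI-style values: 2nd field, strip "gene=" and the square brackets
ncbiGene <- parse_header_id(ncbiHeader, " ", 2, "gene=|\\[|\\]")

print(phytozomeGene)
print(ncbiGene)
```

For a differently formatted header, the same three values are the ones to adjust until the extracted ID equals the ID stored in the gff.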

For each genomeDir, the pair of files is read in, parsed, matched, then written to $genespaceWd/$genomeID/peptide and $genespaceWd/$genomeID/bed respectively. By default, the genomeID is the same as the genomeDir, but this can be customized with genomeIDs.
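
A call might look like the sketch below. The paths and the shortened genome IDs are hypothetical, and the call is only attempted when GENESPACE is installed and the repository directory exists; the argument names come from the Usage section.

```r
# Hypothetical paths and IDs; replace with your own
callArgs <- list(
  rawGenomeRepo = "~/Desktop/genomeRepo",
  genomeDirs = c("human", "mouse"),
  genomeIDs = c("hs", "mm"),          # shorter names for the GENESPACE run
  genespaceWd = "~/Desktop/genespaceRun",
  presets = "ncbi"
)

# Only attempt the call when GENESPACE is installed and the repo exists
canRun <- requireNamespace("GENESPACE", quietly = TRUE) &&
  dir.exists(path.expand(callArgs$rawGenomeRepo))

if (canRun) {
  parsedPaths <- do.call(GENESPACE::parse_annotations, callArgs)
  print(parsedPaths)
}
```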

Presets: In the case of "ncbi", this assumes that you have downloaded the 'translated_cds' peptide file and gene.gff3 annotation. Given these files, it uses the following parameters: gffIdColumn <- "gene"; headerEntryIndex <- 2; headerSep <- " "; gffStripText <- ""; headerStripText <- "gene=|\\[|\\]". Preset "ncbi" also builds a chrIdDictionary to rename sequence IDs based on entries labeled "chromosome" in the third gff3 column.

Phytozome presets: gffIdColumn <- "Name"; headerEntryIndex <- 4; headerSep <- " "; gffStripText <- ""; headerStripText <- "locus="
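
On the gff side, gffIdColumn names the key to read from the attributes column. The sketch below shows one way such a lookup could work; the helper get_gff_attribute and the attribute strings are hypothetical, and the parsing is an assumption about the behavior rather than the package's own code.

```r
# Hypothetical helper: pull the value of one key from a gff3 attributes
# string of the form "key1=value1;key2=value2;..."
get_gff_attribute <- function(attributes, gffIdColumn, gffStripText = "") {
  pattern <- paste0("(^|;)", gffIdColumn, "=([^;]*)")
  m <- regmatches(attributes, regexec(pattern, attributes))
  # each match holds: full match, leading delimiter, value
  id <- vapply(m, function(x) if (length(x) == 3) x[3] else NA_character_,
               character(1))
  if (nzchar(gffStripText)) {
    id <- gsub(gffStripText, "", id)
  }
  id
}

# Illustrative attribute strings only
phytozomeAttr <- "ID=Gene0001.v1;Name=Gene0001;ancestorIdentifier=Gene0001.v0"
ncbiAttr <- "ID=gene-GENE1;Dbxref=GeneID:1;Name=GENE1;gene=GENE1;gene_biotype=protein_coding"

phytozomeId <- get_gff_attribute(phytozomeAttr, "Name")  # phytozome preset key
ncbiId <- get_gff_attribute(ncbiAttr, "gene")            # ncbi preset key

print(phytozomeId)
print(ncbiId)
```

A key that is absent from the attributes string yields NA in this sketch, which is a quick way to spot a mismatched gffIdColumn.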

Value

a data.table containing the file paths to the raw and parsed annotations.

Examples

## Not run: 
# coming soon

## End(Not run)



jtlovell/GENESPACE documentation built on Jan. 25, 2025, 6:39 a.m.