parse_annotations: Accessory function to help build GENESPACE input files
In jtlovell/GENESPACE: Synteny- and orthology-constrained comparative genomics

parse_annotations

R Documentation

Accessory function to help build GENESPACE input files

Description

parse_annotations Peptide and gff3 gene annotation matching and conversion to .bed format. This speeds up downstream computation and catches problems with annotation files. It is NOT required for GENESPACE, but it helps get the input files in order. There are many other ways to convert gff3 to bed and to match the names with fasta headers.

parse_annotations parses a gff into bed format with one entry per primary transcript and ensures that the peptide fasta headers match the name column

match_fasta2gff engine for reading, parsing and writing annotation files

parse_ncbi a shortcut for parse_annotations(preset = "ncbi") to maintain backwards compatibility with < v1.0.0.

parse_phytozome a shortcut for parse_annotations(preset = "phytozome") to maintain backwards compatibility with < v1.0.0.

parse_faHeader function to maintain backwards compatibility

Usage

parse_annotations(
  rawGenomeRepo,
  genomeDirs,
  genomeIDs = genomeDirs,
  gffString = "gff$|gff3$|gff3\\.gz$|gff\\.gz",
  faString = "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz",
  genespaceWd,
  minPepLen = 0,
  dropDuplicates = FALSE,
  removeNonAAs = TRUE,
  presets = "none",
  gffIdColumn = "Name",
  headerEntryIndex = 4,
  headerSep = " ",
  gffStripText = "",
  headerStripText = "locus=",
  convertSpecialCharacters = "_",
  chrIdDictionary = NULL,
  troubleShoot = FALSE,
  overwrite = FALSE
)

match_fasta2gff(
  path2fasta,
  path2gff,
  genespaceWd,
  genomeID,
  presets,
  gffIdColumn,
  headerEntryIndex,
  headerSep,
  minPepLen,
  dropDuplicates,
  removeNonAAs,
  gffStripText,
  headerStripText,
  chrIdDictionary,
  convertSpecialCharacters,
  troubleShoot
)

parse_ncbi(
  rawGenomeRepo,
  genomeDirs,
  genomeIDs = genomeDirs,
  gffString = "gff$|gff3$|gff3\\.gz$|gff\\.gz",
  faString = "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz",
  genespaceWd,
  troubleShoot = FALSE,
  ...
)

parse_phytozome(
  rawGenomeRepo,
  genomeDirs,
  genomeIDs = genomeDirs,
  gffString = "gff$|gff3$|gff3\\.gz$|gff\\.gz",
  faString = "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz",
  genespaceWd,
  troubleShoot = FALSE,
  ...
)

parse_faHeader(...)

Arguments

`rawGenomeRepo`	file path to the location of gff3 and fasta annotations
`genomeDirs`	character vector giving exact matches to subdirectories in rawGenomeRepo.
`genomeIDs`	character vector of length equal to genomeDirs. By default, takes values from genomeDirs, but, if specified, re-names the files accordingly. Useful if you want to shorten the names of genomes in your GENESPACE run.
`gffString`	regular expression of length 1 specifying the string to search for in rawGenomeRepo/genomeDirs that will uniquely match the gff3-formatted annotation. Default is any text or .gz file ending in gff or gff3.
`faString`	same as gffString but for the fasta-formatted peptide annotation. Default is any text or .gz file ending in fa, fasta or faa.
`genespaceWd`	file.path of length 1 specifying the GENESPACE working directory. Will make two subdirectories: /bed and /peptide for the parsed annotations
`minPepLen`	numeric, specifying the length (in amino acids) of the shortest peptide to be kept
`dropDuplicates`	logical, should only one of a set of duplicated peptide sequences be kept?
`removeNonAAs`	logical, should "." and "-" characters be stripped from the amino acids?
`presets`	character string: "none", "phytozome" or "ncbi" which sets the below parameters to parse phytozome or ncbi-formatted annotations correctly. See details.
`gffIdColumn`	character, specifying the field name in the gff3 attributes column.
`headerEntryIndex`	integer specifying the field index in the fasta header which contains the gene ID information to match with the gff.
`headerSep`	character used as a field delimiter in the fasta header.
`gffStripText`	regular expression of length 1 specifying a gsub command to remove text from the gff ID.
`headerStripText`	like gffStripText, but for the fasta header
`convertSpecialCharacters`	Character string with a non-special character of length 1. Replaces special characters (punctuation other than ".", "-", and "_") if they are present in the gene IDs.
`chrIdDictionary`	a named vector where the names are the values in the first ("seqnames") gff3 column and the elements are the values that replace them.
`troubleShoot`	logical, should the raw and parsed files be printed?
`overwrite`	logical, should existing files be overwritten?
`path2fasta`	deprecated, kept to maintain backwards compatibility
`path2gff`	deprecated, kept to maintain backwards compatibility
`genomeID`	single genomeID to consider
`...`	additional arguments passed on
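
The sequence and ID cleanup arguments above can be illustrated in base R. This is a minimal sketch, not GENESPACE's internal code: the IDs and sequences are made up, the order of the steps is an assumption, and dropDuplicates is set to TRUE here (the default is FALSE) only to show its effect.

```r
# Made-up gene IDs and peptide sequences (illustrative only)
ids  <- c("gene:1", "gene|2", "gene.3", "gene-4_a")
seqs <- c("MKT.LLV-A", "MK", "MKTLLVA", "MSTNPKPQR")

minPepLen <- 3
convertSpecialCharacters <- "_"

# removeNonAAs = TRUE: strip "." and "-" from the sequences
seqs <- gsub("[.-]", "", seqs)

# minPepLen: keep peptides at least this long
keep <- nchar(seqs) >= minPepLen
ids  <- ids[keep]
seqs <- seqs[keep]

# dropDuplicates = TRUE: keep only the first of each set of identical sequences
keep <- !duplicated(seqs)
ids  <- ids[keep]
seqs <- seqs[keep]

# convertSpecialCharacters: replace punctuation other than ".", "-" and "_"
ids <- gsub("(?![._-])[[:punct:]]", convertSpecialCharacters, ids, perl = TRUE)

print(data.frame(id = ids, seq = seqs))
```

In this toy input, "gene|2" falls below minPepLen and "gene.3" becomes a duplicate of "gene:1" once the non-amino-acid characters are stripped, so two entries remain.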

Details

parse_annotations assumes that you have a 'rawGenomeRepo' directory that contains a subdirectory for each genome to parse. These subdirectory names are given in "genomeDirs". So, if rawGenomeRepo = "~/Desktop/genomeRepo" and genomeDirs = c("human", "mouse"), then parse_annotations assumes there are two directories: ~/Desktop/genomeRepo/human and ~/Desktop/genomeRepo/mouse. Each of these directories must contain a gff3-formatted gene annotation and a fasta-formatted peptide annotation. These annotation files can be nested further within the subdirectories, but each must be uniquely findable with "gffString" and "faString". If multiple (or no) files match these strings, an error will be returned.
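
The expected layout can be checked before parsing. The following is a minimal sketch in base R, not GENESPACE's own code: it builds a throwaway repo under tempdir() with hypothetical file names and confirms that the default gffString and faString each match exactly one file per genome directory. The helper find_one() is an assumption introduced for illustration.

```r
# Default patterns copied from the Usage section
gffString <- "gff$|gff3$|gff3\\.gz$|gff\\.gz"
faString  <- "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz"

# Build a toy rawGenomeRepo with empty placeholder files (hypothetical names)
rawGenomeRepo <- file.path(tempdir(), "genomeRepo")
genomeDirs <- c("human", "mouse")
for (g in genomeDirs) {
  dir.create(file.path(rawGenomeRepo, g), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(rawGenomeRepo, g, paste0(g, ".gene.gff3.gz")))
  file.create(file.path(rawGenomeRepo, g, paste0(g, ".protein.faa.gz")))
}

# Hypothetical helper: return the single file whose name matches `pattern`,
# searching nested subdirectories too; stop if zero or several files match
find_one <- function(dir, pattern) {
  hits <- list.files(dir, pattern = pattern, recursive = TRUE, full.names = TRUE)
  if (length(hits) != 1) {
    stop(length(hits), " files match '", pattern, "' in ", dir)
  }
  hits
}

genomePaths <- file.path(rawGenomeRepo, genomeDirs)
gffFiles <- vapply(genomePaths, find_one, character(1), pattern = gffString)
faFiles  <- vapply(genomePaths, find_one, character(1), pattern = faString)

print(basename(gffFiles))
print(basename(faFiles))
```

If a directory held both a .gff3 and a .gff3.gz copy of the same annotation, find_one() would stop, which mirrors the requirement that each pattern be uniquely findable.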

Given differences in how gff3 and fasta headers are constructed, there are a number of parameters to choose how the files should be matched. Unless the files come from phytozome or NCBI, these need to be chosen manually. If you have differently named or formatted files, you can run parse_annotations several times with different parameters and paths to the files.

For each genomeDir, the pair of files is read in, parsed, matched, and then written to $genespaceWd/$genomeID/peptide and $genespaceWd/$genomeID/bed respectively. By default, the genomeID is the same as the genomeDir, but this can be customized.

Presets: In the case of "ncbi", this assumes that you have downloaded the 'translated_cds' peptide file and gene.gff3 annotation. Given these files, it uses the following parameters: gffIdColumn <- "gene"; headerEntryIndex <- 2; headerSep <- " "; gffStripText <- ""; headerStripText <- "gene=|\\[|\\]". Preset "ncbi" also builds a chrIdDictionary to rename sequence IDs based on entries labeled "chromosome" in the third gff3 column.

Phytozome presets: gffIdColumn <- "Name"; headerEntryIndex <- 4; headerSep <- " "; gffStripText <- ""; headerStripText <- "locus="
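
To see how the header-related preset values fit together, here is a minimal sketch in base R. It is not the package's internal implementation: extract_header_id() is a hypothetical helper, and the two headers are made-up examples written in the general style of phytozome and NCBI 'translated_cds' files. It assumes the header is split on headerSep, the field at headerEntryIndex is taken, and headerStripText is then removed with gsub.

```r
# Hypothetical helper mirroring headerSep, headerEntryIndex and headerStripText
extract_header_id <- function(header, headerEntryIndex, headerSep, headerStripText) {
  fields <- strsplit(header, headerSep, fixed = TRUE)[[1]]
  gsub(headerStripText, "", fields[headerEntryIndex])
}

# Made-up headers (without the leading ">"), for illustration only
phytozomeHeader <- "GeneA.1.p pacid=1 transcript=GeneA.1 locus=GeneA ID=GeneA.1.v1 annot-version=v1"
ncbiHeader <- "lcl|chr1_prot_XP_000001.1_1 [gene=geneA] [protein=hypothetical protein]"

# Phytozome preset values: 4th space-delimited field, strip "locus="
phytozomeId <- extract_header_id(
  phytozomeHeader,
  headerEntryIndex = 4, headerSep = " ", headerStripText = "locus="
)

# NCBI preset values: 2nd space-delimited field, strip "gene=" and the brackets
ncbiId <- extract_header_id(
  ncbiHeader,
  headerEntryIndex = 2, headerSep = " ", headerStripText = "gene=|\\[|\\]"
)

print(phytozomeId)
print(ncbiId)
```

For annotations from other sources, the same three values can be worked out by inspecting one header by hand and finding which field carries the ID that also appears under gffIdColumn in the gff3 attributes.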

Value

a data.table containing the file paths to the raw and parsed annotations.

Examples

## Not run: 
# coming soon

## End(Not run)

jtlovell/GENESPACE documentation built on Jan. 25, 2025, 6:39 a.m.
