View source: R/parse_annotations.R
parse_annotations | R Documentation |
parse_annotations
Peptide and gff3 gene annotation matching and
conversion to .bed format. Speeds up downstream compute and catches problems
with annotation files. This is NOT required for GENESPACE, but does help
get the input files in order. There are many other methods to convert
gff3 –> bed and match the names with fasta headers.
parse_annotations
parse gff into a bed format with one entry per
primary transcript, and ensure that the peptide fasta headers match the
name column
match_fasta2gff
engine for reading, parsing and writing annotation
files
parse_ncbi
a shortcut for parse_annotations(presets = "ncbi") to
maintain backwards compatibility with < v1.0.0.
parse_phytozome
a shortcut for parse_annotations(presets = "phytozome")
to maintain backwards compatibility with < v1.0.0.
parse_faHeader
function to maintain backwards compatibility
parse_annotations(
rawGenomeRepo,
genomeDirs,
genomeIDs = genomeDirs,
gffString = "gff$|gff3$|gff3\\.gz$|gff\\.gz",
faString = "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz",
genespaceWd,
minPepLen = 0,
dropDuplicates = FALSE,
removeNonAAs = TRUE,
presets = "none",
gffIdColumn = "Name",
headerEntryIndex = 4,
headerSep = " ",
gffStripText = "",
headerStripText = "locus=",
convertSpecialCharacters = "_",
chrIdDictionary = NULL,
troubleShoot = FALSE,
overwrite = FALSE
)
match_fasta2gff(
path2fasta,
path2gff,
genespaceWd,
genomeID,
presets,
gffIdColumn,
headerEntryIndex,
headerSep,
minPepLen,
dropDuplicates,
removeNonAAs,
gffStripText,
headerStripText,
chrIdDictionary,
convertSpecialCharacters,
troubleShoot
)
parse_ncbi(
rawGenomeRepo,
genomeDirs,
genomeIDs = genomeDirs,
gffString = "gff$|gff3$|gff3\\.gz$|gff\\.gz",
faString = "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz",
genespaceWd,
troubleShoot = FALSE,
...
)
parse_phytozome(
rawGenomeRepo,
genomeDirs,
genomeIDs = genomeDirs,
gffString = "gff$|gff3$|gff3\\.gz$|gff\\.gz",
faString = "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz",
genespaceWd,
troubleShoot = FALSE,
...
)
parse_faHeader(...)
rawGenomeRepo |
file path to the location of gff3 and fasta annotations |
genomeDirs |
character vector giving exact matches to subdirectories in rawGenomeRepo. |
genomeIDs |
character vector of length equal to genomeDirs. By default, takes values from genomeDirs, but, if specified, re-names the files accordingly. Useful if you want to shorten the names of genomes in your GENESPACE run. |
gffString |
regular expression of length 1 specifying the string to search for in rawGenomeRepo/genomeDirs that will exactly match the gff3-formatted annotation. Default is any text or .gz file ending in gff3 or gff. |
faString |
same as gffString but for the fasta-formatted peptide annotation. Default is any text or .gz file ending in fa, fasta or faa. |
genespaceWd |
file.path of length 1 specifying the GENESPACE working directory. Will make two subdirectories: /bed and /peptide for the parsed annotations |
minPepLen |
numeric, specifying the shortest peptide (in amino acid residues) to be kept |
dropDuplicates |
logical, should only one of a set of duplicated peptide sequences be kept? |
removeNonAAs |
logical, should "." and "-" characters be stripped from the amino acids? |
presets |
character string: "none", "phytozome" or "ncbi" which sets the below parameters to parse phytozome or ncbi-formatted annotations correctly. See details. |
gffIdColumn |
character, specifying the field name in the gff3 attributes column. |
headerEntryIndex |
integer specifying the field index in the fasta header which contains the gene ID information to match with the gff. |
headerSep |
character used as a field delimiter in the fasta header. |
gffStripText |
regular expression of length 1, passed to gsub, matching the text to remove from the gff ID. |
headerStripText |
like gffStripText, but for the fasta header |
convertSpecialCharacters |
Character string with a non-special character of length 1. Replaces special characters (punctuation other than ".", "-", and "_") if they are present in the gene IDs. |
chrIdDictionary |
a named vector where the names are the values in the first ("seqnames") gff3 column and the elements of the vector are the values to replace them with. |
troubleShoot |
logical, should the raw and parsed files be printed? |
overwrite |
logical, should existing files be overwritten? |
path2fasta |
deprecated, kept to maintain backwards compatibility |
path2gff |
deprecated, kept to maintain backwards compatibility |
genomeID |
single genomeID to consider |
... |
additional arguments passed on |
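The peptide-level arguments (removeNonAAs, minPepLen, dropDuplicates) can be pictured with a small base-R sketch. This is not the GENESPACE implementation: filter_peptides is a hypothetical helper, the gene IDs and sequences are toy data, and the order in which the three filters are applied is an assumption.

```r
# Toy peptide set; names are hypothetical gene IDs
peps <- c(geneA = "MKT.LLV-", geneB = "MKTLLV", geneC = "MA", geneD = "MSSPQRK")

# Hypothetical helper mirroring the documented arguments (assumed filter order)
filter_peptides <- function(peps, minPepLen = 0,
                            dropDuplicates = FALSE, removeNonAAs = TRUE) {
  if (removeNonAAs) {
    # strip "." and "-" characters; peps[] keeps the gene ID names
    peps[] <- gsub("[.-]", "", peps)
  }
  # keep peptides with at least minPepLen residues
  peps <- peps[nchar(peps) >= minPepLen]
  if (dropDuplicates) {
    # keep only the first of each set of identical sequences
    peps <- peps[!duplicated(peps)]
  }
  peps
}

kept <- filter_peptides(peps, minPepLen = 3, dropDuplicates = TRUE)
print(kept)
```

With the defaults (minPepLen = 0, dropDuplicates = FALSE), only the "." and "-" characters are removed and every peptide is retained.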
parse_annotations assumes that you have a 'rawGenomeRepo' directory that contains a subdirectory for each genome to parse. These subdirectory names are given in "genomeDirs". So, if rawGenomeRepo = "~/Desktop/genomeRepo" and genomeDirs = c("human", "mouse"), then parse_annotations assumes there are two directories: ~/Desktop/genomeRepo/human and ../mouse. Each of these directories must contain a gff3-formatted gene annotation and a fasta-formatted peptide annotation. These annotation files can be further nested in the subdirectories, but each must be named with a uniquely findable "gffString" and "faString". If multiple (or no) files match these strings, an error will be returned.
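The expected layout and the "uniquely findable" rule can be checked before running the parser. The sketch below builds a throwaway repository in a temporary directory and searches each genome directory with the default gffString and faString. The file names and the find_one helper are hypothetical, and the use of list.files is an assumption about how such a search might be done, not a description of the package internals.

```r
# Throwaway repo with one nested annotation directory per genome (toy layout)
rawGenomeRepo <- tempfile("genomeRepo")
genomeDirs <- c("human", "mouse")
gffString <- "gff$|gff3$|gff3\\.gz$|gff\\.gz"
faString <- "fa$|fasta$|faa$|fa\\.gz$|fasta\\.gz|faa\\.gz"

for (g in genomeDirs) {
  d <- file.path(rawGenomeRepo, g, "annotation")
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(d, paste0(g, ".gene.gff3")))    # hypothetical file name
  file.create(file.path(d, paste0(g, ".protein.faa")))  # hypothetical file name
}

# Hypothetical helper: require exactly one file matching the pattern
find_one <- function(dir, pattern) {
  hits <- list.files(dir, pattern = pattern, recursive = TRUE, full.names = TRUE)
  if (length(hits) != 1) {
    stop("expected exactly 1 match for '", pattern, "' in ", dir,
         ", found ", length(hits))
  }
  hits
}

genomePaths <- file.path(rawGenomeRepo, genomeDirs)
found <- data.frame(
  genome = genomeDirs,
  gff = vapply(genomePaths, find_one, character(1), pattern = gffString),
  fasta = vapply(genomePaths, find_one, character(1), pattern = faString),
  stringsAsFactors = FALSE
)
print(basename(found$gff))
print(basename(found$fasta))
```

If a genome directory held two gff3 files, find_one would stop with an error, which is the situation a narrower gffString is meant to resolve.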
Given differences in how gff3 and fasta headers are constructed, there are a number of parameters to choose how the files should be matched. Unless the files come from phytozome or NCBI, these need to be chosen manually. If you have differently named or formatted files, you can run parse_annotations several times with different parameters and paths to the files.
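On the gff side, the matching parameters act on the attributes column. The following base-R sketch shows one way gffIdColumn, gffStripText and convertSpecialCharacters could combine to produce a gene ID. The extract_gff_id helper and the attribute strings are hypothetical, and the regular expressions are assumptions rather than the ones GENESPACE uses.

```r
# Hypothetical helper: pull one field out of gff3 attribute strings
# ("key=value;key=value") and clean it. Assumes gffIdColumn has no regex
# metacharacters.
extract_gff_id <- function(attributes, gffIdColumn,
                           gffStripText = "", convertSpecialCharacters = "_") {
  pattern <- paste0("^(.*;)?", gffIdColumn, "=([^;]*).*$")
  id <- ifelse(grepl(pattern, attributes),
               sub(pattern, "\\2", attributes),
               NA_character_)
  if (nzchar(gffStripText)) {
    # remove unwanted text from the ID
    id <- gsub(gffStripText, "", id)
  }
  # replace punctuation other than ".", "-" and "_"
  gsub("(?![._-])[[:punct:]]", convertSpecialCharacters, id, perl = TRUE)
}

# Toy attribute strings (not from a real annotation)
attrs <- c("ID=g1.v1;Name=gene:1|a", "ID=g2.v1;Name=gene2.1-b_c")

byName <- extract_gff_id(attrs, gffIdColumn = "Name")
byId <- extract_gff_id(attrs, gffIdColumn = "ID", gffStripText = "\\.v1$")
print(byName)
print(byId)
```

Trying a few attribute fields this way on the first lines of a gff3 file is a quick check of which gffIdColumn yields IDs that also appear in the fasta headers.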
For each genomeDir, the pair of files is read in, parsed, matched, then written to $genespaceWd/$genomeID/peptide and $genespaceWd/$genomeID/bed respectively. By default, the genomeID is the same as the genomeDir, but this can be customized.
Presets: In the case of "ncbi", this assumes that you have downloaded the 'translated_cds' peptide file and gene.gff3 annotation. Given these files, it uses the following parameters: gffIdColumn <- "gene"; headerEntryIndex <- 2; headerSep <- " "; gffStripText <- ""; headerStripText <- "gene=|\\[|\\]". Preset "ncbi" also builds a chrIdDictionary to rename sequence IDs based on entries labeled "chromosome" in the third gff3 column.
Phytozome presets: gffIdColumn <- "Name"; headerEntryIndex <- 4; headerSep <- " "; gffStripText <- ""; headerStripText <- "locus="
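To see what the header-side preset values do, the sketch below applies headerSep, headerEntryIndex and headerStripText to two toy fasta headers laid out in the phytozome and NCBI translated_cds styles. parse_header_id is a hypothetical helper, the headers are made up, and treating headerSep as a literal string is an assumption.

```r
# Hypothetical helper: split the header, take one field, strip text from it
parse_header_id <- function(header, headerEntryIndex, headerSep, headerStripText) {
  header <- sub("^>", "", header)  # drop the fasta ">" prefix
  fields <- strsplit(header, headerSep, fixed = TRUE)
  id <- vapply(fields, function(x) x[headerEntryIndex], character(1))
  gsub(headerStripText, "", id)
}

# Toy headers (made up, following the two layouts)
phytoHeader <- ">gene1.1.p pacid=1 transcript=gene1.1 locus=gene1 ID=gene1.1.v1 annot-version=v1"
ncbiHeader <- ">lcl|chr1_prot_XP_1.1_1 [gene=geneA] [protein=hypothetical protein]"

# Phytozome preset values
phytoId <- parse_header_id(phytoHeader, headerEntryIndex = 4,
                           headerSep = " ", headerStripText = "locus=")

# NCBI preset values
ncbiId <- parse_header_id(ncbiHeader, headerEntryIndex = 2,
                          headerSep = " ", headerStripText = "gene=|\\[|\\]")

print(phytoId)
print(ncbiId)
```

For annotations from other sources, running a helper like this on a handful of headers is a cheap way to pick headerEntryIndex and headerStripText before calling parse_annotations with presets = "none".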
a data.table containing the file paths to the raw and parsed annotations.
## Not run:
# coming soon
## End(Not run)