init_genespace: Find files and directories for a GENESPACE run

View source: R/init_genespace.R

init_genespace    R Documentation

Find files and directories for a GENESPACE run

Description

init_genespace searches for the desired genome files in the raw genome repo directory.

Usage

init_genespace(
  wd,
  genomeIDs = NULL,
  ploidy = 1,
  ignoreTheseGenomes = NULL,
  path2orthofinder = "orthofinder",
  path2diamond = "diamond",
  path2mcscanx = "MCScanX",
  onewayBlast = FALSE,
  orthofinderInBlk = any(ploidy > 1),
  useHOGs = TRUE,
  rawOrthofinderDir = NA,
  diamondUltraSens = FALSE,
  nCores = min(c(detectCores()/2, 16)),
  maxOgPlaces = 8,
  blkSize = 5,
  nGaps = 5,
  blkRadius = blkSize * 5,
  synBuff = 100,
  arrayJump = ceiling(synBuff/2),
  onlyOgAnchors = TRUE,
  nSecondaryHits = 0,
  nGapsSecond = nGaps * 2,
  blkSizeSecond = blkSize,
  blkRadiusSecond = blkRadius,
  onlyOgAnchorsSelf = TRUE,
  onlyOgAnchorsSecond = FALSE,
  maskBuffer = 500,
  onlySameChrs = FALSE,
  dotplots = "check",
  outgroup = ignoreTheseGenomes,
  nSecondHits = nSecondaryHits,
  synBuffSecond = NULL,
  orthofinderMethod = NULL,
  speciesIDs = NULL,
  minPepLen = NULL,
  versionIDs = NULL,
  rawGenomeDir = NULL,
  diamondMode = NULL,
  overwrite = NULL,
  gffString = NULL,
  pepString = NULL,
  verbose = NULL
)
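
Several defaults in the signature above are expressions of other arguments rather than constants. As a rough illustration only, the base-R sketch below mirrors those expressions so the resulting values can be inspected. `derived_defaults` is a hypothetical helper, not part of GENESPACE, and it covers only the defaults that depend on `ploidy`, `blkSize`, `nGaps` and `synBuff`.

```r
# Hypothetical helper (not a GENESPACE function): reproduces the default
# expressions shown in the Usage block using plain base R.
derived_defaults <- function(ploidy = 1, blkSize = 5, nGaps = 5, synBuff = 100) {
  blkRadius <- blkSize * 5
  list(
    orthofinderInBlk = any(ploidy > 1),      # TRUE if any genome has ploidy > 1
    blkRadius        = blkRadius,            # blkSize * 5
    arrayJump        = ceiling(synBuff / 2), # half of synBuff, rounded up
    nGapsSecond      = nGaps * 2,
    blkSizeSecond    = blkSize,
    blkRadiusSecond  = blkRadius
  )
}

# All-haploid run with the documented defaults
str(derived_defaults())

# One polyploid genome and a larger block size
str(derived_defaults(ploidy = c(1, 2), blkSize = 10))
```

Changing `blkSize`, `nGaps` or `synBuff` therefore also changes the dependent arguments unless those are set explicitly.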

Arguments

wd

file.path where the analysis will be run

genomeIDs

character vector of length > 1, matching length of speciesIDs, versions and ploidy. Specifies the name to assign to each genome. This vector must be unique and can be any string that begins with a letter (a-z, A-Z) and is alphanumeric. '.' and '_' are allowed as long as they are not the first character.
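
The naming rule above can be checked before a run. The sketch below is one interpretation of that rule in base R; `is_valid_genome_id` and `check_genome_ids` are hypothetical names, and the regular expression is an assumption based on the description, not code taken from GENESPACE.

```r
# Hypothetical validators (not GENESPACE functions).
# An ID must start with a letter and may then contain letters, digits,
# '.' or '_' only.
is_valid_genome_id <- function(ids) {
  grepl("^[A-Za-z][A-Za-z0-9._]*$", ids)
}

# The full vector must be character, longer than 1, unique, and valid.
check_genome_ids <- function(ids) {
  is.character(ids) &&
    length(ids) > 1 &&
    anyDuplicated(ids) == 0 &&
    all(is_valid_genome_id(ids))
}

is_valid_genome_id(c("human", "Zm.B73_v5", "_bad", "1chicken", "has space"))
check_genome_ids(c("human", "mouse"))
check_genome_ids(c("human", "human"))
```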

ploidy

integer vector specifying the ploidy of each genome assembly. This is usually half of the actual ploidy; for example, an inbred diploid is usually represented by a haploid genome assembly.

ignoreTheseGenomes

character string matching one of the genomeIDs that will be used in the orthofinder -og run but not in the synteny search. Suggested to ensure that there is an outgroup that predates any WGD that the user would like to study.

path2orthofinder

character string coercible to a file path that points to the orthofinder executable. If orthofinder is in the path, specify with "orthofinder"

path2diamond

character string coercible to a file path that points to the diamond executable. If diamond is in the path, specify with "diamond"

path2mcscanx

see path2orthofinder, except to the mcscanx directory. This must contain the MCScanX_h folder.
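
It can help to confirm that the three external programs are reachable before starting a run. The sketch below uses only base R; `exe_found` and `mcscanx_dir_ok` are hypothetical helpers, not GENESPACE functions, and they are not necessarily the checks that init_genespace performs internally.

```r
# Hypothetical helpers (not GENESPACE functions).

# Sys.which() returns "" when a program cannot be located, so a non-empty
# result means the executable was found on the PATH (or at the given path).
exe_found <- function(path) {
  unname(nzchar(Sys.which(path)))
}

# path2mcscanx is a directory that must contain MCScanX_h.
# file.exists() is TRUE for either a file or a directory of that name.
mcscanx_dir_ok <- function(dir) {
  dir.exists(dir) && file.exists(file.path(dir, "MCScanX_h"))
}

exe_found("orthofinder")
exe_found("diamond")
mcscanx_dir_ok("MCScanX")
```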

onewayBlast

logical of length 1, specifying whether one-way blasts should be run via 'orthofinder -1 ...'. This replaces orthofinderMethod = "fast", but uses 'diamond2 --more-sensitive', whereas the previous method used the --fast specification. This gives substantial speed improvements in large runs with little loss of fidelity.

orthofinderInBlk

logical, should orthofinder be re-run within syntenic regions? Highly recommended for polyploids. When called, HOGs within blocks replace global HOGs or OGs. See useHOGs for more information.

useHOGs

logical of length 1 or NA, specifying whether to use phylogenetically hierarchical orthogroups (HOGs) or raw orthogroups. By default (NA), this is decided internally by 'annotate_bed', where the orthogroup type with members that best match the genome ploidy is used. In general, HOGs should be used for any run where all genomes are haploid, since they have been shown to have ~20 However, in cases where we want both homeologs, HOGs may be problematic and probably should not be used for syntenic region calculations. That said, HOGs are always used for within-block orthofinder, which is also the default when any genomes have ploidy > 1. So, the only way to use the deprecated orthogroups.tsv for pan-genome calculation is to set useHOGs = FALSE AND orthofinderInBlk = FALSE.

rawOrthofinderDir

file.path of length 1, specifying the location of an existing raw orthofinder run. Defaults to $wd/orthofinder, but can be any path pointing to a valid orthofinder run. If it is not a valid path, this is ignored.

diamondUltraSens

logical of length 1, specifying whether the diamond mode run within orthofinder should be --more-sensitive (default, FALSE) or --ultra-sensitive.

nCores

integer of length 1 specifying the number of parallel processes to run

maxOgPlaces

integer of length 1, specifying the max number of unique placements that an orthogroup can have before being excluded from synteny

blkSize

integer of length 1, specifying the -s param to mcscanx

nGaps

integer of length 1, specifying the -m param to mcscanx for the primary MCScanX run. This acts on the results from the initial MCScanX run.

blkRadius

integer of length 1, specifying the search radius in 2d clustering to assign hits to the same block. This is a sensitive parameter as smaller values will result in more blocks, gaps and SV. Typically using 2x or greater blkSize is fine.

synBuff

Numeric > 0, specifying the distance from an anchor to consider a hit syntenic. This parameter is also used to limit the search radius in dbscan-based blk calculation. Larger values will return larger tandem arrays but also may permit inclusion of spurious non-syntenic networks

arrayJump

integer of length 1, specifying the maximum distance in gene rank order between two genes in the same tandem array
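
To make the role of arrayJump concrete, the sketch below groups genes into tandem arrays by their rank order. It is a simplified assumption about how such a threshold could be applied, not the GENESPACE implementation; `split_arrays` is a hypothetical helper, and its input is assumed to be the rank orders of same-orthogroup genes on one chromosome.

```r
# Hypothetical helper (not a GENESPACE function).
# 'ord' holds gene rank orders for genes of one orthogroup on one chromosome.
# After sorting, a new array starts whenever the gap to the previous gene
# exceeds arrayJump. The returned array IDs correspond to sort(ord).
split_arrays <- function(ord, arrayJump) {
  ord <- sort(ord)
  cumsum(c(TRUE, diff(ord) > arrayJump))
}

# Gaps of 185 and 197 exceed 50, so three arrays are formed
split_arrays(c(10, 12, 15, 200, 203, 400), arrayJump = 50)
# 1 1 1 2 2 3
```

With a larger arrayJump, more distant genes would be merged into the same array.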

onlyOgAnchors

logical, should only hits in orthogroups be considered for anchors?

nSecondaryHits

integer of length 1, specifying the number of secondary hits to look for after masking the primary syntenic regions

nGapsSecond

see nGaps, but passed to secondary hits after masking primary hits.

blkSizeSecond

see blkSize, but passed to the secondary scan if nSecondaryHits > 0.

blkRadiusSecond

see blkRadius, but passed to the secondary scan if nSecondaryHits > 0.

onlyOgAnchorsSelf

logical, should only hits in orthogroups be considered for anchors in self-hits (particularly in polyploids)?

onlyOgAnchorsSecond

logical, should only hits in orthogroups be considered for anchors in secondary blocks?

maskBuffer

numeric (default = 500), the minimum distance from an existing block at which a secondary block (or a homeolog block within a polyploid genome) can be created.

onlySameChrs

logical - should synteny be only considered between chromosomes with the same name?

dotplots

character string either "always", "never", or "check". Default (check) only writes a dotplot if there are < 10k unique chromosome combinations (facets). "always" means that dotplots are made regardless of facet numbers, which can be very slow in some instances. "never" is by far the fastest method, but also never produces dotplots.
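
The three dotplots settings amount to a small decision rule, which the sketch below spells out. `should_plot` is a hypothetical helper, not a GENESPACE function. The 10,000-facet cutoff comes from the description above, while counting facets as unique pairs of chromosome names is an assumption.

```r
# Hypothetical helper (not a GENESPACE function).
# chr1 and chr2 are the chromosome names of each hit in the two genomes.
should_plot <- function(dotplots, chr1, chr2, maxFacets = 10000) {
  dotplots <- match.arg(dotplots, c("always", "never", "check"))
  # Assumption: one facet per unique (chr1, chr2) combination
  nFacets <- nrow(unique(data.frame(chr1, chr2)))
  switch(dotplots,
         always = TRUE,
         never  = FALSE,
         check  = nFacets < maxFacets)
}

should_plot("check", chr1 = c("a", "a", "b"), chr2 = c("x", "x", "y"))
should_plot("never", chr1 = c("a", "a", "b"), chr2 = c("x", "x", "y"))
```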

outgroup

deprecated in V1. See ignoreTheseGenomes.

nSecondHits

integer of length 1, specifying the number of blast hits to include after masking.

synBuffSecond

see synBuff. Applied only to synteny construction of secondary hits.

orthofinderMethod

deprecated in V1. See onewayBlast.

speciesIDs

deprecated in V1. See 'parse_annotations'.

minPepLen

deprecated in V1. All genes in the peptide fasta are used.

versionIDs

deprecated in V1. See 'parse_annotations'.

rawGenomeDir

deprecated in V1. See 'parse_annotations'.

diamondMode

deprecated in V1. 'fast' mode is no longer available. --ultra-sensitive is available via diamondUltraSens.

overwrite

deprecated in V1. Results are never over-written.

gffString

deprecated in V1. See 'parse_annotations'.

pepString

deprecated in V1. See 'parse_annotations'.

verbose

deprecated in V1. All updates are printed to the console.

Details

Simple directory parser to find and check the paths to all annotation and assembly files.

Value

A list containing paths to the raw files. If a file is not found, its path is returned as NULL and a warning is printed.

Examples

## Not run: 
# coming soon

## End(Not run)


jtlovell/GENESPACE documentation built on Jan. 25, 2025, 6:39 a.m.