findPhysLink: Testing for associations and physical linkage between every...

View source: R/func__findPhysLink.R

findPhysLink    R Documentation

Testing for associations and physical linkage between every pair of bacterial genes at the allele level

Description

This function has two behaviours. It always tests for associations between alleles of bacterial genes with univariate linear mixed models. In addition, when allelic physical distances are provided, it determines evidence of physical linkage between the alleles. This is the main function of the package GeneMates. For this package, the term "sample" refers to either a bacterial isolate or a strain.

Dependent libraries: data.table, parallel, ape, phytools

Usage

findPhysLink(
  assoc.out = NULL,
  snps = NULL,
  snps.delim = ",",
  pos.col = "Pos",
  ref.col = "Ref",
  min.mac = 1,
  genetic.pam = NULL,
  genetic.pam.delim = "\t",
  genes.excl = NULL,
  allelic.pam = NULL,
  allelic.pam.delim = "\t",
  min.count = 2,
  min.co = 2,
  mapping = NULL,
  phys.dists = NULL,
  dist.delim = "\t",
  max.node.num = NULL,
  max.dist = NULL,
  ingroup = NULL,
  outliers = NULL,
  ref = NULL,
  tree = NULL,
  sample.dists = NULL,
  d.qs = c(0, 0.25, 0.5, 0.75, 1),
  max.p = 0.05,
  max.range = 2000,
  min.pIBD = 0.9,
  output.dir = "output",
  prefix = NULL,
  gemma.path = "gemma",
  n.cores = -1,
  save.stages = TRUE,
  del.temp = TRUE,
  skip = TRUE
)

Arguments

assoc.out

A previous output of findPhysLink from a run that performed association tests only. This output is equivalent to the output of the function lmm. For convenience, the list may omit the large element snps. This parameter is useful when users need to incorporate distance information into the results of an association analysis. Other arguments, such as snps, snps.delim, ..., allelic.pam and genetic.pam, are not used when this argument is valid.

snps

Core-genome SNPs used for estimating the relatedness matrix. Valid values: a complete path to a SNP table, or an un-centred, encoded, biallelic SNP matrix G

snps.delim

(optional) Delimiter of fields in the SNP table (pam.delim and dist.delim are defined similarly)

pos.col

(optional) An integer (column index) or a string (column name) specifying which column contains SNP positions

ref.col

(optional) A string specifying the column for SNPs of the reference genome.

min.mac

(optional) An integer specifying the minimal number of times that the minor allele of every biallelic SNP must occur across all isolates. SNPs failing this criterion are removed from the analysis.
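The minor-allele-count filter can be pictured with a minimal base-R sketch. It assumes a 0/1-encoded biallelic SNP matrix with samples in rows and SNPs in columns; the helper filter_by_mac and the toy matrix G are hypothetical illustrations, not GeneMates internals.

```r
# Hypothetical helper (not part of GeneMates): keep SNPs whose minor allele
# count (MAC) is at least min.mac.
# Assumes G is a 0/1-encoded biallelic SNP matrix (rows: samples, columns: SNPs).
filter_by_mac <- function(G, min.mac = 1) {
  alt.count <- colSums(G)                      # samples carrying allele 1
  mac <- pmin(alt.count, nrow(G) - alt.count)  # count of the rarer allele
  G[, mac >= min.mac, drop = FALSE]
}

# Toy data: four samples, four SNPs (the matrix is filled column by column)
G <- matrix(c(0, 0, 0, 0,   # snp1: monomorphic, MAC = 0
              1, 0, 0, 0,   # snp2: MAC = 1
              1, 1, 0, 0,   # snp3: MAC = 2
              1, 1, 1, 0),  # snp4: MAC = 1
            nrow = 4,
            dimnames = list(paste0("s", 1:4), paste0("snp", 1:4)))

kept1 <- colnames(filter_by_mac(G, min.mac = 1))
kept2 <- colnames(filter_by_mac(G, min.mac = 2))
```

With min.mac = 1, only the monomorphic site is dropped; raising the threshold to 2 also removes singleton SNPs.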

genetic.pam

A presence/absence matrix of genes. It may be a compiled table from SRST2.

genetic.pam.delim

(optional) A delimiter character in the genetic PAM. Default: tab.

genes.excl

(optional) Genes to be excluded from PAMs. For example, genes.excl = c("AmpH_Bla", "OqxBgb_Flq", "OqxA_Flq", "SHV.OKP.LEN_Bla").

allelic.pam

A presence/absence matrix of alleles. The matrix may be a compiled table of SRST2 results.

allelic.pam.delim

(optional) A delimiter character in the allelic PAM. Default: tab.

min.count

(optional) The minimum count of an allele/gene in the current data set for it to be included in the analysis.

min.co

(optional) The minimal number of allelic co-occurrence events.
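A rough sketch of how min.count and min.co could act on a presence/absence matrix is given here in base R. It assumes that a co-occurrence event means both alleles being present in the same sample, which is one plausible reading and not confirmed by this page; the toy matrix pam and the resulting data frame are hypothetical.

```r
# Hypothetical toy allelic PAM: rows are samples, columns are alleles (1 = present).
pam <- matrix(c(1, 1, 1, 0,   # a1
                1, 1, 0, 0,   # a2
                0, 1, 0, 1),  # a3
              nrow = 4,
              dimnames = list(paste0("s", 1:4), c("a1", "a2", "a3")))

min.count <- 2
min.co <- 2

# Drop alleles observed in fewer than min.count samples
pam <- pam[, colSums(pam) >= min.count, drop = FALSE]

# t(pam) %*% pam counts, for every pair of alleles, the samples carrying both
co <- crossprod(pam)

# Keep each unordered pair once (upper triangle) if it co-occurs often enough
idx <- which(co >= min.co & upper.tri(co), arr.ind = TRUE)
pairs <- data.frame(allele1 = rownames(co)[idx[, "row"]],
                    allele2 = colnames(co)[idx[, "col"]],
                    n.co = co[idx],
                    stringsAsFactors = FALSE)
print(pairs)
```

In this toy example only the pair a1/a2 is retained, because it is the only pair present together in at least two samples.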

mapping

(optional) A data frame mapping alleles to genes, patterns, etc., which equals the "mapping" element of findPhysLink's output list. This argument is only used when a user reruns a previous analysis.

phys.dists

A table of physical distances between targets. GeneMates matches sample names between this table and the SNP matrix, so it is fine for phys.dists to contain extra or fewer samples.

dist.delim

The delimiter character in the table of physical distances. Default: tab.

max.node.num

(optional) An integer specifying the maximal number of nodes per path for which distance measurements are trusted. Set max.node.num = NULL to turn off this filter of distance measurements.

max.dist

(optional) An inclusive upper bound for filtering distance measurements. Measurements above this threshold will be ignored. Set it to NULL to turn off this filter.
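The two distance filters, max.node.num and max.dist, can be sketched together in base R. The column names node_number and distance and the helper filter_dists are assumptions made for illustration; the real layout of the distance table is not described on this page.

```r
# Hypothetical helper (not part of GeneMates): apply the two optional filters.
# Assumes ds has the columns 'node_number' and 'distance' (names are made up here).
filter_dists <- function(ds, max.node.num = NULL, max.dist = NULL) {
  keep <- rep(TRUE, nrow(ds))
  if (!is.null(max.node.num)) {
    keep <- keep & ds$node_number <= max.node.num
  }
  if (!is.null(max.dist)) {
    keep <- keep & ds$distance <= max.dist   # inclusive upper bound
  }
  ds[keep, , drop = FALSE]
}

ds <- data.frame(sample = c("s1", "s2", "s3", "s4"),
                 distance = c(1200, 2.5e6, 3.0e6, 800),
                 node_number = c(1, 2, 1, 5))

both <- filter_dists(ds, max.node.num = 2, max.dist = 2.5e6)
none <- filter_dists(ds)  # NULL turns each filter off
```

Here s3 is dropped for exceeding max.dist and s4 for spanning too many nodes, while s2 survives because the bound is inclusive.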

ingroup

(optional) A vector of characters for names of isolates to be analysed. Isolates may be sorted, such as according to the phylogeny. The function includes all isolates by default.

outliers

(optional) A vector of characters for isolate/strain names to be excluded from snps, pam and ds

ref

(optional) A new name for the reference genome. The column name specified by ref.col in the SNP matrix will be replaced with this argument.

tree

(optional) A path to a tree file or a phylo object for a tree of all samples. The format of the tree file must be compatible with the read.tree function in the ape package.

sample.dists

(optional) A numeric matrix of distances between samples. The distances can be Euclidean distances between projections, phylogenetic tip distances or SNP distances (the number of SNPs between any two samples). A matrix of Euclidean distances will be computed for projections of samples if this option is left NULL.
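As an illustration of the expected shape of sample.dists, the following base-R sketch builds a symmetric matrix of Euclidean distances from sample projections. The projection matrix proj is hypothetical toy data, and how GeneMates derives projections internally is not shown here.

```r
# Hypothetical projections: one row per sample, one column per axis.
proj <- matrix(c(0, 0,
                 3, 4,
                 6, 8),
               ncol = 2, byrow = TRUE,
               dimnames = list(c("s1", "s2", "s3"), c("PC1", "PC2")))

# dist() returns a compact 'dist' object; as.matrix() expands it into a
# symmetric numeric matrix with sample names on both dimensions.
sample.dists <- as.matrix(dist(proj, method = "euclidean"))
print(sample.dists)
```

For phylogenetic tip distances, the cophenetic method that ape provides for phylo objects yields a matrix of the same shape.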

d.qs

(optional) Quantile probabilities for allelic distance measurements. Default: the minimum (0), first quartile (0.25), median (0.5), third quartile (0.75) and maximum (1).
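The default probabilities correspond to a five-number summary. A small base-R sketch, using hypothetical distances and R's default quantile algorithm (which may differ from what GeneMates uses internally):

```r
d.qs <- c(0, 0.25, 0.5, 0.75, 1)

# Hypothetical physical distances (bp) measured for one pair of alleles
d <- c(120, 150, 150, 980, 2100)

# Summarise the distances at the requested probabilities
q <- quantile(d, probs = d.qs, names = FALSE)
names(q) <- c("min", "q25", "median", "q75", "max")
print(q)
```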

max.p

(optional) The upper bound of P values to determine significance. P <= max.p will be called significant.

max.range

An integer specifying the maximum range (in bp) of in-group allelic physical distances for the distances to be called consistent.
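One way to read this criterion is sketched here in base R, assuming that "range" means the difference between the largest and smallest in-group distance. The helper is_consistent is hypothetical, and the actual rule in GeneMates may involve further conditions.

```r
# Hypothetical helper (not part of GeneMates): distances are called consistent
# when their spread (max - min) does not exceed max.range.
is_consistent <- function(d, max.range = 2000) {
  diff(range(d)) <= max.range
}

tight <- is_consistent(c(1000, 1500, 2800))  # spread of 1800 bp
loose <- is_consistent(c(100, 5000))         # spread of 4900 bp
```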

min.pIBD

(optional) Minimum probability of the root of a minimum inclusive clade to display a positive binary trait, such as having a pair of alleles co-occurring or a specific allelic physical distance. Default: 0.9 (90%).

output.dir

(optional) Path of the output directory. A relative path "output" under the current working directory is recommended because GEMMA always creates a directory named output to store its outputs. Otherwise, you will end up with two output directories: one for your results and the other for GEMMA's.

prefix

(optional) A prefix for the names of all output files.

gemma.path

Path to GEMMA. No forward slash should be attached at the end of the path.

n.cores

Number of cores used to run GEMMA in parallel where possible.
-1: automatically detect the number of available cores N, but use N - 1 cores (recommended).
0: automatically detect the number of available cores and use all of them. Be careful when the current R session is not running through SLURM.
>= 1: use the number of cores as specified. n.cores is reset to the maximal number of available cores N when n.cores > N.
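The rules for n.cores can be summarised in a short sketch. The helper resolve_cores is hypothetical and only mirrors the documented behaviour; the lower bound of one core and the fallback for an undetectable core count are assumptions added so that the sketch is safe on any machine.

```r
# Hypothetical helper (not part of GeneMates) mirroring the documented rules.
# 'available' defaults to the detected core count and can be overridden for testing.
resolve_cores <- function(n.cores = -1, available = parallel::detectCores()) {
  if (is.na(available)) {
    available <- 1                 # assumption: fall back to a single core
  }
  if (n.cores == -1) {
    return(max(1, available - 1))  # leave one core free (assumed floor of 1)
  }
  if (n.cores == 0) {
    return(available)              # use every detected core
  }
  min(n.cores, available)          # cap the request at the available cores
}

r.auto <- resolve_cores(-1, available = 8)
r.all  <- resolve_cores(0, available = 8)
r.few  <- resolve_cores(4, available = 8)
r.over <- resolve_cores(16, available = 8)
```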

save.stages

(optional) Whether to turn on stage control or not. Turning it on is recommended when you are not sure whether the pipeline will finish smoothly.

del.temp

(optional) Whether to delete temporary files or not.

skip

(optional) Whether to avoid overwriting existing output files.

Author(s)

Yu Wan, wanyuac@126.com

Examples

time.start <- Sys.time()
assoc <- findPhysLink(
  snps = "input/noPhage_snps_1outgroup_var_regionFiltered_cons1.csv",
  snps.delim = ",", pos.col = "Pos",
  allelic.pam = "input/allele_paMatrix_noHash_filtered.txt",
  genetic.pam = "input/modified_allele_matrix_noHash_filtered.txt",
  genes.excl = c("AMPH_Ecoli_Bla", "AmpC1_Ecoli_Bla", "AmpC2_Ecoli_Bla", "MrdA_Bla"),
  phys.dists = "input/merged_dists_noHash.tsv", max.node.num = 2,
  max.dist = 2.5e6, max.range = 2000, min.pIBD = 0.9,
  ingroup = NULL, outliers = "Outgroup",
  min.mac = 1, min.co = 2, max.p = 0.05,
  output.dir = "output", prefix = "Ec",
  gemma.path = "~/apps/gemma", n.cores = 8
)
time.end <- Sys.time()
print(time.end - time.start)


wanyuac/GeneMates documentation built on Aug. 12, 2022, 7:37 a.m.