GeneMates is an R package implementing a network approach to identify horizontal gene co-transfer (HGcoT) between bacteria using whole-genome sequencing (WGS) data. It is particularly useful for investigating intra-species HGcoT, where presence-absence status of acquired genes is usually confounded by bacterial population structure due to clonal reproduction.
Citation
We would appreciate your citation if you find GeneMates or its helper scripts helpful to your project:
This project was supported by the Department of Biochemistry and Molecular Biology, University of Melbourne, Victoria, Australia.
The latest stable release is v0.2.2 (21 March 2020). You may find previous versions on the Releases page.
There are two approaches to install this package. Take GeneMates v0.2.2 for example. Assuming that we are going to install the package under a user-specified directory Lib:
R
install.packages(pkgs = "GeneMates_0.2.2.tar.gz", lib = "Lib")
bash
The program Rscript should be accessible as a command. Namely, the path of R should be added to $PATH before hand.
./install_GeneMates.sh GeneMates_0.2.2.tar.gz Lib
Since dependent R packages are not maintained by the GeneMates team and R may not install them automatically as you install GeneMates, you may follow manuals of these packages to complete the installation.
We have developed the following scripts to help users to prepare input data for GeneMates as well as to validate results:
Function findPhysLink is a pivotal function and a major user interface of this package. Here we show a typical procedure for the usage of GeneMates. We assume this procedure is run on a computing server.
setwd("Analysis")
library(GeneMates)
tr <- read.tree("rooted.tree") # a maximum-likelihood tree
assoc <- findPhysLink(snps = "Analysis/snps.csv", snps.delim = ",", pos.col = "Pos", ref.pos = "Ref", min.mac = 1, genetic.pam = "gpam.tsv", genetic.pam.delim = "\t", allelic.pam = "apam.tsv", allelic.pam.delim = "\t", mapping = NULL, min.count = 15, phys.dists = "prioritised_dists.tsv", dist.delim = "\t", max.node.num = 2, max.dist = 250e3, ref = "ref", tree = tr, min.co = 2, d.qs = c(0, 0.25, 0.5, 0.75, 1), max.p = 0.05, max.range = 2000, min.pIBD = 0.9, output.dir = "Output", prefix = "demo", gemma.path = "~/Apps/gemma", n.cores = 8, save.stages = TRUE, del.temp = FALSE, skip = TRUE)
snps <- assoc[["snps"]] # a large list
saveRDS(snps, file = "Out/snps.rds") # Analysis/Out/snps.rds
assoc <- assoc[-which(names(assoc) == "snps")]
saveRDS(assoc, file = "Out/assoc.rds") # Analysis/Out/assoc.rds
The element "snps" in the result list is usually too large to be loaded to an R session when the sample size or SNP number is large. Therefore we recommend to save the result list in two files.
This section illustrates organisation of public functions in GeneMates.
findPhysLink: the main function of GeneMates. It has the following subordinate functions.
The following functions analyse the output network of findPhysLink.
GeneMates comprises functions analysing spatial and temporal distributions of bacterial isolates and alleles of genes of interest.
Functions under this category are developed for helping users to extract and inspect specific aspects of results from findPhysLink.
GeneMates takes as input four kinds of data for detection of HGcoT. A function is created for importing each kind of data. This section explain the usage of these four functions.
Function importCoreGenomeSNPs reads a cgSNP table, encodes SNP genotypes, extracts biallelic SNPs and performs zero-centring of encoded SNP genotypes. The behaviour of this function is similar to the function get_SNP_data in R package BugWAS (see BUGWAS_modular.R). The function importCoreGenomeSNPs expects the SNP table to follow the output format of the script parseSNPtable.py in a read-mapping pipeline RedDog. Accordingly, the SNP table should be stored as a CSV file by default and an example of its structure is shown as follows.
| Pos,Ref,Isolate1,Isolate2,Isolate3,... | | -------------------------------------- | | 10,A,A,A,A,... | | 21,C,C,C,C,... | | 25,C,C,C,T,... | | ... |
Here, the hyphen "-" denotes the SNP site at the 21st base (Pos) of the reference genome "Ref" is not present in Isolate2.
Because of the code
snps.var <- apply(snps.core, 2, function(x) length(unique(x)))
snps.bi <- snps.core[, as.integer(snps.var) == 2]
snp.alleles <- .getAllelePairs(snps.bi)
G <- .encodeAlleles(snps.bi, snp.alleles)
function importCoreGenomeSNPs only works correctly when all SNPs are detected in all ingroup genomes (namely, cgSNPs). In other words, missing SNP genotypes (each is denoted by a hyphen) create false genotype counts. It is not necessary to address this limitation for GeneMates because our estimation of bacterial population structure only relies on biallelic SNP sites that are found in all ingroup isolates.
Function importCoreGenomeSNPs returns a large list to R. Cautions must be taken to run this function due to the large size of the output list: users are advised to check their computer capacity in the first place. Elements of the output list are listed as follows.
GeneMates provides function importAllelicPAM to read an allelic PAM, which is a binary matrix denoting presence (1) and absence (0) of alleles of the genes of interest across bacterial isolates. The PAM can be created using a helper pipeline PAMmaker and it has the following structure.
| Sample | Allele 1 | Allele 2 | Allele_3 |... | | ----- | -----| ----- | ----- | ----- | | Isolate_1 | 1 | 1 | 0 | ... | | Isolate_2 | 1 | 0 | 0 | ... | | ... | ... | ... | ... | ... |
Note that only the first column name "Sample" is fixed in every allelic PAM.
Users may use the R command ?importAllelicPAM
to see an argument list with explanations. The function carries out steps as follows.
Function importAllelicPAM returns a large list of seven elements.
Function importGeneticPAM reads a genetic PAM into R. A genetic PAM stores more information than its corresponding allelic PAM: it is a character matrix showing allele calls per gene of interest in bacterial isolates. Every genetic PAM has the following structure.
| Sample | Gene 1 | Gene 2 | Gene 3 | ... | | ----- | ----- | ----- | ----- | ----- | | Isolate 1 | Allele 1_1 | - | - | ... | | Isolate 2 | Allele 1_1 | Allele 2_1 | Allele 3_1 | ... | | Isolate 3 | - | Allele 2_2 | - | ... | | ... | ... | ... | ... | ... |
Note that only the first column name "Sample" is fixed in every genetic PAM. In addition, each hyphen denotes absence of any alleles of a gene in an isolate.
The function importGeneticPAM postulates that the genetic PAM follows an SRST2-compatible output format. Moreover, no ambiguity or variant sign ("?" and "*", respectively) is present in the PAM. We recommend to use PAMmaker to create genetic PAMs.
Function importGeneticPAM carries out the same procedure as importAllelicPAM because they share the same structure except argument names. Therefore, users may see Section 4.2 for details.
A list of three elements is returned by importGeneticPAM:
Function importPhysicalDists reads a table of physical distances into R. The table can be created from the Bandage output using a helper pipeline APDtools. The table must have eight columns of names "query1", "query2", "sample", "distance", "node_number", "source", "orientation" and "distance_path".
The processing of data in the function importPhysicalDists is simpler than the other three functions for data input. It only excludes distances from isolates that are not included in the parameter ingroup or included in the parameter outgroup.
Function importPhysicalDists returns a data frame of eight columns.
The major user interface of GeneMates, the function findPhysLink, returns a large named list whose elements are explained as follows.
tree = NULL
or the tree specified by users with this parameter). This list saves results from the tests of structural random effects in fitted LMMs. Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.