README.md

GeneMates: an R package identifying horizontal gene co-transfer between bacteria

GeneMates is an R package implementing a network approach to identify horizontal gene co-transfer (HGcoT) between bacteria using whole-genome sequencing (WGS) data. It is particularly useful for investigating intra-species HGcoT, where presence-absence status of acquired genes is usually confounded by bacterial population structure due to clonal reproduction.

Citation

We would appreciate your citation if you find GeneMates or its helper scripts helpful to your project:

This project was supported by the Department of Biochemistry and Molecular Biology, University of Melbourne, Victoria, Australia.

Table of Contents

  1. Installation
  2. Quick start
  3. Function hierarchy
  4. Inputs
  5. Output
  6. References

1. Installation

The latest stable release is v0.2.2 (21 March 2020). You may find previous versions on the Releases page.

There are two approaches to install this package. Take GeneMates v0.2.2 for example. Assuming that we are going to install the package under a user-specified directory Lib:

R

install.packages(pkgs = "GeneMates_0.2.2.tar.gz", lib = "Lib")

bash

The program Rscript should be accessible as a command. Namely, the path of R should be added to $PATH before hand.

./install_GeneMates.sh GeneMates_0.2.2.tar.gz Lib 

1.1. Dependencies

Software

R packages

Since dependent R packages are not maintained by the GeneMates team and R may not install them automatically as you install GeneMates, you may follow manuals of these packages to complete the installation.

1.2. Helper program and scripts

We have developed the following scripts to help users to prepare input data for GeneMates as well as to validate results:

2. Quick start

Function findPhysLink is a pivotal function and a major user interface of this package. Here we show a typical procedure for the usage of GeneMates. We assume this procedure is run on a computing server.

setwd("Analysis")

library(GeneMates)

tr <- read.tree("rooted.tree")  # a maximum-likelihood tree

assoc <- findPhysLink(snps = "Analysis/snps.csv", snps.delim = ",", pos.col = "Pos", ref.pos = "Ref", min.mac = 1, genetic.pam = "gpam.tsv", genetic.pam.delim = "\t", allelic.pam = "apam.tsv", allelic.pam.delim = "\t", mapping = NULL, min.count = 15, phys.dists = "prioritised_dists.tsv", dist.delim = "\t", max.node.num = 2, max.dist = 250e3, ref = "ref", tree = tr, min.co = 2, d.qs = c(0, 0.25, 0.5, 0.75, 1), max.p = 0.05, max.range = 2000, min.pIBD = 0.9, output.dir = "Output", prefix = "demo", gemma.path = "~/Apps/gemma", n.cores = 8, save.stages = TRUE, del.temp = FALSE, skip = TRUE)

snps <- assoc[["snps"]]  # a large list
saveRDS(snps, file = "Out/snps.rds")  # Analysis/Out/snps.rds

assoc <- assoc[-which(names(assoc) == "snps")]
saveRDS(assoc, file = "Out/assoc.rds")  # Analysis/Out/assoc.rds

The element "snps" in the result list is usually too large to be loaded to an R session when the sample size or SNP number is large. Therefore we recommend to save the result list in two files.

3. Function hierarchy

This section illustrates organisation of public functions in GeneMates.

3.1. Main function

findPhysLink: the main function of GeneMates. It has the following subordinate functions.

3.2. Network analysis

The following functions analyse the output network of findPhysLink.

3.3. Summary statistics of input data

3.4. Analysis of population structural

3.5. Identification of mobile gene clusters

3.6. Spatiotemporal analysis

GeneMates comprises functions analysing spatial and temporal distributions of bacterial isolates and alleles of genes of interest.

3.7. Visualisation

3.8. Helper functions

Functions under this category are developed for helping users to extract and inspect specific aspects of results from findPhysLink.

3.9. Alternative association analysis

4. Inputs

GeneMates takes as input four kinds of data for detection of HGcoT. A function is created for importing each kind of data. This section explain the usage of these four functions.

4.1. Importing core-genome SNP matrix

Input

Function importCoreGenomeSNPs reads a cgSNP table, encodes SNP genotypes, extracts biallelic SNPs and performs zero-centring of encoded SNP genotypes. The behaviour of this function is similar to the function get_SNP_data in R package BugWAS (see BUGWAS_modular.R). The function importCoreGenomeSNPs expects the SNP table to follow the output format of the script parseSNPtable.py in a read-mapping pipeline RedDog. Accordingly, the SNP table should be stored as a CSV file by default and an example of its structure is shown as follows.

| Pos,Ref,Isolate1,Isolate2,Isolate3,... | | -------------------------------------- | | 10,A,A,A,A,... | | 21,C,C,C,C,... | | 25,C,C,C,T,... | | ... |

Here, the hyphen "-" denotes the SNP site at the 21st base (Pos) of the reference genome "Ref" is not present in Isolate2.

Arguments (excluding those specifying output filenames)

Limitation

Because of the code

snps.var <- apply(snps.core, 2, function(x) length(unique(x)))
snps.bi <- snps.core[, as.integer(snps.var) == 2]
snp.alleles <- .getAllelePairs(snps.bi)
G <- .encodeAlleles(snps.bi, snp.alleles)

function importCoreGenomeSNPs only works correctly when all SNPs are detected in all ingroup genomes (namely, cgSNPs). In other words, missing SNP genotypes (each is denoted by a hyphen) create false genotype counts. It is not necessary to address this limitation for GeneMates because our estimation of bacterial population structure only relies on biallelic SNP sites that are found in all ingroup isolates.

Procedure

Output

Function importCoreGenomeSNPs returns a large list to R. Cautions must be taken to run this function due to the large size of the output list: users are advised to check their computer capacity in the first place. Elements of the output list are listed as follows.

4.2. Importing allelic presence-absence matrix

Input

GeneMates provides function importAllelicPAM to read an allelic PAM, which is a binary matrix denoting presence (1) and absence (0) of alleles of the genes of interest across bacterial isolates. The PAM can be created using a helper pipeline PAMmaker and it has the following structure.

| Sample | Allele 1 | Allele 2 | Allele_3 |... | | ----- | -----| ----- | ----- | ----- | | Isolate_1 | 1 | 1 | 0 | ... | | Isolate_2 | 1 | 0 | 0 | ... | | ... | ... | ... | ... | ... |

Note that only the first column name "Sample" is fixed in every allelic PAM.

Procedure

Users may use the R command ?importAllelicPAM to see an argument list with explanations. The function carries out steps as follows.

Output

Function importAllelicPAM returns a large list of seven elements.

4.3. Importing genetic presence-absence matrix

Input

Function importGeneticPAM reads a genetic PAM into R. A genetic PAM stores more information than its corresponding allelic PAM: it is a character matrix showing allele calls per gene of interest in bacterial isolates. Every genetic PAM has the following structure.

| Sample | Gene 1 | Gene 2 | Gene 3 | ... | | ----- | ----- | ----- | ----- | ----- | | Isolate 1 | Allele 1_1 | - | - | ... | | Isolate 2 | Allele 1_1 | Allele 2_1 | Allele 3_1 | ... | | Isolate 3 | - | Allele 2_2 | - | ... | | ... | ... | ... | ... | ... |

Note that only the first column name "Sample" is fixed in every genetic PAM. In addition, each hyphen denotes absence of any alleles of a gene in an isolate.

The function importGeneticPAM postulates that the genetic PAM follows an SRST2-compatible output format. Moreover, no ambiguity or variant sign ("?" and "*", respectively) is present in the PAM. We recommend to use PAMmaker to create genetic PAMs.

Procedure

Function importGeneticPAM carries out the same procedure as importAllelicPAM because they share the same structure except argument names. Therefore, users may see Section 4.2 for details.

Output

A list of three elements is returned by importGeneticPAM:

4.4. Importing physical distances between genomic loci

Input

Function importPhysicalDists reads a table of physical distances into R. The table can be created from the Bandage output using a helper pipeline APDtools. The table must have eight columns of names "query1", "query2", "sample", "distance", "node_number", "source", "orientation" and "distance_path".

Procedure

The processing of data in the function importPhysicalDists is simpler than the other three functions for data input. It only excludes distances from isolates that are not included in the parameter ingroup or included in the parameter outgroup.

Output

Function importPhysicalDists returns a data frame of eight columns.

5. Output

The major user interface of GeneMates, the function findPhysLink, returns a large named list whose elements are explained as follows.

References

  1. McVean, G. A Genealogical Interpretation of Principal Components Analysis. PLOS Genet. 5, e1000686 (2009).
  2. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet 44, 821–824 (2012).


wanyuac/GeneMates documentation built on Aug. 12, 2022, 7:37 a.m.