pangenome: Compute the pangenome of a set of related organisms.

View source: R/main.R

pangenomeR Documentation

Compute the pangenome of a set of related organisms.

Description

Computes the pangenome of a set of related organisms. Takes a list of gff3 files, a path to Pfam HMM models and data associated to those models.

It first uses hmmsearch (HMMER 3) to look for Pfam domains in the aminoacid sequences and, where it finds them, creates a domain profile. Then clusterize sequences with the same domain structure in protein families.

Those proteins without any domain identified are compared with phmmer (HMMER 3) and clustered with mcl into protein families.

Protein families obtained from both two previous steps are then splited into true orthologues by a gene tree-prunning algorithm, creating a pangenome profile of the group of organisms.

Usage

pangenome(
  gffs,
  hmm_pfam,
  dat_pfam,
  n_threads = 1,
  minhash_split = FALSE,
  group_prefix = "group",
  sep = "___",
  verbose = TRUE
)

Arguments

gffs

A vector of gff3 file paths as retrieved by prokka (Seeman, 2014). These files must have a ".gff" extension.

hmm_pfam

character with the path to Pfam-A.hmm file. If any of hmmPfam or datPfam is left NULL, then the pfam step is skipped and sequences will be preclustered by phmmer and mcl.

dat_pfam

character with the path to Pfam-A.hmm.dat file. If any of hmm_pfam or dat_pfam is left NULL, then the pfam step is skipped and sequences will be preclustered by phmmer and mcl.

n_threads

integer. The number of threads to use.

minhash_split

logical. Whether to use fast minhash distantance calculation instead of alignment in precluster spliting step. Recomended when working with big datasets (> 50 genomes). Default is FALSE.

group_prefix

character. The prefix you want to use to name the cluster of orthologues. Default is "group". Cluster names will be formed by pasteing this prefix to a number identifiying each cluster, in the form [group_prefix]0001.

sep

character. A separator to form unique gene names using the name of the organism, in the form (if sep='___') organism___geneid. This is mostly used inside the function, although is used to form unique sequence names in the final output. Default is "___" (3 underscores), and should be changed if any of your input files or gene identifiers already contain this separator string, so it not conflicts the process.

verbose

logical. Whether display progress messages or not.

Details

A scan against Pfam-A database is performed and only those hits of PF class 'domain' or 'family' are considered for further analysis. If two or more domains of the same clan overlap, then the one with the smaller e-value is kept. For those proteins which have Pfam hits, a profile of each protein is determined by the consecutive sequence of Pfam domains, and those proteins with the same domain structure are then clustered together to form protein families.

Phmmer is then used to compare all those proteins with no Pfam domains identified and the Markov Clustering algorithm (mcl) clusters them into coarse groups (i.e. protein families).

As this package is designed to handle groups of organisms related at a greater level than species (although it works perfectly fine with close related clades), a paralogue split based on conserved neighbourhood was not considered because this relationships get lose at distantly related organisms. Instead of that it relies on the concept of orthology itself to diferenciate between paralogues and true orthologues, this is, a gene tree based inference.

With each protein familiy groups which contain paralogues, an alignment followed by all-vs-all distance calculation is performed. To align DNA sequences, function AlignTranslation (DECIPHER package) is used. A neighbour-joining tree is then generated and midpoint rooted. If recent paralogues are detected (this is, nodes which all its descendants belong to the same organism), then all but one tips are removed from the gene-tree and saved. Finally, the gene-tree is splited in as many sub-trees as necessary to have the minimum set of sub-trees with just one gene of each organism. Paralogues are then assigned to the cluster with the copheneticaly closer tip.

In that way true-orthologue clusters are retrieved and saved in an object of class "PgR6MS".

Value

An object of class PgR6MS (pagoo package). For details of this type of object, consult help('PgR6MS'), or go to https://github.com/iferres/pagoo for a tutorial showing how to use this class. A bunch of statistical and visualization methods are provided, as well as examples of pangenome manipulation, and how to use it with popular third party R packages to perform comparative genomics analyses, infer phylogenies, or create highly customized figures.

Note

External dependencies are HMMER3 and MCL, without them, running the function will stop with an error.

Author(s)

Ignacio Ferres and Gregorio Iraola

Examples

## Not run: 
setwd(tempdir())

# Download Pfam-A.hmm
pfam_hmm_url <- 'http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz'
pfam_hmm_gz <- 'Pfam-A.hmm.gz'
download.file(url = pfam_hmm_url, destfile = pfam_hmm_gz, method = 'wget')

# Download Pfam-A.hmm.dat
pfam_dat_url <- 'http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz'
pfam_dat_gz <- 'Pfam-A.hmm.dat.gz'
download.file(url = pfam_dat_url, destfile = pfam_dat_gz, method = 'wget')

# Uncompress those files
library(R.utils)
gunzip(pfam_hmm_gz)
gunzip(pfam_dat_gz)
pfam_hmm <- sub('[.]gz$', '', pfam_hmm_gz)
pfam_dat <- sub('[.]gz$', '', pfam_dat_gz)

# Extract pewit's example data
example_data <- system.file('extdata', 'Hinfluenzae.tar.gz', package = 'pewit')
untar(tarfile = example_data)
gffs <- list.files(pattern = '[.]gff$')

# Run pewit
# note: You must have HMMER 3 and MCL installed !
library(pewit)
pg <- pangenome(gffs = gffs, hmm_pfam = pfam_hmm, dat_pfam = pfam_dat, n_threads = 1)


## End(Not run)

iferres/pewit documentation built on June 22, 2022, 8:34 p.m.