pangenome | R Documentation |
Computes the pangenome of a set of related organisms. Takes a list of gff3 files, a path to Pfam HMM models and data associated to those models.
It first uses hmmsearch (HMMER 3) to look for Pfam domains in the aminoacid sequences and, where it finds them, creates a domain profile. Then clusterize sequences with the same domain structure in protein families.
Those proteins without any domain identified are compared with phmmer (HMMER 3) and clustered with mcl into protein families.
Protein families obtained from both two previous steps are then splited into true orthologues by a gene tree-prunning algorithm, creating a pangenome profile of the group of organisms.
pangenome( gffs, hmm_pfam, dat_pfam, n_threads = 1, minhash_split = FALSE, group_prefix = "group", sep = "___", verbose = TRUE )
gffs |
A |
hmm_pfam |
|
dat_pfam |
|
n_threads |
|
minhash_split |
|
group_prefix |
|
sep |
|
verbose |
|
A scan against Pfam-A database is performed and only those hits of PF class 'domain' or 'family' are considered for further analysis. If two or more domains of the same clan overlap, then the one with the smaller e-value is kept. For those proteins which have Pfam hits, a profile of each protein is determined by the consecutive sequence of Pfam domains, and those proteins with the same domain structure are then clustered together to form protein families.
Phmmer is then used to compare all those proteins with no Pfam domains identified and the Markov Clustering algorithm (mcl) clusters them into coarse groups (i.e. protein families).
As this package is designed to handle groups of organisms related at a greater level than species (although it works perfectly fine with close related clades), a paralogue split based on conserved neighbourhood was not considered because this relationships get lose at distantly related organisms. Instead of that it relies on the concept of orthology itself to diferenciate between paralogues and true orthologues, this is, a gene tree based inference.
With each protein familiy groups which contain paralogues, an alignment followed by all-vs-all distance calculation is performed. To align DNA sequences, function AlignTranslation (DECIPHER package) is used. A neighbour-joining tree is then generated and midpoint rooted. If recent paralogues are detected (this is, nodes which all its descendants belong to the same organism), then all but one tips are removed from the gene-tree and saved. Finally, the gene-tree is splited in as many sub-trees as necessary to have the minimum set of sub-trees with just one gene of each organism. Paralogues are then assigned to the cluster with the copheneticaly closer tip.
In that way true-orthologue clusters are retrieved and saved in an object of class "PgR6MS".
An object of class PgR6MS (pagoo package). For details
of this type of object, consult help('PgR6MS')
, or go to
https://github.com/iferres/pagoo for a tutorial showing how to use this
class. A bunch of statistical and visualization methods are provided, as well
as examples of pangenome manipulation, and how to use it with popular third
party R packages to perform comparative genomics analyses, infer phylogenies,
or create highly customized figures.
External dependencies are HMMER3 and MCL, without them, running the function will stop with an error.
Ignacio Ferres and Gregorio Iraola
## Not run: setwd(tempdir()) # Download Pfam-A.hmm pfam_hmm_url <- 'http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz' pfam_hmm_gz <- 'Pfam-A.hmm.gz' download.file(url = pfam_hmm_url, destfile = pfam_hmm_gz, method = 'wget') # Download Pfam-A.hmm.dat pfam_dat_url <- 'http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz' pfam_dat_gz <- 'Pfam-A.hmm.dat.gz' download.file(url = pfam_dat_url, destfile = pfam_dat_gz, method = 'wget') # Uncompress those files library(R.utils) gunzip(pfam_hmm_gz) gunzip(pfam_dat_gz) pfam_hmm <- sub('[.]gz$', '', pfam_hmm_gz) pfam_dat <- sub('[.]gz$', '', pfam_dat_gz) # Extract pewit's example data example_data <- system.file('extdata', 'Hinfluenzae.tar.gz', package = 'pewit') untar(tarfile = example_data) gffs <- list.files(pattern = '[.]gff$') # Run pewit # note: You must have HMMER 3 and MCL installed ! library(pewit) pg <- pangenome(gffs = gffs, hmm_pfam = pfam_hmm, dat_pfam = pfam_dat, n_threads = 1) ## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.