Creating pangenomes using FindMyFriends


Introduction

FindMyFriends is an R package for creating pangenomes of large sets of microbial organisms, together with tools to investigate the results. The number of tools to create pangenomes is large and some of the more notable ones are Roary, OrthoMCL, PanOCT, inParanoid and Sybil. Within the R framework micropan (available from CRAN) is currently the only other entry, but the scope of FindMyFriends is much broader.

Within comparative microbiol genomics, a pangenome is defined as a collection of all the genes present in a set of related organisms and grouped together by some sort of homology. This homology measure has mainly been some sort of blast derived similarity, but as you shall see here, this is not the only possibility.

Pangenomes have many uses, both as organisational tools and as a mean to get insight into the collective genomic information present within a group of organisms. An example of the former is to ensure an equal functional annotation of all genes within a set of genomes by annotating pangenome gene groups rather than single genes. An example of the latter is to use a pangenome together with phenotypic information to discover new links between genes and phenotypes.

Usually pangenomes are calculated from the result of blasting all genes in the organisms under investigation against each other. This step is very time-consuming due to the complexity of the BLAST algorithm and the need to compare everything with everything (resulting in a quadratic scaling). FindMyFriends generally takes a different approach by ditching BLAST altogether. In its absence FindMyFriends uses alignment free sequence comparisons, based on decomposition of the sequences into K-mer feature vectors. K-mers are words of equal length derived by sliding a window of size K over the sequence and a K-mer feature vector is simply the count of each unique word, as seen below

GATTCGATTAG  ->  ATT: 2
                 CGA: 1
                 GAT: 2
                 TAG: 1
                 TCG: 1
                 TTA: 1
                 TTC: 1

With this decomposition the problem of calculating sequence similarity is reduced to a vector similarity problem, for which there are many different solution. The one employed in FindMyFriends is called cosine similarity, which is effectively the cosine to the angle between the two vectors. If the vector space is strictly positive then the value is bound between 0 and 1 with 1 being 100 % similarity.

Apart from the use of K-mers over BLAST, FindMyFriends also advocate a different approach to how the grouping is performed. Normally the final (or close to final) grouping is derived directly from the main similarity comparison step. In FindMyFriends this is turned around, and the main similarity step is very coarse, resulting in very large gene groups, unsuited as a final result. These groups are then investigated in more detail with regards to the sequence similarity, neighborhood similarity and sequence length. This division makes it possible to employ a very fast, but not very precise algorithm for the first step, and only use time to investigate gene pairs with a high likelihood of being equal. In the end this approach results in a linear scaling with regards to the number of genes, as well as a general speedup, and to my knowledge FindMyFriends is currently by far the fastest algorithm for creating pangenomes.

FindMyFriends is flexible and there are many ways to go about creating pangenomes - you can even specify them manually making it possible to import results from other algorithms into the framework. This vignette will only focus on the currently recommended approach.

Data to use in this vignette

This vignette will use a collection of 10 different Mycoplasma pneumonia and hyopneumonia strains. These organisms have an appreciable small genome making it easier to distribute and analyse.

# Unpack files
location <- tempdir()
unzip(system.file('extdata', 'Mycoplasma.zip', package='FindMyFriends'),
      exdir=location)
genomeFiles <- list.files(location, full.names=TRUE, pattern='*.fasta')

Creating a Mycoplasma pangenome

Creating a Pangenome object from a set of fasta files

The first step in performing a pangenomic analysis with FindMyFriends is to load gene data into a Pangenome object. This is done with the aptly named pangenome function, which takes a list of fasta files, a boolean indicating whether the genes are translated and optionally information about the position of each gene on the chromosome and a boolean indicating whether to read all sequences into memory.

library(FindMyFriends)

mycoPan <- pangenome(genomeFiles[1:9], 
                     translated = TRUE, 
                     geneLocation = 'prodigal', 
                     lowMem = FALSE)
mycoPan

The last parameters of the function call warrants some more explanation. Some of the algorithms in FindMyFriends uses positional information and this must be supplied with the geneLocation parameter. This can be done in a number of ways: As a data.frame with one row per gene, as a function that takes the header info of each gene from the fasta file and returns a data.frame with on row per gene, or as a string identifying the annotator used to find the gene (currently only Prodigal is supported). The gene info has the following format (all columns required):

head(geneLocation(mycoPan))

The contig column holds information on the contig/chromosome on which the gene is located, the start and stop columns gives the start and stop position of translation and the strand column identifies on which strand the gene is located. Additional columns are allowed, but is currently unused.

The last parameter (lowMem) specifies whether all sequences should be read into memory (the default) or be kept as references to the original fasta files. The latter is useful in cases where the size of the sequence data becomes prohibitively large but is not needed for our little toy example.

Metadata about the organisms can be added and queried using the addOrgInfo method. For instance we might add some taxonomy data.

library(reutils)
orgMeta <- lapply(orgNames(mycoPan), function(name) {
    uid <- esearch(name, 'assembly')
    taxuid <- elink(uid, dbTo = 'taxonomy')
    reutils::content(esummary(taxuid), as = 'parsed')
})
orgMeta <- lapply(lapply(orgMeta, lapply, unlist), as.data.frame)
orgMeta <- do.call(rbind, orgMeta)

mycoPan <- addOrgInfo(mycoPan, orgMeta)
head(orgInfo(mycoPan))

The raw sequences are easily accessible as an XStringSet object, or split by organism into an XStringSetList object:

genes(mycoPan)
genes(mycoPan, split = 'organism')

Apart from that the object really needs some grouping before there is anything interesting to do with it.

Object defaults

A lot of the methods in FindMyFriends contains similar parameters, and a lot of these. To save typing when doing the analyses, these parameters can be set on a per-object manner. At creation a set of defaults is already set, which can be queried and modified:

# Query current defaults
head(defaults(mycoPan))

# Set a new default
defaults(mycoPan)$lowerLimit <- 0.6

Values supplied as arguments to a method will always override object defaults.

Calculating pangenomes

Pangenome calculation is generally split into two steps; one focusing on a coarse clustering of genes based on sequence similarity and one focusing on refining this clustering by comparing sequences and chromosomal neighborhood between genes in the clusters. Currently the fastest approach to the first step is based on the CD-Hit algorithm (cdhitGrouping), but other approaches exists (gpcGrouping and graphGrouping). There is no advantages to using the slower algorithms, so there use is generally not encouraged. cdhitGrouping works by repeatedly combining gene groups based on lower and lower similarity threshold. At each step the longest member of each gene group is chosen as a representative for the group in the following step. You generally want to set the lowest threshold rather low to ensure that genes belonging to the same group are definitely clustered together.

mycoPan <- cdhitGrouping(mycoPan, kmerSize = 5, cdhitIter = TRUE, nrep = 3)

Looking at our mycoPan object now reveals some additional information about the gene groups:

mycoPan

This is not the end of our analysis, as these gene groups needs to be refined. The refinement step is done with the neighborhoodSplit function (or kmerSplit if chromosomal position is unavailable). This function investigates each gene group and compares the member genes based on sequence similarity, chromosomal neighborhood similarity, sequence length and genome mebership, and uses this information to build up a weighted graph with the member genes as nodes. Edges reflect similarity and are only created if the two genes passes all thresholds with regards to the different comparisons. From this graph cliques (complete subgraphs i.e. subgraphs were all nodes are connected to each other) are extracted sequentially, starting with the clique containing the highest weighted edges. This ensures that all members of the resulting gene groups share a similarity with all other members and avoids the inclusion of genes being very similar to a few genes but not all. After this main step highly similar gene groups lying in parallel (sharing a gene group either up- or downstream) are merged together to create the final grouping.

mycoPan <- neighborhoodSplit(mycoPan, lowerLimit = 0.8)

As can be seen this step results in an increase in the number of gene groups:

mycoPan

In general you should expect that the use of FindMyFriends will result in a higher number of gene groups than other algorithms. This is due to its strict approach to defining similarity and in general you can expect the final result to be of very high quality.

Pangenome post-processing

Once the pangenome has been calculated it is possible to perform a range of different post-processing steps and analyses on results.

Paralogue linking

FindMyFriend forcefully avoids grouping genes from the same genome together. While it is generally a boon to split up paralogue gene groups, as the genes might have different functionality within the cell (and probably be under different regulation), it can be nice to maintain a link between groups with a certain similarity. Such a link can be obtained by calculating a kmer similarity between representatives from each gene group. Once paralogue links have been created they can be used when extracting raw sequences and the information will be taken into account when calculating organism statistics. Furthermore it is possible to merge paralogues into true gene groups (effectively undoing the neighborhood splitting).

mycoPan <- kmerLink(mycoPan, lowerLimit=0.8)

genes(mycoPan, split='paralogue')[[1]]

mycoPan
collapseParalogues(mycoPan, combineInfo='largest')

Removing genes

Usually genes are detected automatically by programs suchs as Prodigal and Glimmer, but while these programs are generally good, they are not perfect. A pangenome analysis can potentially reveal genes that, while annotated, are no longer functioning due to frameshifts etc. These will be apparant as members of gene groups that have vastly different length than the majority of the members. Removing such genes will in general improve whatever downstream analysis that the data will be used for. Using the removeGene() function it is possible to remove single or sets of genes from your pangenome object.

# Remove a gene by raw index
removeGene(mycoPan, ind=60)

# Remove the first organism by index
removeGene(mycoPan, organism=1)
# or by name
name <- orgNames(mycoPan)[1]
removeGene(mycoPan, organism=name)

# Remove the second member of the first gene group
removeGene(mycoPan, group=1, ind=2)

Investigating the results

Having a nice structure around your pangenome data is just a part of a great framework - having a set of analyses at hand for investigating the results is another part. Luckily, by being part of Bioconductor and using the standard data representations, a lot of the functionality already provided in Bioconductor is at your disposal. The two main data types that you might want to extract in order to do further analysis is the pangenome matrix and raw sequences:

# Get the pangenome matrix as an ExpressionSet object
as(mycoPan, 'ExpressionSet')
# or as a regular matrix
as(mycoPan, 'matrix')[1:6, ]

# Get all genes split into gene groups
genes(mycoPan, split='group')

Apart from what is already available within Bioconductor, FindMyFriends also comes with a set of tools to investigate the results further. Two of these calculates different statistics on the two main gene groupings in the dataset, namely gene groups and organisms. These functions are called groupStat()and orgStat() and are very straightforward to use:

groupStat(mycoPan)[[1]]

head(orgStat(mycoPan))

To get a very broad overview of your result you can use plotStat(). Besides that, three other plot functions exists to let you asses general properties of the pangenome. plotEvolution() lets you follow the number of singleton, accessory and core genes as the number of organisms increases. Generally this type of plot is very biased towards the order of organisms, so while it is possible to supply a progression of organisms, the default approach is to create bootstraped values for the plot:

plotEvolution(mycoPan)

plotSimilarity() creates a heatplot of the similarity matrix of the organisms in the pangenome. The similarity of two organisms is here defined as either the percent of shared genes or the cosine similarity of their total kmer feature vector. The ordering can be done manually or automatically according to a hierachcal clustering:

# Pangenome matrix similarity
plotSimilarity(mycoPan)
# Kmer similarity
plotSimilarity(mycoPan, type='kmer', kmerSize=5)
# No ordering
plotSimilarity(mycoPan, ordering='none')

plotTree() Plots a dendrogram of the organisms in the pangenome based either on the pangenome matrix or kmer counts. The tree can be augmented with additional information from either orgInfo or supplied manually:

plotTree(mycoPan, clust='ward.D2', dist='minkowski')
plotTree(mycoPan, type='kmer', kmerSize=5, clust='ward.D2', 
         dist='cosine', circular=TRUE, info='Species') + 
    ggplot2::scale_color_brewer(type='qual', palette=6)

Apart from these general statistics it is possible to get the neighborhood of any given gene group as a weighted graph to further investigate how the up- and downstream genes are organised.

library(igraph)
getNeighborhood(mycoPan, group=15, vicinity=5)

The resulting graph object can be plotted and handled by the functionality of the igraph package or plotted directly using FindMyFriends plotNeighborhood() method, which applies appropriate styling to the plot, making it easier to interpret:

plotNeighborhood(mycoPan, group=15, vicinity=5)

Panchromosomal analysis

While pangenomes are often envisioned as presence/absence matrices, this is not the only way to organise the information. As chromosomal position is available in most circumstances, the information can also be converted into a panchromosomal graph where each gene group is a vertex and chromosomal adjacency is weighted edges. The graph will generally be very sparse, consisting of long strings of gene groups interspersed with regions of higher variability.

pcGraph(mycoPan)

This graph structure can be used for a range of things and actually is the basis for the last part of the neighborhood splitting step (merging parallel gene groups). Often local regions of high variability can point to insertion/deletion events, frameshifts, problems with the gene grouping (god forbid) or general regions of high chromosomal plasticity. These regions can of course be detected in FindMyFriends and investigated accordingly.

localVar <- variableRegions(mycoPan, flankSize=6)
localVar[[1]]
plot(localVar[[1]]$graph)

Additional tools

What is explained above is just a small part of what is possible with FindMyFriends. There are tie-ins with other packages such as r Biocpkg("PanVizGenerator"), and support for functional annotation of gene groups, to name a few. Go explore!

Session

sessionInfo()


Try the FindMyFriends package in your browser

Any scripts or data that you put into this service are public.

FindMyFriends documentation built on Nov. 8, 2020, 6:46 p.m.