knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The gRodon
package allows you to predict the maximal growth rate of
a prokaryotic organism from genomic data. For details of how this
is done please see the original paper. Briefly, gRodon exploits codon usage
statistics to detect optimization of highly-expressed genes, an indicator of
selection for rapid growth. We will provide some guidance about when these predictions are
appropriate below.
The gRodon
package is quite simple, with only a single function currently
available to users: predictGrowth()
. We provide a simple, easy to
use interface that allows any user with a genome in-hand to predict maximal
growth rate.
The gRodon
approach is largely based on an earlier program, growthpred
,
developed by Vieira-Silva et al..
gRodon
is intended to be:
1. User-friendly (especially for R users)
2. More accurate (due to the incorporation of additional measures of codon usage)
3. More accurate on mixed-organism samples (i.e., metagenomes; due to a correction factor for the relative abundance of different organisms)
The easiest way to install gRodon is with devtools:
devtools::install_github("jlw-ecoevo/gRodon")
gRodon
has a few dependencies - namely the Biostrings
, coRdon
, and matrixStats
packages which are bioconductor packages and cannot be installed via CRAN.
To install them run the following:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("Biostrings") BiocManager::install("coRdon") install.packages("matrixStats")
There are two things that gRodon
does not do that you must do ahead of
time yourself in order to predict growth rate:
1. You must identify coding sequences in your genome
(e.g., using prodigal)
2. You must annotate a set of highly-expressed genes, typically those coding for ribosomal
proteins. We usually use genes coding for ribosomal proteins as our highly-expressed set since
these are universal across microbes and almost always highly expressed (if you wanna
grow you gotta make proteins), but if you have expression data for your organism
you might be able to tailor an even better set of highly expressed genes.
It is helpful if (in frame) coding sequences are stored in a fasta file for easy loading. Additionally, you will need to have a logical vector describing which genes are part of your highly expressed set (this should become a bit clearer with the examples below).
I'll lead you through one way to get annotated coding sequence to input into gRodon here, but there are many (MANY) ways to do this. For example, check out this set of tutorials on functional annotation.
Disclaimer: I make no claim that this is the best, most efficient, or most graceful way to get annotated coding sequences from your assembly, but it will get the job done. If you have strong opinions about how this should be done shoot me an email and I will be happy to link to your rant/tutorial. If you already have annotated coding sequences please skip this section. If you have unannotated CDS (e.g., from prodigal), you might consider simply BLASTing your output for highly expressed genes (e.g., using the blastdb of ribosomal proteins included in growthpred).
Given an un-annotated genome or metagenomic assembly, an easy way to find and annotate coding sequence is using the program prokka. For example, if I have a set of assembled contigs, I could run (on the command line):
prokka contig.fasta --prefix my_genome
This will give several output files. You are interested the untranslated coding sequences (CDS), which
can be found in the *.ffn
file along with other non-coding transcripts (e.g., rRNA, tRNA). To get the names
of just your CDS you can pull them out of the *.gff
output:
sed -n '/##FASTA/q;p' my_genome/my_genome.gff | awk '$3=="CDS"' | awk '{print $9'} | awk 'gsub(";.*","")' | awk 'gsub("ID=","")' > CDS_names.txt
The above command first trims the sequence data off the end of the *.gff
file, then looks for
rows describing CDS and pulls out their IDs, finally it outputs these IDs to
a file called CDS_names.txt
.
# Load your *.ffn file into R genes <- readDNAStringSet("my_genome/my_genome.ffn") # Subset your sequences to those that code for proteins CDS_IDs <- readLines("CDS_names.txt") gene_IDs <- gsub(" .*","",names(genes)) #Just look at first part of name before the space genes <- genes[gene_IDs %in% CDS_IDs] #Search for genes annotated as ribosomal proteins highly_expressed <- grepl("ribosomal protein",names(genes),ignore.case = T)
You are now ready to run gRodon.
We have included an example genome assembly with this package, downloaded from NCBI's RefSeq database (Streptococcus pyogenes M1, GCF_000349925.2). This is simply a fasta file with the predicted coding sequences and annotations for this genome (provided by NCBI).
Let's load this file using the readDNAStringSet()
function from the
Biostrings
package (required for gRodon
to work).
library(gRodon) library(Biostrings)
path_to_genome <- system.file('extdata', 'GCF_000349925.2_ASM34992v2_cds_from_genomic.fna.gz', package = 'gRodon') genes <- readDNAStringSet(path_to_genome)
We also need a set of highly expressed genes. In general, the ribosomal
proteins are a good set of genes to use for this purpose. Since these proteins
should already be annotated in our example file (try running names(genes)
to
see the annotations from NCBI) we can use grep()
to search for them
(specifically, grepl()
to return a logical vector).
highly_expressed <- grepl("ribosomal protein",names(genes),ignore.case = T)
And now we are ready to predict the maximal growth rate of S. pyogenes M1.
predictGrowth(genes, highly_expressed)
The warning at the top is letting you know that there were a number of genes in your fasta that either weren't a length multiple of three, or were under the length threshold of 80 codons (240 bp) where estimates of bias will be unreliable. Don't worry about this unless the number of filtered genes is very high compared to the total number of genes in your file. We expect some genes to typically get filtered.
The output contains several (hopefully) useful quantities:
$CUBHE
is the mean of the codon usage bias of each highly expressed gene relative to
all other genes$ConsistencyHE
is the mean of the codon usage bias of each highly expressed gene relative to
all other highly expressed genes$CPB
is the genome-wide codon pair bias$FilteredSequences
is the number of sequences filtered due to length$d
is the estimated minimal doubling time from gRodon
in hours. This is, presumably, what
you came here for.$LowerCI
and $UpperCI
are the 95% confidence intervals for d
For details on how each of these is calculated please see the gRodon
paper. For
most users the only outputs that will matter much to you are $d
, $LowerCI
, and $UpperCI
, which are your
estimated minimal doubling times and 95% CIs output by the gRodon model.
For the most part, that's all you need to know. There are a few specific use
cases we discuss below that require a slightly more thoughful application of
gRodon
. These are:
What if you don't have a nice, complete genome ... can you still use gRodon
?
Yes, you can, though you may want to use either "partial" or "metagenome" mode.
When your genome is incomplete, you may have insufficient data to accurately
estimate the codon pair bias (since there are many possible codon pairs). In
this case you can set gRodon
to "partial" mode, which excludes pair-bias
from the prediction. This is probably a good choice when working with incomplete
SAGs and MAGs. The expected decrease in accuracy is quite small (see
gRodon
paper).
predictGrowth(genes, highly_expressed, mode="partial")
For bulk estimation of the average community-wide maximal growth rate from metagenomes, our consistency statistic is not appropriate, since different organisms may prefer different codons (even when they have similar bias values). It also doesn't make much sense to calculate the pair-bias in this scenario. Thus the model for predicting the average maximal growth rate of a metagenomic sample excludes consistency and pair-bias. Importantly, metagenome mode is expected to be less accurate than the default mode, so only use this mode if you must (i.e., you have a metagenomic sample).
EVEN MORE IMPORTANTLY - if you have a metagenome (mixed community sample) you MUST use metagenome mode. Otherwise, gRodon's consistency statistic will think you have a single organism that has a bunch of highly expressed genes optimized in different ways, and gRodon will severely underestimate your maximal growth rate. If you are getting really long minimal doubling times check to make sure you are using metagenome mode.
Let's try this on one of the samples from Okie et al., as discussed in the
gRodon
manuscript:
path_to_metagenome <- system.file('extdata', 'ERR2143764_fastp_prokka_scaffolds.ffn.gz', package = 'gRodon') genes <- readDNAStringSet(path_to_metagenome) highly_expressed <- grepl("ribosomal protein",names(genes),ignore.case = T) predictGrowth(genes, highly_expressed, mode = "metagenome")
You may also want to incorporate a correction for the relative abundance of different
organisms in this calculation (e.g., weighted metagenome mode) by specifying the depth_of_coverage
option. This should lead to more realistic estimates of the average community-wide maximal growth rate.
To use this correction provide depth_of_coverage
as a vector of mean depths of coverage for each gene.
It does not matter if these coverages are normalized since we only care about relative coverage.
First let's take a look at our read depths:
path_to_coverage <- system.file('extdata', 'ERR2143764_fastp_map2ffn_counts.tsv', package = 'gRodon') read_depths <- read.delim(path_to_coverage, stringsAsFactors = FALSE) depths <- read_depths$meandepth #Make sure in the correct order names(depths) <- read_depths$X.rname depth_of_coverage <- depths[gsub(" .*", "", names(genes))] head(depth_of_coverage)
And now let's run weighted metagenome mode:
predictGrowth(genes, highly_expressed, mode = "metagenome", depth_of_coverage = depth_of_coverage)
All codon-usage based predictors of maximal growth will underestimate minimal doubling times when these doubling times are very long. In other words, these predictors will often indicate that slow-growing microbes are able to grow more quickly than they actually can. See the original paper for a more in-depth discusison, but this is because codon-usage bias tends to plateau at a lower-bound in slow-growers.
What does this mean for you? Well, gRodon
can reliably tell you if your
microbe grows slowly, but it can't tell you how slowly. In practice, minimal doubling
time predictions appear to be accurate up to 5 hours. If gRodon
predicts a
doubling time above 5 hours, you can confidently say your microbe grows slowly,
but not quite how slowly (minimal doubling times >5hrs tend to be underestimates).
In these cases gRodon
will even warn you to be careful:
path_to_genome <- system.file('extdata', 'GCF_003253775.1_ASM325377v1_cds_from_genomic.fna.gz', package = 'gRodon') genes <- readDNAStringSet(path_to_genome) highly_expressed <- grepl("ribosomal protein",names(genes),ignore.case = T) predictGrowth(genes, highly_expressed)
The example above is for a Mycobacterium leprae genome. This organism has a doubling time around 240 hours. So we are off by an order of magnitude, but still well within the "slow grower" range.
Vieira-Silva et al. showed that
codon-usage based maximal growth estimators are sensitive to growth temperature.
Thermophiles and psychrophiles showed consistently higher/lower growth rates
respectively. We included a temperature correstion factor in gRodon
to account
for this. We caution that this correction is based on only 31 extremophile
species, and that gRodon
has primarily been validated on
mesophilic organisms.
predictGrowth(genes, highly_expressed, temperature = 33)
If working with thermophiles or psychrophiles you might consider using gRodon
's
alternative training set (below) which includes more extremophiles than
the default training set.
We also include a version of gRodon
trained on an alternative set of minimal
doubling time estimates from a trait database compiled by Madin et al.. This is a larger training
set than the set of minimal doubling times fron Vieira-Silva et al., though both training sets give
very similar results. See the gRodon
paper for an in-depth discussion.
path_to_genome <- system.file('extdata', 'GCF_000349925.2_ASM34992v2_cds_from_genomic.fna.gz', package = 'gRodon') genes <- readDNAStringSet(path_to_genome) highly_expressed <- grepl("ribosomal protein",names(genes),ignore.case = T) predictGrowth(genes, highly_expressed, training_set = "madin")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.