knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
To follow this tutorial you will need to have the following libraries installed. Please feel free to run the following cell to install them if they are not yet installed.
if (!require(Biostrings)) install.packages('Biostrings') if (!require(ggplot2)) install.packages('tidyverse') if (!require(tidyverse)) install.packages('tidyverse') if (!require(broom)) install.packages('broom') library(utilsHMMER) library(Biostrings) library(tidyverse) library(ggplot2) library(broom)
The objective of this article is to show the main functionalities of the utilsHMMER
library using a practical case. We will search for homologous sequences for the whole SARS-CoV-2 reference proteome.
First, let's download the reference proteome for the Sars_CoV-2. To do this, we will read the compressed file directly from UniProt with the readAAStringSet
function from the Biostrings
library.
url <- paste0("https://ftp.uniprot.org/pub/databases/uniprot/current_release/", "knowledgebase/reference_proteomes/Viruses/UP000000354/UP000000354_694009.fasta.gz") sars.cov.fasta <- readAAStringSet(url)
Now, sars.cov.fasta
is an object of class AAStringSet
. We can inspect the object by printing it. So we can see that there are 15 protein sequences, the length of each sequence (parameter width
), the beginning and end of the sequence (parameter seq) and the name of each (parameter
names`).
sars.cov.fasta
Let's look at the headers of the different fasta sequences:
names(sars.cov.fasta)
library(kableExtra) tibble("names" = names(sars.cov.fasta)) %>% kbl()%>% kable_paper("basic", full_width = F)
We can access each of the individual sequences using the $
or [
operator. However, right now the names identifying the sequences are too long and somewhat inconvenient to work with. We can solve this by applying egular expressions
names(sars.cov.fasta) <- names(sars.cov.fasta) %>% str_extract("(?<=\\|)\\w+(?=\\s)")
Note that we have used the pipe operator to pass the result of the names
function to the str_extract
function and we have overwritten the names attribute using the same names
function. Let's see what the names look like now:
names(sars.cov.fasta)
Let's see different ways to extract a sequence from the AAStringSet
object.
sars.cov.fasta[["SPIKE_SARS"]] sars.cov.fasta$NCAP_SARS sars.cov.fasta[[1]]
Check that, when we extract a set of sequences, the result is not an AAString object but an AAStringSet.
sars.cov.fasta[1:3]
Finally, before performing a search with HMMER, it will be necessary to filter out those sequences longer than 5000 aa, since HMMER imposes such a restriction. For this we will use the width
function.
sars.cov.fasta <- sars.cov.fasta[which(width(sars.cov.fasta)<5000)]
Since we want to perform the search with a single sequence (i.e. one search for each sequence of the reference proteome) we will use phmmer
. Internally, HMMER builds a profile from your single query sequence. First of all, we will store in a character vector the databases we are interested in, which are SwissProt, Reference Proteomes and Ensembl.
databases <- c("swissprot", "uniprotrefprot", "ensembl")
By executing the next cell, we will be posting to the HMMER server a request with each sequence and database specified. Also, since we have selected fullseqfasta = TRUE
and alignment = TRUE
we will download the fasta files with the complete sequences found and the aligned domains. The timeout
and verbose
arguments refer to the maximum time to wait without receiving a response and whether to print the status of requests on the screen, respectively. This will take a while and will depend on the state of the HMMER server. Pour yourself a cup of coffee :)
HMMER_data_tbl <- search_phmmer( seqs = sars.cov.fasta, dbs = databases, fullseqfasta = TRUE, verbose = TRUE, timeout = 90, alignment = TRUE)
#saveRDS(HMMER_data_tbl, "HMMER_data_tbl_Sars_CoV.Rds") HMMER_data_tbl <- readRDS(system.file("extdata/HMMER_data_tbl_Sars_CoV.Rds", package = "utilsHMMER"))
Let us inspect the contents of HMMER_data_tbl
. It is a nested DataFrame of class HMMER_data_tbl
.
class(HMMER_data_tbl)
It has as many rows as searches have been done and 8 columns.
dim(HMMER_data_tbl)
Let's see what their classes are in the following table.
HMMER_data_tbl %>% summarize_all(funs(paste0(class(.), collapse = ", "))) %>% kbl() %>% kable_paper("basic", full_width = F)
The seqs
and dbs
columns contain the values used for the search (sequence and target databse). The urls
column contains the URLs where HMMER temporarily stores the results. Feel free to access the web page and make use of the web interface to inspect the results. For now, we have automated manually posting the sequences one by one and downloading the files of interest. The rest of the columns contain the search results themselves. However, we will not yet use the information contained in these.
We may be interested to see the alignment produced by HMMER of the domains it has found. To do this, we can make use of the visualize_MSA_using_ggmsa
function. However, for large numbers of sequences and long sequences it is preferable to use specific software.
visualize_MSA_using_ggmsa( msa = HMMER_data_tbl$alignment$R1A_SARS, start = 1, end = 100, color_scheme = "Clustal", max_positions_per_row = 50, alignment_index = 1:12 )
visualize_MSA_using_ggmsa( msa = HMMER_data_tbl$alignment$R1A_SARS, start = 1, end = 100, color_scheme = "Clustal", max_positions_per_row = 50, alignment_index = 1:12 )
We are going to slightly modify the DataFrame to make it easier to handle later. We will create a variable consisting of the sequence name and the database to identify each row and filter out those results that did not return results (and therefore have an NA in the url
column).
HMMER_data_tbl <- HMMER_data_tbl %>% mutate(names = names(url), id = paste(names, toupper(dbs), sep = "|")) %>% filter(as.character(url) != "NA")
Now, we can use the extract_from_HMMER_data_tbl
function to convert these results into a tidy DataFrame that we can easily work with. Let's see how we could do it only for the results referring to the SPIKE_SARS
protein:
df_SPIKE_SARS <- HMMER_data_tbl %>% filter(names == "SPIKE_SARS")%>% extract_from_HMMER_data_tbl( id_column = .$id)
Now, instead of having a nested DataFrame we have a DataFrame with all the information contained in a single table that has as many rows as domains have been identified (a potentially homologous sequence may contain one or more potentially homologous domains).
However, the extract_from_HMMER_data_tbl
function has other functionality that allows a number of common tasks to be performed directly. We are going to overwrite the df_SPIKE_SARS object, but now applying them. We will keep only sequences with an evalue < 0.01, eliminate redundant sequences, taxonomically annotate the sequences using a local database, and calculate a series of theoretical physical-chemical descriptors based on the primary sequence.
To see the meaning of each one of the variables and all the possibilities of use of the function I invite you to consult the specific documentation of the function with ?extract_from_HMMER_data_tbl
.
df_SPIKE_SARS <- HMMER_data_tbl %>% filter(names == "SPIKE_SARS")%>% extract_from_HMMER_data_tbl( id_column = .$id, evalue = 0.01, remove_redundancy = TRUE, annotate_taxonomic = "local", calculate_physicochemical_properties = TRUE )
Let's see the dimensions that df_SPIKE_SARS has now. We have 119 non-redundant homologous sequences and a total of 127 variables for these.
dim(df_SPIKE_SARS)
Let's take a look at some of these columns. However, for a detailed and complete list, we refer you to ?extract_from_HMMER_data_tbl
.
library(magrittr) tibble( "hits" = colnames(df_SPIKE_SARS) %>% str_subset("hits.") %>% extract(1:18), "domains" = colnames(df_SPIKE_SARS) %>% str_subset("domains.")%>% extract(1:18), "taxa" = colnames(df_SPIKE_SARS) %>% str_subset("taxa.")%>% extract(1:18), "properties" = colnames(df_SPIKE_SARS) %>% str_subset("properties.")%>% extract(1:18), ) %>% kbl() %>% kable_paper() %>% scroll_box(height = "200px")
As you can imagine, variables are grouped with their type because they all start the same, such as taxa.
or hits.
. This will allow us to use the starts_with
function to select them all at once.
We can now write the results in a csv by executing the following cell:
write_csv(df_SPIKE_SARS, "df_SPIKE_SARS.csv")
Note that, although we are going to continue only with the sequences homologous to SPIKE_SARS, we could carry out the same process with homologous sequences of different sequences. This can be useful when we would like to work with the sequences homologous to a series of sequences individually. Let's see an example:
df_Accessory_proteins <- HMMER_data_tbl %>% filter(names %in% c("NS8A_SARS", "NS8B_SARS"))%>% extract_from_HMMER_data_tbl( id_column = .$id)
To explore the different parameters that tell us about the significance of our sequences, we can use the summary
and plot
methods.
summary(df_SPIKE_SARS)
df_SPIKE_SARS %>% summary() %>% kbl() %>% kable_paper("basic", full_width = F)
As we can see, although all the sequences are significant for an evalue < 0.01, some of their domains are not. This is a potential red flag. Weak hits, none of which are good enough on their own, are summing up to lift the sequence up to a high score. This may mean that the sequence contains several weak homologous domains, or it might contains a repetitive sequence that is hitting by chance (i.e. once one repeat hits, all the repeats hit).
We can visualize it more easily using the `plot
method. Let's use the evalue
parameter to draw a line at 0.001 and see how our sequences are distributed for a more restrictive evalue.
plot(df_SPIKE_SARS, evalue = 0.001)
The image shows in green the E-value of the complete sequences and in red the E-value of each of their domains. The best domain and the complete sequence are joined by a black line. A feature of interest is that, since what the method returns is a ggplot, we can customize the resulting image. Thus, we will use plot
when we want to print the default figure and autoplot
when we want to edit it using ggplot2.
Here's an example where we filter out sequences that don't have an associated taxonomic genus and use fct_lump
to group less common taxonomic genus and domain architecture categories into a generic factor "Other". We also use str_wrap
so that the architecture is formatted in paragraphs.
df_SPIKE_SARS %>% filter(!is.na(taxa.genus))%>% mutate( genus = fct_lump(taxa.genus, n = 2), arch = str_wrap(fct_lump(hits.arch, n = 2), 20) ) %>% autoplot(evalue = 0.001) + facet_grid(arch~genus)
In addition, another option implemented is to calculate the percentage of sequence identity between sequences. We will choose a "global" type of pairwise alignment. We will calculate the sequence identity as $100 \cdot \frac{\text{identical pos}}{\text{aligned pos} \cdot \text{internal gap pos}}$ , i.e. "PID1", and, to speed up the process, we will make use of the option to parallelize the calculation.
pairwise_identities <- pairwise_alignment_sequence_identity( seqs = df_SPIKE_SARS$hits.fullseqfasta, aln_type = "global", pid_type = "PID1", allow_parallelization = "multisession" )
#saveRDS(pairwise_identities, "spike_pairwise_identities.RDS") pairwise_identities <- readRDS(system.file( "extdata/spike_pairwise_identities.RDS", package = "utilsHMMER") )
Now, we can see the distribution of the percent identity of the sequences, using the plot method.
plot(pairwise_identities)
In addition, it is possible to create an annotated heatmap by specifying the option `type = "heatmap" and the annotation to be used.
plot(pairwise_identities, type = "heatmap", annotation = df_SPIKE_SARS$taxa.genus)
Now, we are going to show how the physicochemical descriptors calculated can be used by doing a PCA. To do this, we will select the variables beginning with "properties", scale the data and perform the PCA with the prcomp function.
pca_descriptors <- df_SPIKE_SARS %>% select(starts_with("properties")) %>% scale() %>% prcomp()
Let's explore the results. For this we will use the broom
library, which converts R statistical analysis objects into tidy tibble. Let's start by looking at the variance explained by each component. We will extract the matrix containing the eigenvalues, keep the first n components that explain 99% of the variance
pca_descriptors %>% tidy(matrix = "eigenvalues") %>% filter(cumsum(percent) < 0.99)%>% ggplot(aes(PC, percent)) + geom_col(fill = "#56B4E9", alpha = 0.8) + scale_x_continuous(breaks = 1:10)
Now, let's plot the 3 principal components of the PCA. Since we want to be able to color the points based on taxonomic variables from the original set, we will combine both datasets using the augment() function.
pca_descriptors %>% augment(df_SPIKE_SARS) %>% ggplot(aes(.fittedPC1, .fittedPC2, color = taxa.family)) + geom_point(size = 1.5)
Finally, we will see which variables contribute most to explaining each of the components. To do this, we will extract the rotation matrix.
pca_descriptors %>% tidy(matrix = "rotation") %>% pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>% mutate(column = gsub(".*\\.", "", column))%>% ggplot(aes(PC1, PC2)) + geom_segment(xend = 0, yend = 0, arrow = arrow( ends = "first", length = grid::unit(4, "pt") ) ) + geom_text( aes(label = column), hjust = 0.2, nudge_x = -0.05, color = "#904C2F" )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.