In currocam/taxa2hmmer: Get more out of homologous sequence search with HMMER

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Prerequisites

To follow this tutorial you will need to have the following libraries installed. Please feel free to run the following cell to install them if they are not yet installed.

if (!require(Biostrings)) install.packages('Biostrings')
if (!require(ggplot2)) install.packages('tidyverse')
if (!require(tidyverse)) install.packages('tidyverse')
if (!require(broom)) install.packages('broom')

library(utilsHMMER)
library(Biostrings)
library(tidyverse)
library(ggplot2)
library(broom)

Getting started

The objective of this article is to show the main functionalities of the utilsHMMER library using a practical case. We will search for homologous sequences for the whole SARS-CoV-2 reference proteome.

Post request to HMMER server

Download the SARS-CoV-2 reference proteome

First, let's download the reference proteome for the Sars_CoV-2. To do this, we will read the compressed file directly from UniProt with the readAAStringSet function from the Biostrings library.

url <- paste0("https://ftp.uniprot.org/pub/databases/uniprot/current_release/",
  "knowledgebase/reference_proteomes/Viruses/UP000000354/UP000000354_694009.fasta.gz")
sars.cov.fasta <- readAAStringSet(url)

Now, sars.cov.fasta is an object of class AAStringSet. We can inspect the object by printing it. So we can see that there are 15 protein sequences, the length of each sequence (parameter width), the beginning and end of the sequence (parameter seq) and the name of each (parameternames`).

sars.cov.fasta

Let's look at the headers of the different fasta sequences:

names(sars.cov.fasta)

library(kableExtra)
tibble("names" = names(sars.cov.fasta)) %>%
  kbl()%>%
  kable_paper("basic", full_width = F)

We can access each of the individual sequences using the $ or [ operator. However, right now the names identifying the sequences are too long and somewhat inconvenient to work with. We can solve this by applying egular expressions

names(sars.cov.fasta) <- names(sars.cov.fasta) %>%
  str_extract("(?<=\\|)\\w+(?=\\s)")

Note that we have used the pipe operator to pass the result of the names function to the str_extract function and we have overwritten the names attribute using the same names function. Let's see what the names look like now:

names(sars.cov.fasta)

Let's see different ways to extract a sequence from the AAStringSet object.

sars.cov.fasta[["SPIKE_SARS"]]
sars.cov.fasta$NCAP_SARS
sars.cov.fasta[[1]]

Check that, when we extract a set of sequences, the result is not an AAString object but an AAStringSet.

sars.cov.fasta[1:3]

Finally, before performing a search with HMMER, it will be necessary to filter out those sequences longer than 5000 aa, since HMMER imposes such a restriction. For this we will use the width function.

sars.cov.fasta <- sars.cov.fasta[which(width(sars.cov.fasta)<5000)]

Query's sequence

Since we want to perform the search with a single sequence (i.e. one search for each sequence of the reference proteome) we will use phmmer. Internally, HMMER builds a profile from your single query sequence. First of all, we will store in a character vector the databases we are interested in, which are SwissProt, Reference Proteomes and Ensembl.

databases <- c("swissprot", "uniprotrefprot", "ensembl")

By executing the next cell, we will be posting to the HMMER server a request with each sequence and database specified. Also, since we have selected fullseqfasta = TRUE and alignment = TRUE we will download the fasta files with the complete sequences found and the aligned domains. The timeout and verbose arguments refer to the maximum time to wait without receiving a response and whether to print the status of requests on the screen, respectively. This will take a while and will depend on the state of the HMMER server. Pour yourself a cup of coffee :)

HMMER_data_tbl <- search_phmmer(
  seqs = sars.cov.fasta,
  dbs = databases,
  fullseqfasta = TRUE,
  verbose = TRUE,
  timeout = 90,
  alignment = TRUE)

#saveRDS(HMMER_data_tbl, "HMMER_data_tbl_Sars_CoV.Rds")
HMMER_data_tbl <- readRDS(system.file("extdata/HMMER_data_tbl_Sars_CoV.Rds", package = "utilsHMMER"))

Let us inspect the contents of HMMER_data_tbl. It is a nested DataFrame of class HMMER_data_tbl.

class(HMMER_data_tbl)

It has as many rows as searches have been done and 8 columns.

dim(HMMER_data_tbl)

Let's see what their classes are in the following table.

HMMER_data_tbl %>%
  summarize_all(funs(paste0(class(.), collapse = ", "))) %>%
  kbl() %>%
  kable_paper("basic", full_width = F)

The seqs and dbs columns contain the values used for the search (sequence and target databse). The urls column contains the URLs where HMMER temporarily stores the results. Feel free to access the web page and make use of the web interface to inspect the results. For now, we have automated manually posting the sequences one by one and downloading the files of interest. The rest of the columns contain the search results themselves. However, we will not yet use the information contained in these.

Visualize MSA

We may be interested to see the alignment produced by HMMER of the domains it has found. To do this, we can make use of the visualize_MSA_using_ggmsa function. However, for large numbers of sequences and long sequences it is preferable to use specific software.

visualize_MSA_using_ggmsa(
  msa = HMMER_data_tbl$alignment$R1A_SARS,
  start = 1,
  end = 100,
  color_scheme = "Clustal",
  max_positions_per_row = 50,
  alignment_index = 1:12
  )

visualize_MSA_using_ggmsa(
  msa = HMMER_data_tbl$alignment$R1A_SARS,
  start = 1,
  end = 100,
  color_scheme = "Clustal",
  max_positions_per_row = 50,
  alignment_index = 1:12
  )

Preprocessing and wrangling

We are going to slightly modify the DataFrame to make it easier to handle later. We will create a variable consisting of the sequence name and the database to identify each row and filter out those results that did not return results (and therefore have an NA in the url column).

HMMER_data_tbl <- HMMER_data_tbl  %>%
  mutate(names = names(url),
         id = paste(names, toupper(dbs), sep = "|")) %>%
  filter(as.character(url) != "NA")

Extracting a Tidy Dataframe

Now, we can use the extract_from_HMMER_data_tbl function to convert these results into a tidy DataFrame that we can easily work with. Let's see how we could do it only for the results referring to the SPIKE_SARS protein:

df_SPIKE_SARS <- HMMER_data_tbl %>%
  filter(names == "SPIKE_SARS")%>%
  extract_from_HMMER_data_tbl(
    id_column = .$id)

Now, instead of having a nested DataFrame we have a DataFrame with all the information contained in a single table that has as many rows as domains have been identified (a potentially homologous sequence may contain one or more potentially homologous domains).

However, the extract_from_HMMER_data_tbl function has other functionality that allows a number of common tasks to be performed directly. We are going to overwrite the df_SPIKE_SARS object, but now applying them. We will keep only sequences with an evalue < 0.01, eliminate redundant sequences, taxonomically annotate the sequences using a local database, and calculate a series of theoretical physical-chemical descriptors based on the primary sequence.

To see the meaning of each one of the variables and all the possibilities of use of the function I invite you to consult the specific documentation of the function with ?extract_from_HMMER_data_tbl.

df_SPIKE_SARS <- HMMER_data_tbl %>%
  filter(names == "SPIKE_SARS")%>%
  extract_from_HMMER_data_tbl(
    id_column = .$id,
    evalue = 0.01,
    remove_redundancy = TRUE,
    annotate_taxonomic = "local",
    calculate_physicochemical_properties = TRUE
    )

Let's see the dimensions that df_SPIKE_SARS has now. We have 119 non-redundant homologous sequences and a total of 127 variables for these.

dim(df_SPIKE_SARS)

Let's take a look at some of these columns. However, for a detailed and complete list, we refer you to ?extract_from_HMMER_data_tbl.

library(magrittr)
tibble(
  "hits" = colnames(df_SPIKE_SARS) %>%
    str_subset("hits.") %>%
    extract(1:18),
  "domains" = colnames(df_SPIKE_SARS) %>%
    str_subset("domains.")%>%
    extract(1:18),
  "taxa" = colnames(df_SPIKE_SARS) %>%
    str_subset("taxa.")%>%
    extract(1:18),
  "properties" = colnames(df_SPIKE_SARS) %>%
    str_subset("properties.")%>%
    extract(1:18),
  ) %>%
    kbl() %>%
  kable_paper() %>%
  scroll_box(height = "200px")

As you can imagine, variables are grouped with their type because they all start the same, such as taxa. or hits.. This will allow us to use the starts_with function to select them all at once.

We can now write the results in a csv by executing the following cell:

write_csv(df_SPIKE_SARS, "df_SPIKE_SARS.csv")

Note that, although we are going to continue only with the sequences homologous to SPIKE_SARS, we could carry out the same process with homologous sequences of different sequences. This can be useful when we would like to work with the sequences homologous to a series of sequences individually. Let's see an example:

df_Accessory_proteins <- HMMER_data_tbl %>%
  filter(names %in% c("NS8A_SARS", "NS8B_SARS"))%>%
  extract_from_HMMER_data_tbl(
    id_column = .$id)

Data exploration

Significant sequences

To explore the different parameters that tell us about the significance of our sequences, we can use the summary and plot methods.

summary(df_SPIKE_SARS)

df_SPIKE_SARS %>%
  summary() %>%
  kbl() %>%
  kable_paper("basic", full_width = F)

As we can see, although all the sequences are significant for an evalue < 0.01, some of their domains are not. This is a potential red flag. Weak hits, none of which are good enough on their own, are summing up to lift the sequence up to a high score. This may mean that the sequence contains several weak homologous domains, or it might contains a repetitive sequence that is hitting by chance (i.e. once one repeat hits, all the repeats hit).

We can visualize it more easily using the `plot method. Let's use the evalue parameter to draw a line at 0.001 and see how our sequences are distributed for a more restrictive evalue.

plot(df_SPIKE_SARS, evalue = 0.001)

The image shows in green the E-value of the complete sequences and in red the E-value of each of their domains. The best domain and the complete sequence are joined by a black line. A feature of interest is that, since what the method returns is a ggplot, we can customize the resulting image. Thus, we will use plot when we want to print the default figure and autoplot when we want to edit it using ggplot2.

Here's an example where we filter out sequences that don't have an associated taxonomic genus and use fct_lump to group less common taxonomic genus and domain architecture categories into a generic factor "Other". We also use str_wrap so that the architecture is formatted in paragraphs.

df_SPIKE_SARS %>%
  filter(!is.na(taxa.genus))%>%
  mutate(
    genus = fct_lump(taxa.genus, n = 2),
    arch =  str_wrap(fct_lump(hits.arch, n = 2), 20)
  ) %>%
  autoplot(evalue = 0.001) + 
  facet_grid(arch~genus)

Percentage of sequence identity

In addition, another option implemented is to calculate the percentage of sequence identity between sequences. We will choose a "global" type of pairwise alignment. We will calculate the sequence identity as $100 \cdot \frac{\text{identical pos}}{\text{aligned pos} \cdot \text{internal gap pos}}$ , i.e. "PID1", and, to speed up the process, we will make use of the option to parallelize the calculation.

pairwise_identities <- pairwise_alignment_sequence_identity(
     seqs = df_SPIKE_SARS$hits.fullseqfasta,
     aln_type = "global",
     pid_type = "PID1",
     allow_parallelization = "multisession"
 )

#saveRDS(pairwise_identities, "spike_pairwise_identities.RDS")
pairwise_identities <- readRDS(system.file(
  "extdata/spike_pairwise_identities.RDS",
  package = "utilsHMMER")
  )

Now, we can see the distribution of the percent identity of the sequences, using the plot method.

plot(pairwise_identities)

In addition, it is possible to create an annotated heatmap by specifying the option `type = "heatmap" and the annotation to be used.

plot(pairwise_identities, type = "heatmap",
     annotation = df_SPIKE_SARS$taxa.genus)

Principal Components Analysis using

Now, we are going to show how the physicochemical descriptors calculated can be used by doing a PCA. To do this, we will select the variables beginning with "properties", scale the data and perform the PCA with the prcomp function.

pca_descriptors <- df_SPIKE_SARS %>% 
  select(starts_with("properties")) %>%
  scale() %>%
  prcomp()

Let's explore the results. For this we will use the broom library, which converts R statistical analysis objects into tidy tibble. Let's start by looking at the variance explained by each component. We will extract the matrix containing the eigenvalues, keep the first n components that explain 99% of the variance

pca_descriptors %>%
  tidy(matrix = "eigenvalues") %>%
  filter(cumsum(percent) < 0.99)%>%
  ggplot(aes(PC, percent)) +
  geom_col(fill = "#56B4E9", alpha = 0.8) +
  scale_x_continuous(breaks = 1:10)

Now, let's plot the 3 principal components of the PCA. Since we want to be able to color the points based on taxonomic variables from the original set, we will combine both datasets using the augment() function.

pca_descriptors %>%
  augment(df_SPIKE_SARS) %>%
  ggplot(aes(.fittedPC1, .fittedPC2, color = taxa.family)) + 
  geom_point(size = 1.5)

Finally, we will see which variables contribute most to explaining each of the components. To do this, we will extract the rotation matrix.

pca_descriptors %>%
  tidy(matrix = "rotation") %>%
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") %>%
  mutate(column = gsub(".*\\.", "", column))%>%
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow(
    ends = "first",
    length = grid::unit(4, "pt")
  )
               ) +
  geom_text(
    aes(label = column),
    hjust = 0.2, nudge_x = -0.05, 
    color = "#904C2F"
  )

currocam/taxa2hmmer documentation built on April 10, 2022, 11:02 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

currocam/taxa2hmmer
Get more out of homologous sequence search with HMMER

In currocam/taxa2hmmer: Get more out of homologous sequence search with HMMER

Prerequisites

Getting started

Post request to HMMER server

Download the SARS-CoV-2 reference proteome

Query's sequence

Visualize MSA

Preprocessing and wrangling

Extracting a Tidy Dataframe

Data exploration

Significant sequences

Percentage of sequence identity

Principal Components Analysis using

R Package Documentation

Browse R Packages

We want your feedback!

currocam/taxa2hmmer Get more out of homologous sequence search with HMMER

In currocam/taxa2hmmer: Get more out of homologous sequence search with HMMER

Prerequisites

Getting started

Post request to HMMER server

Download the SARS-CoV-2 reference proteome

Query's sequence

Visualize MSA

Preprocessing and wrangling

Extracting a Tidy Dataframe

Data exploration

Significant sequences

Percentage of sequence identity

Principal Components Analysis using

R Package Documentation

Browse R Packages

We want your feedback!

currocam/taxa2hmmer
Get more out of homologous sequence search with HMMER