The objective of toolkit4pySCA is to provide a set of functions for obtaining a sequence alignment with HMMER and associating taxonomic information with it. It then allows you to read the file generated by pySCA into R and to easily extract and analyze parts of the information.
You can install the development version of toolkit4pySCA from GitHub with:
```r
# install.packages("devtools")
devtools::install_github("currocam/toolkit4pySCA")
```
This is a basic example showing how to perform a typical workflow with the HBG2_HUMAN sequence. First, we load the libraries, define an `AAString` with the sequence of interest, and indicate that, when possible, we want progress bars to appear.
```r
library(toolkit4pySCA)
library(Biostrings)
library(tidyverse)
library(kableExtra)
library(magrittr)
#progressr::handlers(global = TRUE)

# >sp|P69892|HBG2_HUMAN Hemoglobin subunit gamma-2 OS=Homo sapiens OX=9606 GN=HBG2 PE=1 SV=2
example.seq <- AAString(
  paste0("MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPK",
         "VKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFG",
         "KEFTPEVQASWQKMVTGVASALSSRYH")
)
```
First, we will search for sequences homologous to HBG2_HUMAN using phmmer. We will use the `quick_AA_search_using_phmmer()` function to access HMMER programmatically. This function sends a request to HMMER with the sequence of interest and the target database, and returns a data frame with the URLs from which to download the resulting files.
```r
df.phmmer <- quick_AA_search_using_phmmer(
  seqaa = example.seq,
  seqdb = c("swissprot", "pdb")
)
```
Using purrr's map family of functions, we can download directly from R the XML files (containing all the information related to the sequences) and the fullseq-fasta files (containing the unaligned sequences).
```r
xml.files <- df.phmmer %$%
  purrr::map2_chr(seqdb, xml_format, ~ {
    path_xml <- paste0("inst/extdata/HBG2_HUMAN", .x, ".xml")
    read_xml(.y) %>% write_xml(path_xml)
    path_xml
  })

fullseq.fasta.files <- df.phmmer %$%
  purrr::map2_chr(seqdb, fullfasta_format, ~ {
    path_fasta <- paste0("inst/extdata/HBG2_HUMAN", .x, "fullseq.fa")
    readAAStringSet(.y) %>% writeXStringSet(path_fasta)
    path_fasta
  })
```
The next function we will use from the library is `extract_tidy_df_from_hmmer()`. This function extracts the information contained in the XML file and converts it into a tidy data frame, facilitating further preprocessing. To read the fasta files, we can use the `readAAStringSet()` function from Biostrings. We will again use the map family of functions to create a single data frame.
```r
fasta.files <- readAAStringSet(fullseq.fasta.files)
df.HBG2_HUMAN <- xml.files %>%
  purrr::map_dfr(~ xml2::read_xml(.x) %>% extract_tidy_df_from_hmmer, .id = "db")
```
```r
df.HBG2_HUMAN <- toolkit4pySCA::phmmer_HBG2_HUMAN$df
fasta.files <- toolkit4pySCA::phmmer_HBG2_HUMAN$fullfasta
```
Now, we can explore the data obtained using R. For example, let's look at the taxonomic distribution and domain architecture.
```r
df.HBG2_HUMAN %>%
  select(alisqacc, ph, arch) %>%     # We select the phylum and the domain architecture.
  distinct(.keep_all = TRUE) %>%     # We remove repeated rows (produced when ndom > 1).
  ggplot(aes(y = ph, fill = arch)) +
  geom_bar(stat = "count")
```
As part of preprocessing, we will keep only those sequences with a significant e-value (< 0.001) that belong to Chordata and that are not redundant with each other.
```r
# We create a vector with the names of the sequences we have filtered.
acc.filtered <- df.HBG2_HUMAN %>%
  filter(evalue < 0.001) %>%
  filter(ph == "Chordata") %>%
  pull(alisqacc)
# We select the corresponding sequences from the file and remove redundant ones.
fasta.files.filtered <- fasta.files[acc.filtered] %>% unique()
```
Now, we can use the `download_tax_from_ncbi()` function to download the taxonomic information associated with these sequences.
```r
df.HBG2_HUMAN.ann <- df.HBG2_HUMAN %>%
  filter(alisqacc %in% names(fasta.files.filtered)) %>% # We choose non-redundant sequences.
  pull(taxid) %>%
  download_tax_from_ncbi() %>%
  mutate(taxid = as.numeric(taxid)) %>%
  left_join(df.HBG2_HUMAN, by = c("taxid" = "taxid"))

df.HBG2_HUMAN.ann %>%
  select(alisqacc, genus, desc) %>%
  head() %>%
  kbl()
```
The key step in this analysis is obtaining a good alignment of the sequences. For simplicity, however, we will use a ClustalW alignment here.
```r
myClustalWAlignment <- msa::msaClustalW(fasta.files.filtered) %>%
  unmasked()
myClustalWAlignment
```
At this point, we can annotate the sequence headers as we wish. This step is critical because this is the information that pySCA will later use to build the phylogenetic and functional dictionary. We will construct a text string with the taxonomic information we want to use and then write the file.
```r
df.HBG2_HUMAN.ann <- df.HBG2_HUMAN.ann %>%
  distinct(alisqacc, .keep_all = TRUE) %>%
  unite(
    "unite_tax",
    c("superkingdom", "kingdom", "subphylum", "family", "genus"),
    sep = ",", na.rm = TRUE, remove = FALSE
  )
header <- myClustalWAlignment %>%
  names() %>%
  paste(df.HBG2_HUMAN.ann$unite_tax, sep = "|")
header[1:5]

# We write the alignment to a file.
# myClustalWAlignment %>%
#   setNames(header) %>%
#   writeXStringSet("path_to_aln", format = "fasta")
```
Now, it's time to use the pySCA scripts: `scaProcessMSA`, `scaCore`, and `scaSectorID`.
```bash
scaProcessMSA -a HBG2_HUMAN.ClustalW.aln
scaCore -i HBG2_HUMAN.ClustalW.db
scaSectorID -i HBG2_HUMAN.ClustalW.db
```
The resulting file is a .db file: a pickle database with all the results. We can now access them using Python, following the pySCA example notebooks. Alternatively, we can access some of the information using the functions provided by this library. To use them we need pySCA installed, which should not be a problem, since to read the file we would have had to generate it first. If you do not have it installed, use the `force_installing = TRUE` option. In fact, this function can be used to create a virtual environment or a conda environment and automatically install all the required libraries.
```r
pysca <- import_pySCA_module_using_reticulate(
  virtualenv = "r-pySCA",
  force_installing = FALSE
)
db <- read_pySCA_pickle("inst/extdata/HBG2_HUMAN.ClustalW.db")
attributes(db)
```
```r
db <- pySCA_HBG2_HUMAN
```
Now, we can analyze all the information calculated by pySCA using R, as it is contained in the `pySCA_HBG2_HUMAN` object. We access its elements using the `$` operator.
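For example, the nested elements can be pulled out with `$`, a minimal sketch using the fields that appear later in this README (`sequence` and `sector`):

```r
# Alignment statistics stored under db$sequence (used in the glue() example below).
db$sequence$Nseq       # number of sequences after processing
db$sequence$Npos       # number of alignment positions
db$sequence$effseqs    # effective number of sequences, after weighting

# Sector assignments stored under db$sector (used in the PyMOL example below).
db$sector$ics_tidy_df  # tidy data frame of independent components (sectors)
```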
However, there are a number of functions that facilitate the creation of the graphs that appear in the pySCA notebooks. These functions come in pairs: one gathers the information into a data frame suitable for plotting and analysis, and the other produces a default graph. Let's look at some examples.
First, we print out a few statistics describing the alignment:
```r
library(glue)
glue(
  "After processing, the alignment size is {Nseq} sequences and {Npos}",
  " positions. With sequence weights, there are {effseqs} effective sequences",
  Nseq = db$sequence$Nseq,
  Npos = db$sequence$Npos,
  effseqs = db$sequence$effseqs
)
```
To examine alignment composition, we plot a histogram of all pairwise sequence identities.
```r
# data <- pairwise_sequence_identities_data(db)
pairwise_sequence_identities_hist(db)
```
And a global view of the sequence similarity matrix:
```r
# data <- global_similarity_matrix_data(db)
global_similarity_matrix_heatmap(db)
```
We can observe the relationship between sequence similarity and phylogeny by looking at the projection of the sequences onto the first two principal components and independent components, distinguishing which sequence weights were included.
```r
library(ggplot2)
db %>%
  sequence_correlation_matrix_projections_data() %>%
  left_join(df.HBG2_HUMAN.ann %>% mutate(hd = header), by = c("hd" = "hd")) %>%
  ggplot(aes(x = V1, y = V2)) +
  geom_point(aes(colour = .data$class)) +
  ggplot2::labs(x = "V1", y = "V2") +
  ggplot2::facet_grid(.data$type.projections ~ .data$mode, scales = "free")
```
We can also plot the position-specific conservation values or access the SCA correlation matrix.
```r
sca.matrix <- SCA_correlation_matrix_data(db)
position_specific_conservation_plot(db)
```
Particularly useful is the `eigenspectrum_SCA_positional_coevolution_matrix_plot()` function, because it is used to choose the number of significant eigenmodes. It plots the eigenspectrum histogram of the SCA positional coevolution matrix, along with 10 matrix randomization trials for comparison.
```r
eigenspectrum_SCA_positional_coevolution_matrix_plot(db)
```
To access the sectors, we can use the data frame that is created when we read the pickle file. If we wanted to visualize the results in PyMOL, we could use the `write_sectors_for_pymol()` function.
```r
sectors <- db$sector$ics_tidy_df
sectors %>%
  group_by(ics) %>%
  summarise(positions = n())

write_sectors_for_pymol(db,
  fetch_pdb = "1FDH",
  output_file = "inst/extdata/1FDH_sectors.pml"
)
```