The objective of toolkit4pySCA is to provide a set of functions that allow to obtain a sequence alignment using HMMER and to associate taxonomic information to it. Then, it allows to read the file generated by pySCA in R and to easily extract and analyze part of the information.
You can install the development version of toolkit4pySCA from GitHub with:
# install.packages("devtools")
devtools::install_github("currocam/toolkit4pySCA")
This is a basic example showing how to perform a typical workflow with the HBG2_HUMAN sequence. First of all, we load the libraries, define an AAString with the sequence of interest and indicate that, when possible, we want progress bars to appear.
library(toolkit4pySCA)
library(Biostrings)
library(tidyverse)
library(kableExtra)
library(magrittr)
#progressr::handlers(global = TRUE)
# >sp|P69892|HBG2_HUMAN Hemoglobin subunit gamma-2 OS=Homo sapiens OX=9606 GN=HBG2 PE=1 SV=2
example.seq <- AAString(
paste0("MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPK",
"VKAHGKKVLTSLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFG",
"KEFTPEVQASWQKMVTGVASALSSRYH"
))
First, we will perform a search for sequences homologous to HBG2_HUMAN
using phmmer. We will use the quick_AA_search_using_phmmer
function to
programmatically access hmmer. This function sends a request to HMMER
with the sequence of interest and the target database and returns a
DataFrame with the urls where to download the resulting files.
df.phmmer <- quick_AA_search_using_phmmer(
seqaa = example.seq,
seqdb = c("swissprot", "pdb"))
Using purrr’s map family of functions, we can download directly from R the XML files (containing all the information related to the sequences) and the fullseq-fasta files (containing the unaligned sequences).
xml.files <-df.phmmer %$%
purrr::map2_chr(seqdb, xml_format, ~{
path_xml <- paste0("inst/extdata/HBG2_HUMAN", .x, ".xml")
read_xml(.y) %>% write_xml(path_xml)
path_xml
})
fullseq.fasta.files <-df.phmmer %$%
purrr::map2_chr(seqdb, fullfasta_format, ~{
path_fasta <- paste0("inst/extdata/HBG2_HUMAN", .x, "fullseq.fa")
readAAStringSet(.y) %>% writeXStringSet(path_fasta)
path_fasta
})
The next function we are going to use from the library is the
extract_tidy_df_from_hmmer
function. This function extracts the
information contained in the xml file and converts it into a sorted
DataFrame, thus facilitating further preprocessing. To read the fasta
files, we can make use of the readAAStringSet
function of Biostrings.
We will, again, make use of the map family of functions to create a
single DataFrame.
fasta.files <- readAAStringSet(fullseq.fasta.files)
df.HBG2_HUMAN <- xml.files %>%
purrr::map_dfr(
~ xml2::read_xml(.x) %>%
extract_tidy_df_from_hmmer,
.id = "db")
Now, we can explore the data obtained using R. For example, let’s look at the taxonomic distribution and domain architecture.
df.HBG2_HUMAN %>%
select(alisqacc, ph, arch) %>% #We select the phylo and the domain architecture.
distinct(.keep_all = TRUE) %>% #We remove repeated rows (produced when ndom >1).
ggplot(aes(y=ph, fill = arch)) +
geom_bar(stat = "count")
As well as preprocessing them, we will filter out those sequences with a significant evalue of \< 0.001, which belong to Chordata and whose sequences are not redundant with each other.
#We create a vector with the names of the sequences we have filtered.
acc.filtered <- df.HBG2_HUMAN %>%
filter(evalue < 0.001) %>%
filter(ph == "Chordata") %>%
pull(alisqacc)
#We select the sequences of the file that correspond and make sure that they are redundant.
fasta.files.filtered <- fasta.files[acc.filtered] %>%
unique()
Now, we can use the download_tax_from_ncbi()
function to download the
taxonomic information associated with these sequences.
df.HBG2_HUMAN.ann <- df.HBG2_HUMAN %>%
filter(alisqacc %in% names(fasta.files.filtered)) %>%# We choose non-redundant seq
pull(taxid) %>%
download_tax_from_ncbi() %>%
mutate(taxid = as.numeric(taxid))%>%
left_join(df.HBG2_HUMAN, by = c("taxid" = "taxid"))
#> Registered S3 method overwritten by 'hoardr':
#> method from
#> print.cache_info httr
df.HBG2_HUMAN.ann %>%
select(alisqacc, genus, desc) %>%
head() %>% kbl()
alisqacc
genus
desc
HBB_MELAE
Melanogrammus
Hemoglobin subunit beta
HBA3_XENTR
Xenopus
Hemoglobin subunit alpha-3
HBAT_PAPAN
Papio
Hemoglobin subunit theta-1
HBB_PAPAN
Papio
Hemoglobin subunit beta
HBBB_SERQU
Seriola
Hemoglobin subunit beta-B
HBAA_SERQU
Seriola
Hemoglobin subunit alpha-A
The key step in this analysis is to obtain a good alignment with the sequences. However, for simplicity, we will continue to use ClustalWAlignment.
myClustalWAlignment <- msa::msaClustalW(fasta.files.filtered) %>%
unmasked()
#> use default substitution matrix
myClustalWAlignment
#> AAStringSet object of length 194:
#> width seq names
#> [1] 315 ------------------MGLS...---------------------- MYG_ANAPO
#> [2] 315 ------------------MGLS...---------------------- MYG_STRCA
#> [3] 315 ------------------MGLS...---------------------- MYG_CARCR
#> [4] 315 ------------------MGLS...---------------------- MYG_VARVA
#> [5] 315 ------------------MVLS...---------------------- 4pqb_A
#> ... ... ...
#> [190] 315 ----------------------...---------------------- 5eys_A
#> [191] 315 ----------------------...---------------------- 5f0b_A
#> [192] 315 --------------------ME...---------------------- 4mpm_A
#> [193] 315 -----------------MEKLS...---------------------- NGB_DISMA
#> [194] 315 -----------------MEKLS...---------------------- NGB_DANRE
#We write the alignment in a file.
At this point, we can annotate as we wish the sequence headers. This point is critical because it is the information that pySCA will use later to build the phylogenetic and functional dictionary. We are going to construct a text string with the taxonomic information we want to use and then write the file.
df.HBG2_HUMAN.ann <- df.HBG2_HUMAN.ann %>%
distinct(alisqacc, .keep_all = TRUE)%>%
unite(
"unite_tax", c("superkingdom","kingdom","subphylum", "family", "genus"),
sep = ",", na.rm = TRUE, remove = FALSE
)
header <- myClustalWAlignment %>%
names() %>%
paste(df.HBG2_HUMAN.ann$unite_tax, sep = "|")
header[1:5]
#> [1] "MYG_ANAPO|Eukaryota,Metazoa,Craniata,Gadidae,Melanogrammus"
#> [2] "MYG_STRCA|Eukaryota,Metazoa,Craniata,Pipidae,Xenopus"
#> [3] "MYG_CARCR|Eukaryota,Metazoa,Craniata,Cercopithecidae,Papio"
#> [4] "MYG_VARVA|Eukaryota,Metazoa,Craniata,Cercopithecidae,Papio"
#> [5] "4pqb_A|Eukaryota,Metazoa,Craniata,Carangidae,Seriola"
# myClustalWAlignment %>%
# setNames(header) %>%
# writeXStringSet("path_to_aln", format = "fasta")
Now, it’s time to use the pySCA scripts, scaProcessMSA
, scaCore
, and
scaSectorID
.
scaProcessMSA -a HBG2_HUMAN.ClustalW.aln
scaCore -i HBG2_HUMAN.ClustalW.db
scaSectorID -i HBG2_HUMAN.ClustalW.db
The resulting file is a .db file, and it is the pickle database with all the results. We can now access them using Python and following the pySCA example notebooks. However, we can access some of the information using the following functions provided by the library. To use them we need to have pySCA installed, which should not be a problem, since to read the file we would have to generate it first. If you do not have it installed, use the force = TRUE option. In fact, this command can be used to create a virtual environment or a conda environment and automatically install all the libraries.
pysca <- import_pySCA_module_using_reticulate(
virtualenv = "r-pySCA", force_installing = FALSE)
db <- read_pySCA_pickle("inst/extdata/HBG2_HUMAN.ClustalW.db")
attributes(db)
Now, we can analyze all the information calculated by pySCA using R, as
it is contained in the pySCA_HBG2_HUMAN
file. To access it we use the
$
operators.
However, there are a number of functions that facilitate the creation of the graphs that appear in pySCA notebooks. These functions are grouped 2 by 2, one to gather the information into a data frame suitable for plot and analysis and the other to make a default graph. Let’s look at some examples.
First, we print out a few statistics describing the alignment:
library(glue)
#>
#> Attaching package: 'glue'
#> The following object is masked from 'package:IRanges':
#>
#> trim
glue("After processing, the alignment size is {Nseq} sequences and {Npos}",
" positions. With sequence weights, there are {effseqs} effective sequences",
Nseq = db$sequence$Nseq, Npos = db$sequence$Npos, effseqs = db$sequence$effseqs)
#> After processing, the alignment size is 175 sequences and 139 positions. With sequence weights, there are 54.4160788041247 effective sequences
To examine alignment composition, we plot a histogram of all pairwise sequence identities.
#data <- pairwise_sequence_identities_data(db)
pairwise_sequence_identities_hist(db)
And a global view of the sequence similarity matrix.
#data <- global_similarity_matrix_data(db)
global_similarity_matrix_heatmap(db)
We can observe the relationship between sequence similarity and phylogeny by looking at the projection of the sequences onto the first 2 PCA and IC, distinguishing which weights were included.
library(ggplot2)
db %>%
sequence_correlation_matrix_projections_data() %>%
left_join(df.HBG2_HUMAN.ann %>%
mutate(hd = header), by = c("hd" = "hd"))%>%
ggplot(aes(x=V1, y=V2)) +
geom_point(aes(colour = .data$class)) +
ggplot2::labs(x = "V1", y = "V2")+
ggplot2::facet_grid(.data$type.projections ~ .data$mode, scales = "free")
We can also plot the position-specific conservation values or access the SCA correlation matrix.
sca.matrix <- SCA_correlation_matrix_data(db)
position_specific_conservation_plot(db)
Particularly useful is the function
eigenspectrum_SCA_positional_coevolution_matrix_plot
, because it is
used to choose the number of significant eigenmodes. That function plots
the eigenspectrum histogram of the SCA positional coevolution matrix and
10 matrix randomization trials for comparison.
eigenspectrum_SCA_positional_coevolution_matrix_plot(db)
To
access the sectors we can make use of the DataFrame that is created when
we read the pickle file. If we wanted to visualize the results in pyMol,
we could make use of the write_sectors_for_pymol
function.
sectors <- db$sector$ics_tidy_df
sectors %>%
group_by(ics) %>%
summarise(positions = n())
#> # A tibble: 6 × 2
#> ics positions
#> <chr> <int>
#> 1 1 11
#> 2 2 17
#> 3 3 13
#> 4 4 17
#> 5 5 13
#> 6 6 8
write_sectors_for_pymol(db, fetch_pdb = "1FDH",output_file = "inst/extdata/1FDH_sectors.pml")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.