knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(agvgd)
PIK3CA is a frequently mutated oncogene in human cancers, with mutations found in various cancer types, including endometrial, ovarian, colorectal, breast, cervical, squamous cell cancer of the head and neck, chondroma, and thyroid carcinoma [@german_2013]. The PIK3CA gene encodes the p110α catalytic subunit of phosphatidylinositol 3-kinase [@ligresti_2009].
Somatic missense mutations in PIK3CA have been reported in many human cancer types, and are predominantly found in the kinase and helical domains of the PIK3CA subunit [@karakas_2006].
According to karakas_2006 at al, some of the most common PIK3CA mutations associated with cancer include:
PIK3CA mutations are more common in tumors with mutant PTEN genes than in tumors with wild-type PTEN genes [@samuels_2010]. The discovery of oncogenic mutations in PIK3CA has emphasized the important role of PI3K in cancer pathogenesis and made it possible to quickly and easily identify tumors with activation of PI3K signaling by virtue of mutations in PIK3CA [@samuels_2010].
In this vignette we use {agvgd}
to classify the pathogenecity of the
missence mutations/variations reported in Correia and Magno et al
(2022) [@correia_magno_2022]. Here they report several mutations found in breast cancer, from which we selected the
17 variants found in the metabric dataset. These will be used to showcase how the usage of the {agvgd}
package can complement the findings from other biomedical data analyses.
knitr::include_graphics('../man/figures/correia_magno2022_fig02.png')
To use the {agvgd}
package to predict the protein impact of the metabric variants reported in Correia and Magno et al [@correia_magno_2022], we need two data sets:
These first dataset is readily available with the {agvgd}
package, and the protein alignment is provide within the companion package {protean}
. More information regarding the {protean}
package can be found in its GitHub repository.
{protean}
To retrieve the alignment of PIK3CA from the {protean}
package, we will use the functions read_profile()
and profile_path()
using the gene name "PIK3CA"
:
# Load the required packages library(protean) library(dplyr) # Using protean # Check all oncoKB proteins that have alignment files in the protean package head(protean::oncokb_genes) # Confirm that PIK3CA is present grep("PIK3CA", protean::oncokb_genes, value = TRUE) # Get the alignment profile information for PIK3CA pik3ca_profile <- protean::read_profile(protean::profile_path("PIK3CA")) head(pik3ca_profile)
{agvgd}
Next, the sequence profile must be converted to an alignment format. For this we must use the profile_to_alignment()
function from the {agvgd}
package.
# Using agvgd # Convert the protean sequence profile to an alignment pik3ca_alignment <- agvgd::profile_to_alignment(pik3ca_profile) # Check the first lines and columns of the alignment pik3ca_alignment[1:15, 1:30]
The substitutions file should have one substitution per line, and the substitutions should be formatted in the following way:
H1047Y where H (Histidine) is the original aminoacid, 1047 is the protein position for this aminoacid, and Y (Tyrosine) is the new aminoacid originated by the missence mutation.
To read in the 17 PIK3CA missense variants, we use the function read_substitutions()
from the {agvgd}
package:
# Read the variants file path <- system.file("extdata/correia_magno2022.txt", package = 'agvgd', mustWork = TRUE) missense_variants <- agvgd::read_substitutions(path) missense_variants
The list of missense variants comes with the indication of the residue number in the protein sequence, i.e. column res
in missense_variants
. As we'll see next, the function agvgd()
uses the alignment positions (column poi
in missense_variants
) instead to refer to those positions in the alignment.
The difference between res
and poi
(position of interest) is that res
counts the residues in the protein primary sequence reference, while poi
refers to the positions in the alignment, accounting for gaps.
So, before we proceed, we will update the missense_variants
tibble and
replace the NA
values with the corresponding poi
values. For that we
use the function res_to_poi()
from {agvgd}
:
missense_variants$poi <- agvgd::res_to_poi(pik3ca_alignment, missense_variants$res) head(missense_variants)
Run Align-GVGD algorithm with the function agvgd()
:
scores <- agvgd::agvgd(alignment = pik3ca_alignment, poi = missense_variants$poi, sub = missense_variants$sub) print(scores, n = Inf)
These results show that using all the sequences present in the ENSEMBL orthologous database will not provide insightful information for agvgd, since this method is very sensitive to the alignment. Here, it has classified all the variants with a score of C0, meaning that none of the changes of interest are clinically relevant.\ This is because using sequences that are so distantly related in the tree of life it is no longer reasonable to assume that those distant orthologous proteins share the same functions as the human protein of interest. Accordingly, the alignment must be curated to include a reasonable amount of sequence variation, but keeping only a subset of the proteins that are still evolutionarily close enough, so that they still share the same function, and therefore the same sequence constraints.
A first round of curation can be done automatically by removing sequences so distant that they are over 90% different from the protein of interest (bidirectionally, i.e., between human and the orthologous, and between the orthologous and the human).
# Remove distant orthologs and redundant seqs pik3ca_profile |> dplyr::filter(human_ortho_perc_id >= 90 & ortho_human_perc_id >= 90) -> profile2 # Format the profile to an alignment alignment2 <- agvgd::profile_to_alignment(profile2) # Calculate the scores for the filtered alignment scores2 <- agvgd::agvgd(alignment = alignment2, poi = missense_variants$poi, sub = missense_variants$sub) print(scores2, n = Inf)
After removal of distantly related proteins, the results show enough variability to allow the classification of variants that are know to be clinically relevant (for example, E545K with C65 score), showing the relevance of this step for the accuracy of the algorithm.
Additional rounds of curation can be done manually, by visually inspecting the alignment (for example using Jalview), and re-running the agvgd()
score calculation function using the manually curated alignments as input.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.