knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) suppressPackageStartupMessages(library(predatoR))
In this section, we show you how to use predatoR
package for mutation impact prediction.
predatoR
package makes prediction about impact of a mutation based on an Adaboost model and classifies the mutations as Pathogenic or Neutral. Model was created by using Varibench and ClinVar datasets. Missense3D dataset also used for method comparison.
You can install the predatoR via devtools:
``` {r installation_guide, eval=FALSE} library(devtools) install_github("berkgurdamar/predatoR")
### Workflow <img src="./predatoR_workflow.png" style="max-width:100%;" /> `predatoR()` function is the wrapper function of the `predatoR` package. `predatoR()` gets a data.frame structure as an input. Input should contain 5 mandatory and 1 optional columns. Mandatory columns are __'PDB_ID'__, __'Chain'__, __'Position'__, __'Orig_AA'__ and __'Mut_AA'__. __'Gene_Name'__ column is optional. Predictions can be made by using 2 different models, 5 Angstrom (Å)-all atoms model and 7Å-carbon alpha (Cα) atoms only model. `predatoR()` function works on each PDB ID respectively. For each PDB ID; - Downloads/Reads PDB file - Calculates distance between each atom in the structure - Creates a network from PDB - Calculates Degree Centrality Z-Score of each atom - Calculates Eigen Centrality Z-Score of each atom - Calculates Shortest Path Z-Score of each atom - Calculates Betweenness Z-Score of each atom - Calculates Clique Z-Score of each atom - Calculates PageRank Z-Score of each atom - Gets [gnomAD](https://gnomad.broadinstitute.org/) Synonymous Z-Score, Missense Z-Score, and pLI Score - Gets BLOSUM62 score of the mutation - Finds the number of [KEGG Pathways](https://www.genome.jp/kegg/) which contains the input gene - Gets [Genic Intolerance](http://genic-intolerance.org/) Score - Finds the number of GO terms associated with the input gene - Finds the number of diseases associated with the input gene from [DisGeNET](https://www.disgenet.org/) - Gets the Gene Essentiality Score from Online Gene Essentiality ([OGEE](https://v3.ogee.info/#/home)) Database - Calculates the median gene expression value of the input gene from [GTEx](https://gtexportal.org/home/) - Calculates 6 different features from Accessible Surface Area and Hydrophobicity Scale of reference and mutant amino acids - Makes prediction If `predatoR()` couldn't find any information of any gene such as 'Genic Intolerance Score', those inputs will be removed from the query before the prediction step. ### Arguments `predatoR()` function has 6 arguments; - __info_df:__ Input data.frame - __PDB_path:__ PDB file path if the user wants to use PDB structures stored locally (default = NULL) - __n_threads:__ Number of threads (default = NULL) - __gene_name_info:__ Whether there is gene name information in the input or not (default = TRUE) - __distance_cutoff:__ distance cutoff for setting edges (default = 5) - __network_approach:__ network building approach; "all" (default) for using all atoms only or "ca" for using ca atoms only When running the `predatoR()` function, if gene name is included in the input, `gene_name_info` should be set as `TRUE`. If there is no gene name information in the input (`gene_name_info = FALSE`), `predatoR()` function gets the related gene names from [Ensembl](https://www.ensembl.org/) and if there are multiple genes annotated for the PDB ID, choose the gene having the higher gnomAD metrics. `predatoR()` can make predictions using 7Å or 5Å distance cutoffs. For exploratory purposes, different cutoffs can be used and network properties can be calculated. Example input with the __'Gene_Name'__ information included: ```r test_data <- as.data.frame(rbind(c("3SQJ", "A", 196, "GLN", "LEU", "ALB"), c("3SQJ", "A", 396, "GLU", "LYS", "ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") knitr::kable(test_data, align = c("c", "c", "c", "c", "c", "c"))
library(predatoR) prediction_result <- predatoR(test_data, gene_name_info = TRUE)
Example input with the 'Gene_Name' information does not included:
test_data <- as.data.frame(rbind(c("3SQJ", "A", 196, "GLN", "LEU"), c("3SQJ", "A", 396, "GLU", "LYS"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA") knitr::kable(test_data, align = c("c", "c", "c", "c", "c"))
prediction_result <- predatoR(test_data, gene_name_info = FALSE)
predatoR()
function can also work with no gene name inputs. If a mutation has no gene name
information, it gets the related gene names from Ensembl and if there are multiple
genes annotated for the PDB ID, gets the gene which has maximum GnomAD scores (gene_name_info
should set as TRUE
).
test_data <- as.data.frame(rbind(c("3SQJ", "A", 196, "GLN", "LEU", "ALB"), c("3SQJ", "A", 396, "GLU", "LYS", "ALB"), c("1A4Y", "A", 169, "PRO", "LEU", ""))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") knitr::kable(test_data, align = c("c", "c", "c", "c", "c", "c"))
prediction_result <- predatoR(test_data, gene_name_info = TRUE)
predatoR()
function uses max_thread_number - 1
as a default when n_threads = NULL
. n_threads
can be set
by user.
prediction_result <- predatoR(test_data, n_threads = 8) # 8 threads will be used
predatoR()
function downloads each PDB by default (PDB_path = NULL
) or can use PDB files stored
locally when PDB_path
specified. It will automatically read and process the input PDB file from the
path.
prediction_result <- predatoR(test_data, PDB_path = "path/to/PDB/files/")
predatoR()
function returns a data.frame which contains additional two columns; 'Prediction' and 'Probability'. 'Prediction' represents the result of the impact prediction and 'Probability' represents the probability that the mutation classified as Pathogenic or Neutral.
example_result <- as.data.frame(rbind(c("3SQJ", "A", "196", "GLN", "LEU", "ALB", "Neutral", "0.6521832"), c("3SQJ", "A", "396", "GLU", "LYS", "ALB", "Neutral", "0.6009792"))) colnames(example_result) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name", "Prediction", "Probability") knitr::kable(example_result, align = c("c", "c", "c", "c", "c", "c", "c", "c"))
Network properties can be calculated by using different distance cutoffs. In this approach, predatoR()
does not make any prediction about the mutation, but returns a data frame contains all 24 features annotated to the dataset. Both network formalisation approaches also can be used.
# networks build using all atoms and 7.6Å cutoff prediction_result <- predatoR(test_data, distance_cutoff = 7.6, network_approach = "all") # networks build using only Cα atoms and 8Å cutoff prediction_result <- predatoR(test_data, distance_cutoff = 8, network_approach = "ca")
The wrapper function predatoR()
uses the utility functions below;
read_PDB()
PDB2connections()
degree_score()
eigen_centrality_score()
shorteset_path_score()
betweenness_score()
clique_score()
pagerank_score()
gnomad_scores()
BLOSUM62_score()
KEGG_pathway_number()
genic_intolerance()
GO_terms()
DisGeNET()
gene_essentiality()
GTEx()
amino_acid_features()
impact_prediction()
read_PDB
read_PDB()
function downloads input PDB by default but can read and process locally stored PDB structures.
After reading, create a matrix that contains only ATOM labelled structures.
read_PDB()
has 3 arguments;
# if file stored locally PDB_2DN2_structure <- read_PDB("3SQJ", PDB_path = "path/to/PDB/files/") # if file is going to be downloded PDB_2DN2_structure <- read_PDB("3SQJ")
PDB2connections
PDB2connections()
function calculates distances between each ATOM in the PDB structure
and creates an edge between atoms if the distance <= 5Å. As a result, returns a list contains data frames of different chains' interaction lists.
PDB2connections()
has 5 arguments;
read_PDB()
functionPDB2connections()
function alone (default = TRUE)test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") edge_list <- PDB2connections(atom_matrix = PDB_2DN2_structure, filtered_info_df = test_data, n_threads = 8, single_run = TRUE)
degree_score
degree_score()
function calculates the total number of connections of all nodes.
Calculates Z-scores of each node and returns scores of desired nodes.
degree_score()
has 2 arguments;
PDB2connections()
functiontest_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") degree_z_score <- degree_score(edge_list = edge_list, filtered_info_df = test_data)
eigen_centrality_score
eigen_centrality_score()
function calculates the total number of connections of nodes to which a node is connected.
Calculates Z-scores of all nodes and returns scores of desired nodes.
eigen_centrality_score()
has 2 arguments;
PDB2connections()
functiontest_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") eigen_centrality_z_score <- eigen_centrality_score(edge_list = edge_list, filtered_info_df = test_data)
shorteset_path_score
shorteset_path_score()
function calculates the distances between each node via shortest.paths()
function
of igraph package, sum distances of each nodes, calculates Z-scores and returns scores of desired nodes.
shorteset_path_score()
has 2 arguments;
PDB2connections()
functiontest_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") shorteset_path_z_score <- shorteset_path_score(edge_list = edge_list, filtered_info_df = test_data)
betweenness_score
betweenness_score()
function calculates Betweenness Score of each node by using betweenness()
function of igraph package, turns the scores into Z-scores and returns scores of desired nodes.
betweenness_score()
has 2 arguments;
PDB2connections()
functiontest_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") betweenness_z_score <- betweenness_score(edge_list = edge_list, filtered_info_df = test_data)
clique_score
clique_score()
function calculates Clique Z-scores score of each atom and returns the desired nodes' scores.
clique_score()
has 4 arguments;
PDB2connections()
functionclique_score()
function alone (default = TRUE)test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") clique_z_score <- clique_score(edge_list = edge_list, filtered_info_df = test_data, n_threads = 8, single_run = TRUE)
pagerank_score
pagerank_score()
function assigns an importance score to every node via
page_rank()
function of igraph package. Turns the scores into Z-scores and returns scores of desired nodes.
pagerank_score()
has 2 arguments;
PDB2connections()
functiontest_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") pagerank_z_score <- pagerank_score(edge_list = edge_list, filtered_info_df = test_data)
gnomad_scores
gnomad_scores()
function gets input gene's gnomAD Synonymous Z-Score, Missense Z-Score, and PLI Score
from data retrieved from gnomAD. If there is no gene name information in the input,
gnomad_scores()
finds the genes that related with input PDB chain ID by using a dataset retrieved from Ensembl.
If there are multiple genes annotated for same PDB Chain ID, gnomad_scores()
function gets the gene that has maximum Synonymous Z-Score, Missense Z-Score, and pLI Score.
gnomad_scores()
has 1 arguments;
test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") gnomad_score <- gnomad_scores(filtered_info_df = test_data)
BLOSUM62_score
BLOSUM62_score()
function returns BLOSUM62 scores of input mutations.
BLOSUM62_score()
has 1 arguments;
test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") BLOSUM62_scores <- BLOSUM62_score(info_df = test_data)
KEGG_pathway_number
KEGG_pathway_number()
function finds the number of KEGG Pathways associated with the input genes.
KEGG_pathway_number()
has 1 arguments;
test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") KEGG_path_number <- KEGG_pathway_number(filtered_info_df = test_data)
genic_intolerance
genic_intolerance()
function returns Genic Intolerance score by using a dataset retrieved from
Genic Intolerance.
genic_intolerance()
has 1 arguments;
test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") genic_intolerance_score <- genic_intolerance(filtered_info_df = test_data)
GO_terms
GO_terms()
function returns the number of GO terms associated with the input genes.
GO_terms()
has 1 arguments;
test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") GO_terms_score <- GO_terms(filtered_info_df = test_data)
DisGeNET
DisGeNET()
function returns number of diseases associated with input genes by using a dataset retrieved from
DisGeNET.
DisGeNET()
has 1 arguments;
test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") DisGeNET_score <- DisGeNET(filtered_info_df = test_data)
gene_essentiality
gene_essentiality()
function returns Gene Essentiality scores of the input genes by using a dataset retrieved from
Online Gene Essentiality (OGEE) Database.
gene_essentiality()
has 1 arguments;
test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") gene_essentiality_score <- gene_essentiality(filtered_info_df = test_data)
GTEx
GTEx()
function calculates the median gene expression value with using 54 different tissue types' median gene expression value from
GTEx and returns the values of input genes.
GTEx()
has 1 arguments;
test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") gene_essentiality_score <- GTEx(filtered_info_df = test_data)
amino_acid_features
amino_acid_features()
function returns 6 different features:
amino_acid_features()
has 1 arguments;
test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"), c("3SQJ", "A", 60, "GLU", "LYS","ALB"))) colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name") gene_essentiality_score <- amino_acid_features(info_df = test_data)
impact_prediction
impact_prediction()
function makes impact prediction and classifies the mutation as Pathogenic or Neutral.
impact_prediction()
has 3 arguments;
From an input data.frame containing PDB_ID, Chain, Position, Orig_AA, Mut_AA, Gene_Name, degree_z_score, eigen_z_score, shortest_path_z, betwenness_scores_z, clique_z_score, pagerank_z_score, syn_z, mis_z, pLI, blosum62_scores, kegg_pathway_number, genic_intolerance, go_terms, disgenet, gene_essentiality, gtex, ref_asa, mut_asa, asa_diff, ref_hyd, mut_hyd and hyd_diff scores impact_prediction()
function classifies the mutation as Pathogenic or Neutral.
Example input:
final_df <- as.data.frame(rbind(strsplit("3SQJ, A, 196, GLN, LEU, ALB, -0.0717218071350537, -0.379784447875971, -1.61712172099028, 0.709536957153219, -0.0906809533248494, 0.0255568093251824, -0.91513, 0.65574, 0.64329, -2, 12, -0.89, 4, 81, 1, 1.515105, 177.7, 173.7, -4, -3.68, 3.8, 7.48", ", ")[[1]], strsplit("3SQJ, A, 396, GLU, LYS, ALB, -0.280119006995483, -0.406648724044331, -0.114702339705536, -0.531050971889894, 0.0706892006523297, -0.318847672527394, -0.91513, 0.65574, 0.64329, 1, 12, -0.89, 4, 81, 1, 1.515105, 182.9, 215.2, 32.3, -3.2, -4.11, -0.91", ", ")[[1]])) colnames(final_df) <- strsplit("PDB_ID, Chain, Position, Orig_AA, Mut_AA, Gene_Name, degree_z_score, eigen_z_score, shortest_path_z, betweenness_scores_z, clique_z_score, pagerank_z_score, syn_z, mis_z, pLI, blosum62_scores, kegg_pathway_number, genic_intolerance, go_terms, disgenet, gene_essentiality, gtex, ref_asa, mut_asa, asa_diff, ref_hyd, mut_hyd, hyd_diff", ", ")[[1]] DT::datatable(final_df,options = list(scrollX = TRUE))
prediction_result <- impact_prediction(final_df)
example_result <- as.data.frame(rbind(c("3SQJ", "A", "196", "GLN", "LEU", "ALB", "Neutral", "0.6521832"), c("3SQJ", "A", "396", "GLU", "LYS", "ALB", "Neutral", "0.6009792"))) colnames(example_result) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name", "Prediction", "Probability") knitr::kable(example_result, align = c("c", "c", "c", "c", "c", "c", "c", "c"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.