In berkgurdamar/predatoR: Mutation Impact Prediction Based on Network Properties

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
suppressPackageStartupMessages(library(predatoR))

Quick Start

In this section, we show you how to use predatoR package for mutation impact prediction.

predatoR package makes prediction about impact of a mutation based on an Adaboost model and classifies the mutations as Pathogenic or Neutral. Model was created by using Varibench and ClinVar datasets. Missense3D dataset also used for method comparison.

Installation

You can install the predatoR via devtools:

``` {r installation_guide, eval=FALSE} library(devtools) install_github("berkgurdamar/predatoR")

### Workflow

<img src="./predatoR_workflow.png" style="max-width:100%;" />

`predatoR()` function is the wrapper function of the `predatoR` package. `predatoR()` gets a data.frame structure as an input. Input should contain 5 mandatory and 1 optional columns. Mandatory columns are  __'PDB_ID'__, __'Chain'__, __'Position'__, __'Orig_AA'__ and  __'Mut_AA'__. __'Gene_Name'__ column is optional. Predictions can be made by using 2 different models, 5 Angstrom (Å)-all atoms model and 7Å-carbon alpha (Cα) atoms only model.

`predatoR()` function works on each PDB ID respectively. For each PDB ID;  

 - Downloads/Reads PDB file
 - Calculates distance between each atom in the structure
 - Creates a network from PDB
 - Calculates Degree Centrality Z-Score of each atom
 - Calculates Eigen Centrality Z-Score of each atom
 - Calculates Shortest Path Z-Score of each atom
 - Calculates Betweenness Z-Score of each atom
 - Calculates Clique Z-Score of each atom
 - Calculates PageRank Z-Score of each atom
 - Gets [gnomAD](https://gnomad.broadinstitute.org/) Synonymous Z-Score, Missense Z-Score, and pLI Score
 - Gets BLOSUM62 score of the mutation
 - Finds the number of [KEGG Pathways](https://www.genome.jp/kegg/) which contains the input gene
 - Gets [Genic Intolerance](http://genic-intolerance.org/) Score
 - Finds the number of GO terms associated with the input gene
 - Finds the number of diseases associated with the input gene from [DisGeNET](https://www.disgenet.org/)
 - Gets the Gene Essentiality Score from Online Gene Essentiality ([OGEE](https://v3.ogee.info/#/home)) Database
 - Calculates the median gene expression value of the input gene from [GTEx](https://gtexportal.org/home/)
 - Calculates 6 different features from Accessible Surface Area and Hydrophobicity Scale of reference and mutant amino acids
 - Makes prediction


If `predatoR()` couldn't find any information of any gene such as 'Genic Intolerance Score', those inputs will be removed from the query before the prediction step.

### Arguments

 `predatoR()` function has 6 arguments; 

 - __info_df:__ Input data.frame
 - __PDB_path:__ PDB file path if the user wants to use PDB structures stored locally (default = NULL)
 - __n_threads:__ Number of threads (default = NULL)
 - __gene_name_info:__ Whether there is gene name information in the input or not (default = TRUE)
 - __distance_cutoff:__ distance cutoff for setting edges (default = 5)
 - __network_approach:__ network building approach; "all" (default) for using all atoms only or "ca" for using ca atoms only

 When running the `predatoR()` function, if gene name is included in the input, `gene_name_info`
 should be set as `TRUE`. If there is no gene name information in the input (`gene_name_info = FALSE`), 
 `predatoR()` function gets the related gene names from [Ensembl](https://www.ensembl.org/) and if there are multiple
 genes annotated for the PDB ID, choose the gene having the higher gnomAD metrics. 

 `predatoR()` can make predictions using 7Å or 5Å distance cutoffs. For exploratory purposes, different cutoffs can be used and network properties can be calculated.


 Example input with the __'Gene_Name'__ information included:

```r
test_data <- as.data.frame(rbind(c("3SQJ", "A", 196, "GLN", "LEU", "ALB"),
                               c("3SQJ", "A", 396, "GLU", "LYS", "ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
knitr::kable(test_data, align = c("c", "c", "c", "c", "c", "c"))

library(predatoR)
prediction_result <- predatoR(test_data, gene_name_info = TRUE)

Example input with the 'Gene_Name' information does not included:

test_data <- as.data.frame(rbind(c("3SQJ", "A", 196, "GLN", "LEU"),
                               c("3SQJ", "A", 396, "GLU", "LYS")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA")
knitr::kable(test_data, align = c("c", "c", "c", "c", "c"))

prediction_result <- predatoR(test_data, gene_name_info = FALSE)

predatoR() function can also work with no gene name inputs. If a mutation has no gene name information, it gets the related gene names from Ensembl and if there are multiple genes annotated for the PDB ID, gets the gene which has maximum GnomAD scores (gene_name_info should set as TRUE).

test_data <- as.data.frame(rbind(c("3SQJ", "A", 196, "GLN", "LEU", "ALB"),
                                 c("3SQJ", "A", 396, "GLU", "LYS", "ALB"),
                                 c("1A4Y", "A", 169, "PRO", "LEU", "")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
knitr::kable(test_data, align = c("c", "c", "c", "c", "c", "c"))

prediction_result <- predatoR(test_data, gene_name_info = TRUE)

predatoR() function uses max_thread_number - 1 as a default when n_threads = NULL. n_threads can be set by user.

prediction_result <- predatoR(test_data, n_threads = 8) # 8 threads will be used

predatoR() function downloads each PDB by default (PDB_path = NULL) or can use PDB files stored locally when PDB_path specified. It will automatically read and process the input PDB file from the path.

prediction_result <- predatoR(test_data, PDB_path = "path/to/PDB/files/")

Result

predatoR() function returns a data.frame which contains additional two columns; 'Prediction' and 'Probability'. 'Prediction' represents the result of the impact prediction and 'Probability' represents the probability that the mutation classified as Pathogenic or Neutral.

example_result <- as.data.frame(rbind(c("3SQJ", "A", "196", "GLN", "LEU", "ALB", "Neutral", "0.6521832"),
                                      c("3SQJ", "A", "396", "GLU", "LYS", "ALB", "Neutral", "0.6009792")))
colnames(example_result) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name", "Prediction", "Probability")
knitr::kable(example_result, align = c("c", "c", "c", "c", "c", "c", "c", "c"))

Exploratory Usage

Network properties can be calculated by using different distance cutoffs. In this approach, predatoR() does not make any prediction about the mutation, but returns a data frame contains all 24 features annotated to the dataset. Both network formalisation approaches also can be used.

# networks build using all atoms and 7.6Å cutoff
prediction_result <- predatoR(test_data, distance_cutoff = 7.6, network_approach = "all") 

# networks build using only Cα atoms and 8Å cutoff
prediction_result <- predatoR(test_data, distance_cutoff = 8, network_approach = "ca")

Utility Functions

The wrapper function predatoR() uses the utility functions below;

read_PDB()
PDB2connections()
degree_score()
eigen_centrality_score()
shorteset_path_score()
betweenness_score()
clique_score()
pagerank_score()
gnomad_scores()
BLOSUM62_score()
KEGG_pathway_number()
genic_intolerance()
GO_terms()
DisGeNET()
gene_essentiality()
GTEx()
amino_acid_features()
impact_prediction()

`read_PDB`

read_PDB() function downloads input PDB by default but can read and process locally stored PDB structures. After reading, create a matrix that contains only ATOM labelled structures.

Arguments

read_PDB() has 3 arguments;

PDB_ID: PDB ID (E.g. "3SQJ")
PDB_path: PDB file path if files stored locally (default = FALSE)
network_approach: network building approach; "all" (default) for using all atoms only or "ca" for using ca atoms only

Usage

# if file stored locally
PDB_2DN2_structure <- read_PDB("3SQJ", PDB_path = "path/to/PDB/files/") 
# if file is going to be downloded
PDB_2DN2_structure <- read_PDB("3SQJ")

`PDB2connections`

PDB2connections() function calculates distances between each ATOM in the PDB structure and creates an edge between atoms if the distance <= 5Å. As a result, returns a list contains data frames of different chains' interaction lists.

Arguments

PDB2connections() has 5 arguments;

atom_matrix: Matrix created by read_PDB() function
filtered_info_df: data.frame contains only one PDB entries
n_threads: number of threads (default = NULL)
single_run: should be set as TRUE when using PDB2connections() function alone (default = TRUE)
distance_cutoff: distance cutoff for setting edges (default = 5)

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
edge_list <- PDB2connections(atom_matrix = PDB_2DN2_structure, filtered_info_df = test_data, n_threads = 8, single_run = TRUE)

`degree_score`

degree_score() function calculates the total number of connections of all nodes. Calculates Z-scores of each node and returns scores of desired nodes.

Arguments

degree_score() has 2 arguments;

edge_list: List created by PDB2connections() function
filtered_info_df: data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
degree_z_score <- degree_score(edge_list = edge_list, filtered_info_df = test_data)

`eigen_centrality_score`

eigen_centrality_score() function calculates the total number of connections of nodes to which a node is connected. Calculates Z-scores of all nodes and returns scores of desired nodes.

Arguments

eigen_centrality_score() has 2 arguments;

edge_list: List created by PDB2connections() function
filtered_info_df: data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
eigen_centrality_z_score <- eigen_centrality_score(edge_list = edge_list, filtered_info_df = test_data)

`shorteset_path_score`

shorteset_path_score() function calculates the distances between each node via shortest.paths() function of igraph package, sum distances of each nodes, calculates Z-scores and returns scores of desired nodes.

Arguments

shorteset_path_score() has 2 arguments;

edge_list: List created by PDB2connections() function
filtered_info_df: data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
shorteset_path_z_score <- shorteset_path_score(edge_list = edge_list, filtered_info_df = test_data)

`betweenness_score`

betweenness_score() function calculates Betweenness Score of each node by using betweenness() function of igraph package, turns the scores into Z-scores and returns scores of desired nodes.

Arguments

betweenness_score() has 2 arguments;

edge_list: List created by PDB2connections() function
filtered_info_df: data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
betweenness_z_score <- betweenness_score(edge_list = edge_list, filtered_info_df = test_data)

`clique_score`

clique_score() function calculates Clique Z-scores score of each atom and returns the desired nodes' scores.

Arguments

clique_score() has 4 arguments;

edge_list: List created by PDB2connections() function
filtered_info_df: data.frame contains only one PDB entries
n_threads: number of threads (default = NULL)
single_run: should be set as TRUE when using clique_score() function alone (default = TRUE)

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
clique_z_score <- clique_score(edge_list = edge_list, filtered_info_df = test_data, n_threads = 8, single_run = TRUE)

`pagerank_score`

pagerank_score() function assigns an importance score to every node via page_rank() function of igraph package. Turns the scores into Z-scores and returns scores of desired nodes.

Arguments

pagerank_score() has 2 arguments;

edge_list: List created by PDB2connections() function
filtered_info_df: data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
pagerank_z_score <- pagerank_score(edge_list = edge_list, filtered_info_df = test_data)

`gnomad_scores`

gnomad_scores() function gets input gene's gnomAD Synonymous Z-Score, Missense Z-Score, and PLI Score from data retrieved from gnomAD. If there is no gene name information in the input, gnomad_scores() finds the genes that related with input PDB chain ID by using a dataset retrieved from Ensembl. If there are multiple genes annotated for same PDB Chain ID, gnomad_scores() function gets the gene that has maximum Synonymous Z-Score, Missense Z-Score, and pLI Score.

Arguments

gnomad_scores() has 1 arguments;

filtered_info_df: data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
gnomad_score <- gnomad_scores(filtered_info_df = test_data)

`BLOSUM62_score`

BLOSUM62_score() function returns BLOSUM62 scores of input mutations.

Arguments

BLOSUM62_score() has 1 arguments;

info_df: data.frame contains all the input mutations

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
BLOSUM62_scores <- BLOSUM62_score(info_df = test_data)

`KEGG_pathway_number`

KEGG_pathway_number() function finds the number of KEGG Pathways associated with the input genes.

Arguments

KEGG_pathway_number() has 1 arguments;

filtered_info_df: input data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
KEGG_path_number <- KEGG_pathway_number(filtered_info_df = test_data)

`genic_intolerance`

genic_intolerance() function returns Genic Intolerance score by using a dataset retrieved from Genic Intolerance.

Arguments

genic_intolerance() has 1 arguments;

filtered_info_df: input data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
genic_intolerance_score <- genic_intolerance(filtered_info_df = test_data)

`GO_terms`

GO_terms() function returns the number of GO terms associated with the input genes.

Arguments

GO_terms() has 1 arguments;

filtered_info_df: input data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
GO_terms_score <- GO_terms(filtered_info_df = test_data)

`DisGeNET`

DisGeNET() function returns number of diseases associated with input genes by using a dataset retrieved from DisGeNET.

Arguments

DisGeNET() has 1 arguments;

filtered_info_df: input data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
DisGeNET_score <- DisGeNET(filtered_info_df = test_data)

`gene_essentiality`

gene_essentiality() function returns Gene Essentiality scores of the input genes by using a dataset retrieved from Online Gene Essentiality (OGEE) Database.

Arguments

gene_essentiality() has 1 arguments;

filtered_info_df: input data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
gene_essentiality_score <- gene_essentiality(filtered_info_df = test_data)

`GTEx`

GTEx() function calculates the median gene expression value with using 54 different tissue types' median gene expression value from GTEx and returns the values of input genes.

Arguments

GTEx() has 1 arguments;

filtered_info_df: input data.frame contains only one PDB entries

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
gene_essentiality_score <- GTEx(filtered_info_df = test_data)

`amino_acid_features`

amino_acid_features() function returns 6 different features:

Accessible Surface Area of reference amino acid
Accessible Surface Area of mutant amino acid
Accessible Surface Area difference of mutant and reference amino acid
Hydrophobicity Scale of reference amino acid
Hydrophobicity Scale of mutant amino acid
Hydrophobicity Scale difference of mutant and reference amino acid

Arguments

amino_acid_features() has 1 arguments;

info_df: input data.frame contains all the mutations

Usage

test_data <- as.data.frame(rbind(c("3SQJ", "A", 3, "HIS", "GLN", "ALB"),
                                 c("3SQJ", "A", 60, "GLU", "LYS","ALB")))
colnames(test_data) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name")
gene_essentiality_score <- amino_acid_features(info_df = test_data)

`impact_prediction`

impact_prediction() function makes impact prediction and classifies the mutation as Pathogenic or Neutral.

Arguments

impact_prediction() has 3 arguments;

final_df: data.frame contains all the required information for impact prediction
distance_cutoff: distance cutoff for setting edges (default = 5)
network_approach: network building approach; "all" (default) for using all atoms only or "ca" for using ca atoms only

Usage

From an input data.frame containing PDB_ID, Chain, Position, Orig_AA, Mut_AA, Gene_Name, degree_z_score, eigen_z_score, shortest_path_z, betwenness_scores_z, clique_z_score, pagerank_z_score, syn_z, mis_z, pLI, blosum62_scores, kegg_pathway_number, genic_intolerance, go_terms, disgenet, gene_essentiality, gtex, ref_asa, mut_asa, asa_diff, ref_hyd, mut_hyd and hyd_diff scores impact_prediction() function classifies the mutation as Pathogenic or Neutral.

Example input:

final_df <- as.data.frame(rbind(strsplit("3SQJ, A, 196, GLN, LEU, ALB, -0.0717218071350537, -0.379784447875971, -1.61712172099028, 0.709536957153219, -0.0906809533248494, 0.0255568093251824, -0.91513, 0.65574, 0.64329, -2, 12, -0.89, 4, 81, 1, 1.515105, 177.7, 173.7, -4, -3.68, 3.8, 7.48", ", ")[[1]],
                                strsplit("3SQJ, A, 396, GLU, LYS, ALB, -0.280119006995483, -0.406648724044331, -0.114702339705536, -0.531050971889894, 0.0706892006523297, -0.318847672527394, -0.91513, 0.65574, 0.64329, 1, 12, -0.89, 4, 81, 1, 1.515105, 182.9, 215.2, 32.3, -3.2, -4.11, -0.91", ", ")[[1]]))
colnames(final_df) <- strsplit("PDB_ID, Chain, Position, Orig_AA, Mut_AA, Gene_Name, degree_z_score, eigen_z_score, shortest_path_z, betweenness_scores_z, clique_z_score, pagerank_z_score, syn_z, mis_z, pLI, blosum62_scores, kegg_pathway_number, genic_intolerance, go_terms, disgenet, gene_essentiality, gtex, ref_asa, mut_asa, asa_diff, ref_hyd, mut_hyd, hyd_diff", ", ")[[1]]
DT::datatable(final_df,options = list(scrollX = TRUE))

prediction_result <- impact_prediction(final_df)

example_result <- as.data.frame(rbind(c("3SQJ", "A", "196", "GLN", "LEU", "ALB", "Neutral", "0.6521832"),
                                      c("3SQJ", "A", "396", "GLU", "LYS", "ALB", "Neutral", "0.6009792")))
colnames(example_result) <- c("PDB_ID", "Chain", "Position", "Orig_AA", "Mut_AA", "Gene_Name", "Prediction", "Probability")
knitr::kable(example_result, align = c("c", "c", "c", "c", "c", "c", "c", "c"))

berkgurdamar/predatoR documentation built on Jan. 10, 2023, 10:51 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

berkgurdamar/predatoR Mutation Impact Prediction Based on Network Properties

In berkgurdamar/predatoR: Mutation Impact Prediction Based on Network Properties

Quick Start

Installation

Result

Exploratory Usage

Utility Functions

read_PDB

Arguments

Usage

PDB2connections

Arguments

Usage

degree_score

Arguments

Usage

eigen_centrality_score

Arguments

Usage

shorteset_path_score

Arguments

Usage

betweenness_score

Arguments

Usage

clique_score

Arguments

Usage

pagerank_score

Arguments

Usage

gnomad_scores

Arguments

Usage

BLOSUM62_score

Arguments

Usage

KEGG_pathway_number

Arguments

Usage

genic_intolerance

Arguments

Usage

GO_terms

Arguments

Usage

DisGeNET

Arguments

Usage

gene_essentiality

Arguments

Usage

GTEx

Arguments

Usage

amino_acid_features

Arguments

Usage

impact_prediction

Arguments

Usage

R Package Documentation

Browse R Packages

We want your feedback!

berkgurdamar/predatoR
Mutation Impact Prediction Based on Network Properties

`read_PDB`

`PDB2connections`

`degree_score`

`eigen_centrality_score`

`shorteset_path_score`

`betweenness_score`

`clique_score`

`pagerank_score`

`gnomad_scores`

`BLOSUM62_score`

`KEGG_pathway_number`

`genic_intolerance`

`GO_terms`

`DisGeNET`

`gene_essentiality`

`GTEx`

`amino_acid_features`

`impact_prediction`