The goal of doppelgangerIdentifier is to find PPCC data doppelgangers that may have an inflationary effect on model accuracy.
PPCC: Pairwise Pearson’s Correlation Coefficient, the Pearson’s Correlation Coefficient between samples from two different batches.
You can install the development version of doppelgangerIdentifier from GitHub with:
# install.packages("devtools")
devtools::install_github("lr98769/doppelgangerIdentifier")
There are 4 main functions in this package:
Finds PPCC data doppelgangers in the data using batch, class and patient id meta data.
*Note: The effectiveness of getPPCCDoppelgangers depends on the efficacy of sva::ComBat. Differences in the distribution of classes between batches affect the performance of ComBat and, as a result, PPCC doppelganger identification.
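Because of this, it can help to check the class balance of each batch before running getPPCCDoppelgangers. A minimal sketch, assuming the meta data has `Class` and `Batch` columns (these column names are assumptions; use whatever your meta data actually contains):

```r
# Hypothetical check: a heavily skewed cross-tabulation warns that
# ComBat, and hence PPCC doppelganger identification, may perform poorly.
meta_data <- data.frame(
  Class = c("Tumor", "Normal", "Tumor", "Tumor", "Tumor", "Normal"),
  Batch = c("A", "A", "A", "B", "B", "B")
)
table(meta_data$Class, meta_data$Batch)
```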
library(doppelgangerIdentifier)
ppccDoppelgangerResults = getPPCCDoppelgangers(raw_data, meta_data)
Shows the distribution of PPCCs of different sample pairs.
library(doppelgangerIdentifier)
visualisePPCCDoppelgangers(ppccDoppelgangerResults)
Tests the inflationary effects of PPCC data doppelgangers.
library(doppelgangerIdentifier)
veri_result = verifyDoppelgangers(experimentPlanFilename, raw_data, meta_data)
Visualise the accuracy of each Train-Valid Pair.
library(doppelgangerIdentifier)
visualiseVerificationResults(veri_result)
4 unprocessed data sets (no batch correction carried out) and their meta data are available and ready to use with the doppelgangerIdentifier R package.
| Name | Description | Citation |
|:----:|:------------------------------------------------------:|:---------------------------------:|
| rc | Renal Cell Carcinoma Proteomics Data Set | Guo et al. |
| dmd | Duchenne Muscular Dystrophy (DMD) Microarray Data Set | Haslett et al. & Pescatori et al. |
| leuk | Leukemia Microarray Data Set | Golub et al. & Armstrong et al. |
| all | Acute Lymphoblastic Leukaemia (ALL) Microarray Data Set | Ross et al. & Yeoh et al. |
Note: Cite the original source of each data set used
In this example, we show how PPCC data doppelgangers can be identified, and verified to be functional, with the doppelgangerIdentifier R package.
library("doppelgangerIdentifier")
Doppelganger effect: When training and validation data are similar by chance, resulting in an inflation of model accuracies on the validation dataset regardless of how we train the model.
To illustrate the impacts of the Doppelganger effect, we will be using a Renal Carcinoma (RC) gene expression dataset.
#Import RC gene expression dataset
data(rc)
#Import metadata for RC gene expression dataset
data(rc_metadata)
Functional Doppelgangers: Sample pairs between training and validation datasets that cause the doppelganger effect.
When functional doppelgangers are found in both training and validation sets, the doppelganger effect is observed. Hence, it is important to identify these doppelgangers and prevent the doppelganger effect from inflating machine learning performance.
We define possible doppelgangers as samples of the same class (both samples from Tumor or both samples from Normal) but from different patients. Sample pairs of different classes serve as negative controls, while sample pairs of the same class and same patient, indicative of leakage, serve as positive controls.
Data Doppelgangers: Sample pairs of the same class that are highly similar and hence have a high chance of being functional doppelgangers
Pairwise Pearson’s Correlation Coefficient: Pearson’s Correlation Coefficient between sample pairs
Since it is computationally tedious to test different subsets of the data that cause the doppelganger effect, we instead identify data doppelgangers, sample pairs that are highly similar and have a high probability of being functional doppelgangers. In our implementation, we utilized Pairwise Pearson’s Correlation Coefficient (PPCC) as a metric of similarity and define data doppelgangers identified by this method as PPCC data doppelgangers.
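Conceptually, the PPCC computation reduces to Pearson's correlation between every cross-batch sample pair, which base R's `cor()` computes directly. A toy sketch of the idea (not the package's internal code), with samples as columns and genes as rows:

```r
# Toy expression matrices: 5 genes, 4 samples in batch 1, 3 in batch 2.
set.seed(1)
batch1 <- matrix(rnorm(20), nrow = 5,
                 dimnames = list(NULL, c("s1", "s2", "s3", "s4")))
batch2 <- matrix(rnorm(15), nrow = 5,
                 dimnames = list(NULL, c("t1", "t2", "t3")))
# One Pearson correlation per cross-batch sample pair: a 4 x 3 matrix.
ppcc <- cor(batch1, batch2, method = "pearson")
dim(ppcc)
```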
In section “3) Effects of functional doppelgangers in machine learning”, we will demonstrate that the PPCC data doppelgangers identified by this method are functional doppelgangers.
To show how PPCC data doppelgangers are identified with the RC data set, we treat each batch as a separate data set and try to find PPCC data doppelgangers between the 2 batches.
These are the steps we use to identify PPCC data doppelgangers:
PPCC: Pairwise Pearson’s Correlation Coefficient
start_time = Sys.time()
ppccDoppelgangerResults = getPPCCDoppelgangers(rc, rc_metadata)
#> [1] "1. Batch correcting the 2 data sets with sva:ComBat..."
#> Found 2 batches
#> Adjusting for 0 covariate(s) or covariate level(s)
#> Standardizing Data across genes
#> Fitting L/S model and finding priors
#> Finding parametric adjustments
#> Adjusting the Data
#> [1] "- Data is not min-max normalized"
#> [1] "2. Calculating PPCC between samples of each batch..."
#> |======================================================================| 100%
#> [1] "3. Labelling Sample Pairs according to their Class and Patient Similarities..."
#> [1] "4. Calculating PPCC cut off to identify PPCC data doppelgangers..."
#> [1] "5. Identifying PPCC data doppelgangers..."
end_time = Sys.time()
end_time-start_time
#> Time difference of 4.11564 secs
The function above carries out steps 1-5 and outputs the results as a list containing the following elements:
View(ppccDoppelgangerResults$Processed_data)
View(ppccDoppelgangerResults$PPCC_matrix)
PPCC_df: Data frame of PPCCs between samples of different batches (NumberOfSamplePairs*5). The columns of the data frame are as follows:
Sample1: Name of the first sample of the pair (from the first batch)
View(ppccDoppelgangerResults$PPCC_df)
paste("PPCC cut off:", ppccDoppelgangerResults$cut_off)
#> [1] "PPCC cut off: 0.922552571814869"
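Step 5 then amounts to flagging sample pairs whose PPCC exceeds this cut-off. A hypothetical sketch with made-up values (the real PPCC_df column names and labelling may differ):

```r
# Mocked cross-batch sample pairs with illustrative PPCC values.
ppcc_df <- data.frame(
  Sample1 = c("a1", "a2", "a3"),
  Sample2 = c("b1", "b2", "b3"),
  PPCC    = c(0.95, 0.80, 0.93)
)
cut_off <- 0.9226  # illustrative cut-off, close to the one printed above
# Pairs above the cut-off are treated as PPCC data doppelgangers.
ppcc_df$Doppelganger <- ppcc_df$PPCC > cut_off
sum(ppcc_df$Doppelganger)  # 2 of the 3 pairs are flagged
```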
To visualize the PPCC data doppelgangers, we pass the ppccDoppelgangerResults (output list of getPPCCDoppelgangers) to the visualisePPCCDoppelgangers function.
visualisePPCCDoppelgangers(ppccDoppelgangerResults)
When functional doppelgangers are present in both training and validation data sets, an inflation in accuracy on the validation data set is observed regardless of how the model is trained.
We show that the PPCC data doppelgangers found above cause the doppelganger effect when included in both training and validation sets with the following steps:
start_time = Sys.time()
verificationResults = verifyDoppelgangers(
"tutorial/experiment_plans/rc_ex_plan.csv", rc, rc_metadata)
#> [1] "1. Loading Experiment Plan..."
#> [1] "2. Preprocessing data..."
#> [1] "- Batch correcting with sva:ComBat..."
#> Found 2 batches
#> Adjusting for 0 covariate(s) or covariate level(s)
#> Standardizing Data across genes
#> Fitting L/S model and finding priors
#> Finding parametric adjustments
#> Adjusting the Data
#> [1] "- Carrying out min-max normalisation"
#> [1] "3. Generating Feature Sets..."
#> [1] "4. Training KNN models..."
#> |======================================================================| 100%
end_time = Sys.time()
end_time-start_time
#> Time difference of 0.9779961 secs
The function above carries out the experiment plan in rc_ex_plan.csv and returns the results in a list. The following are the elements in the list:
View(verificationResults$combat_minmax)
View(verificationResults$feature_sets)
View(verificationResults$accuracy_mat)
View(verificationResults$accuracy_df)
In our current experiment plan, there are 6 training-validation data set pairs:
The negative control, Binomial, does not require any form of training since it is the accuracy generated by 12 (number of feature sets) binomial distributions with N = 8 (because there are eight samples in the validation set) and P = 0.5 (probability of guessing the correct label for each validation sample).
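The binomial baseline described above can be simulated directly in base R. A sketch (not the package's internal code) of how such baseline accuracies arise:

```r
set.seed(42)  # for reproducibility of this illustration only
# 12 feature sets; each validation set has 8 samples, each guessed
# correctly with probability 0.5, so accuracy = correct guesses / 8.
binom_acc <- rbinom(n = 12, size = 8, prob = 0.5) / 8
length(binom_acc)  # 12 baseline accuracies, one per feature set
mean(binom_acc)    # hovers around 0.5, as expected by chance
```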
The increasing number of doppelgangers in the validation set illustrates the dosage-dependent behaviour of doppelgangers.
Here we load the experiment plan from a comma-separated file. The experiment plan specifies the names of the samples in each training and validation set. Care has been taken to prevent any leakage between the training and validation sets of the 0-8 Doppel experiments.
To visualize the effect of the PPCC data doppelgangers on validation accuracy, we pass verificationResults (the output list of verifyDoppelgangers) to the visualiseVerificationResults function.
ori_train_valid_names = c("Doppel_0","Doppel_2", "Doppel_4", "Doppel_6", "Doppel_8", "Neg_Con", "Pos_Con")
new_train_valid_names = c("0 Doppel", "2 Doppel", "4 Doppel", "6 Doppel", "8 Doppel", "Binomial", "Perfect Leakage")
visualiseVerificationResults(verificationResults,
ori_train_valid_names,
new_train_valid_names)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
We observe a dosage-dependent relationship between the number of doppelgangers and model accuracy on the validation set: accuracy increases as the number of doppelgangers in the validation set increases.
In this tutorial, we demonstrate how functional doppelgangers can be identified in an RNA-Seq data set.
library("doppelgangerIdentifier")
The loaded data set is a preprocessed subset of the GSE81538 RNA-Seq data set. The preprocessing steps are described in “./tutorial/dataset/dataset_preprocessing_information.txt”.
bc = readRDS("tutorial/dataset/bc_her2_tut.rds")
bc_meta = readRDS("tutorial/dataset/bc_her2_meta_tut.rds")
Since the data set used is an RNA-Seq data set, Combat-Seq will be used as the batch correction method prior to PPCC value calculation.
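ComBat-seq expects raw integer counts, unlike ComBat, which was applied to the microarray intensities earlier in this tutorial. A quick sanity-check sketch on a mocked matrix (substitute your actual bc object):

```r
# bc_mock stands in for the bc count matrix used in this tutorial.
bc_mock <- matrix(c(10L, 0L, 532L, 3L), nrow = 2)
# Raw RNA-Seq counts should be non-negative integers.
looks_like_counts <- all(bc_mock >= 0) && all(bc_mock == round(bc_mock))
looks_like_counts  # TRUE for raw counts
```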
# Get PPCC Data Doppelgangers
start_time = Sys.time()
doppel_bc = getPPCCDoppelgangers(
raw_data = bc,
meta_data = bc_meta,
do_batch_corr = TRUE,
do_min_max = TRUE,
batch_corr_method = "ComBat_seq"
)
#> [1] "1. Batch correcting the 2 data sets with sva:ComBat_seq..."
#> Found 2 batches
#> Using null model in ComBat-seq.
#> Adjusting for 0 covariate(s) or covariate level(s)
#> Estimating dispersions
#> Fitting the GLM model
#> Shrinkage off - using GLM estimates for parameters
#> Adjusting the data
#> [1] "- Data is min-max normalized"
#> [1] "2. Calculating PPCC between samples of each batch..."
#> |======================================================================| 100%
#> [1] "3. Labelling Sample Pairs according to their Class and Patient Similarities..."
#> [1] "4. Calculating PPCC cut off to identify PPCC data doppelgangers..."
#> [1] "5. Identifying PPCC data doppelgangers..."
end_time = Sys.time()
end_time - start_time
#> Time difference of 1.783855 mins
visualisePPCCDoppelgangers(doppel_bc)
To find out if the identified PPCC data doppelgangers (DDs) are functional doppelgangers (FDs), we create an experiment plan that incrementally increases the number of PPCC DD samples in the validation set. If we observe an increasing trend of random model accuracy with an increasing number of PPCC DD samples in the validation set, then we can conclude that the identified PPCC DDs are FDs.
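This "increasing trend" criterion can also be checked numerically, e.g. with a rank correlation between the doppelganger count and mean validation accuracy. A sketch with made-up accuracies (illustration only, not results from this data set):

```r
n_doppel <- c(0, 6, 12, 18, 24)               # PPCC DD samples in validation
mean_acc <- c(0.55, 0.63, 0.71, 0.80, 0.88)   # illustrative values only
# Spearman's rho is exactly 1 for a strictly monotone increase.
cor(n_doppel, mean_acc, method = "spearman")
```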
start_time = Sys.time()
veri_bc = verifyDoppelgangers(
experiment_plan_filename = "tutorial/experiment_plans/bc_ex_plan.csv",
raw_data = bc,
meta_data = bc_meta,
batch_corr_method = "ComBat_seq",
k=9,
size_of_val_set = 48,
feature_set_portion = 0.01
)
#> [1] "1. Loading Experiment Plan..."
#> [1] "2. Preprocessing data..."
#> [1] "- Batch correcting with sva:ComBat_seq..."
#> Found 2 batches
#> Using null model in ComBat-seq.
#> Adjusting for 0 covariate(s) or covariate level(s)
#> Estimating dispersions
#> Fitting the GLM model
#> Shrinkage off - using GLM estimates for parameters
#> Adjusting the data
#> [1] "- Carrying out min-max normalisation"
#> [1] "3. Generating Feature Sets..."
#> [1] "4. Training KNN models..."
#> |======================================================================| 100%
end_time = Sys.time()
end_time - start_time
#> Time difference of 1.589066 mins
ori_train_valid_names = c("Doppel_0","Doppel_6", "Doppel_12", "Doppel_18", "Doppel_24", "Neg_Con", "Pos_Con_24")
new_train_valid_names = c("0 Doppel", "6 Doppel", "12 Doppel", "18 Doppel", "24 Doppel", "Binomial", "24 Perfect Leakage")
visualiseVerificationResults(
veri_bc,
original_train_valid_names = ori_train_valid_names,
new_train_valid_names = new_train_valid_names
)
Since a positive relationship between the number of PPCC DD samples and random model accuracy is observed, we can conclude that all identified PPCC DDs are FDs.