getPPCCDoppelgangers: A function to identify PPCC Data Doppelgangers

View source: R/getPPCCDoppelgangers.R

getPPCCDoppelgangers    R Documentation

A function to identify PPCC Data Doppelgangers

Description

This function performs the following steps to identify PPCC (pairwise Pearson's correlation coefficient) data doppelgangers between batches:

  1. Correct batch effects with sva::ComBat (skipped when do_batch_corr = FALSE)

  2. Calculate PPCC values between samples of different batches

  3. Label sample pairs according to their patient id and class similarities

  4. Calculate the PPCC cut-off (the maximum PPCC of any "Different Class Different Patient" sample pair)

  5. Identify PPCC data doppelgangers as "Same Class Different Patient" sample pairs whose PPCC values exceed the PPCC cut-off.
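
The labelling and cut-off logic in steps 2 to 5 can be sketched in base R. This is a minimal illustration on made-up data, not the package's implementation: it skips batch correction, and the object names (pairs, cutoff, doppelgangers) and the toy values are assumptions chosen for the example.

```r
# Toy inputs (all values hypothetical): 4 variables x 5 samples in two batches.
raw_data <- data.frame(
  A1 = c(1, 2, 3, 4),
  A2 = c(4, 3, 2, 1),
  B1 = c(2, 4, 6, 9),
  B2 = c(1, 4, 2, 3),
  B3 = c(1, 2, 3, 5),
  row.names = c("g1", "g2", "g3", "g4")
)
meta_data <- data.frame(
  Class      = c("X", "Y", "X", "Y", "X"),
  Patient_ID = c("P1", "P2", "P3", "P4", "P1"),
  Batch      = c("A", "A", "B", "B", "B"),
  row.names  = colnames(raw_data)
)

# Step 2: enumerate sample pairs and keep those spanning different batches.
pairs <- as.data.frame(t(combn(colnames(raw_data), 2)), stringsAsFactors = FALSE)
names(pairs) <- c("Sample1", "Sample2")
cross_batch <- meta_data[pairs$Sample1, "Batch"] != meta_data[pairs$Sample2, "Batch"]
pairs <- pairs[cross_batch, ]
pairs$PPCC <- mapply(
  function(s1, s2) cor(raw_data[[s1]], raw_data[[s2]]),
  pairs$Sample1, pairs$Sample2,
  USE.NAMES = FALSE
)

# Step 3: label each pair by class and patient similarity.
same_class   <- meta_data[pairs$Sample1, "Class"] == meta_data[pairs$Sample2, "Class"]
same_patient <- meta_data[pairs$Sample1, "Patient_ID"] == meta_data[pairs$Sample2, "Patient_ID"]
pairs$Label <- paste(
  ifelse(same_class, "Same Class", "Different Class"),
  ifelse(same_patient, "Same Patient", "Different Patient")
)

# Step 4: the cut-off is the largest PPCC among "Different Class Different Patient" pairs.
cutoff <- max(pairs$PPCC[pairs$Label == "Different Class Different Patient"])

# Step 5: doppelgangers are "Same Class Different Patient" pairs above the cut-off.
doppelgangers <- pairs[
  pairs$Label == "Same Class Different Patient" & pairs$PPCC > cutoff,
]
print(doppelgangers)
```

In this toy data only the pair A1/B1 passes the cut-off; the pair A1/B3 is also highly correlated but comes from the same patient, so it is not labelled a doppelganger.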

Usage

getPPCCDoppelgangers(
  raw_data,
  meta_data,
  do_batch_corr = TRUE,
  correlation_function = cor,
  batch_corr_method = "ComBat",
  do_min_max = FALSE
)

Arguments

raw_data

Data frame in which each column is a sample and each row is a variable; the row names are the variable names.

meta_data

Data frame with the columns "Class", "Patient_ID" and "Batch", giving the class, patient id and batch of each sample. Each row is a sample, and the sample names must be the row names of the data frame, not a separate column.

do_batch_corr

If FALSE, no batch correction is carried out before doppelgangers are identified.

correlation_function

Correlation function to use. Pearson's correlation coefficient (cor) is the default. User-defined functions should accept two vector parameters, x and y.

batch_corr_method

Batch correction method to use. Only two options are accepted: "ComBat" or "ComBat_seq".

do_min_max

If TRUE, min-max normalization is carried out just before PPCC calculation.
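
As an illustration of the correlation_function argument, a user-defined function only needs to accept two vectors, x and y, and return a single correlation value. The sketch below wraps Spearman's rank correlation; the name spearman_cor is hypothetical and chosen for this example.

```r
# Hypothetical user-defined correlation function: takes two numeric vectors,
# x and y, and returns Spearman's rank correlation via stats::cor.
spearman_cor <- function(x, y) {
  cor(x, y, method = "spearman")
}

# Quick check on a strictly increasing (monotone) relationship.
rho <- spearman_cor(c(1, 2, 3, 4), c(1, 4, 9, 16))
print(rho)

# It could then be passed to the function documented here, for example:
# getPPCCDoppelgangers(raw_data, meta_data, correlation_function = spearman_cor)
```

Whether a rank-based measure is appropriate depends on the data; the default remains Pearson's correlation coefficient.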

Details

This function also identifies PPCC data doppelgangers within a batch (if only 1 batch is detected in meta_data). In this case it performs the following steps:

  1. Calculate PPCC values between samples within the batch

  2. Label sample pairs according to their patient id and class similarities

  3. Calculate the PPCC cut-off (the maximum PPCC of any "Different Class Different Patient" sample pair)

  4. Identify PPCC data doppelgangers as "Same Class Different Patient" sample pairs whose PPCC values exceed the PPCC cut-off.

Troubleshooting Tips:

  1. Ensure every sample name in rownames(meta_data) appears in colnames(raw_data), and vice versa.
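
The check in the tip above can be run before calling the function. The following is a minimal sketch on made-up inputs; the data values and the helper variable names (missing_from_raw, missing_from_meta, missing_cols) are assumptions for illustration.

```r
# Toy inputs (hypothetical): samples are columns of raw_data and rows of meta_data.
raw_data <- data.frame(
  S1 = c(5.1, 3.2),
  S2 = c(4.8, 3.0),
  S3 = c(6.0, 2.9),
  row.names = c("gene_a", "gene_b")
)
meta_data <- data.frame(
  Class      = c("X", "Y", "X"),
  Patient_ID = c("P1", "P2", "P3"),
  Batch      = c("A", "A", "B"),
  row.names  = c("S1", "S2", "S4")  # "S4" is a deliberate mismatch
)

# Samples described in meta_data but absent from raw_data, and vice versa.
missing_from_raw  <- setdiff(rownames(meta_data), colnames(raw_data))
missing_from_meta <- setdiff(colnames(raw_data), rownames(meta_data))

# Required metadata columns that are absent.
missing_cols <- setdiff(c("Class", "Patient_ID", "Batch"), colnames(meta_data))

print(missing_from_raw)
print(missing_from_meta)
print(missing_cols)
```

All three vectors should be empty before the data are passed on; here the first two each report one mismatched sample name.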

Value

A list containing the PPCC matrix, the PPCC data frame and a list of the doppelgangers identified.

Examples

library(doppelgangerIdentifier)  # package name as given in the documentation footer
ppccDoppelgangerResults <- getPPCCDoppelgangers(rc, rc_metadata)

lr98769/doppelgangerIdentifier documentation built on Aug. 2, 2022, 9:41 a.m.