This vignette walks through a working example of applying the SNIP algorithm to an example pedigree data set. We first load the snipR package, along with the tidyverse package.
library(snipR)
library(tidyverse)
We will be deduplicating the pedigrees data set, which is included in the package. The data represent pedigrees ascertained based on family history and include information on cancer statuses and ages, genetic testing results, etc. Each family has a FamID; duplicate families share the same FamID. Each submitted pedigree has a unique RequestID. Hence, the goal of the algorithm is to detect the duplicates, i.e., the pedigrees that share a FamID. Within a family, each individual has an ID, and the MotherID and FatherID give the ID of the individual's mother and father, respectively. A portion of the data set is shown below. The data are mostly in the format used for running the Mendelian model PanelPRO (https://projects.iq.harvard.edu/bayesmendel/panelpro).
head(pedigrees)
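As a quick exploratory check (not part of the SNIP workflow itself), we can count how many submitted pedigrees share each FamID, which shows how many duplicate copies of each family are present. This sketch assumes only the FamID and RequestID columns described above.

# Count the number of distinct RequestIDs (submitted copies) per FamID.
pedigrees %>%
  distinct(FamID, RequestID) %>%
  count(FamID, name = "n_copies") %>%
  arrange(desc(n_copies)) %>%
  head()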
We now run SNIP on the data set. We use the cancer statuses and ages, the genetic testing results, and the family size as the key variables. We use the default parameters for the algorithm: 1 sort key iteration, a sliding window size of 10, a key length of 19 (the number of key variables), and all 7 thresholds. We use the breast cancer age (AgeBC) as the priority variable, so that within each cluster of duplicates, we choose the family whose proband has the minimum breast cancer age.
cancers <- c("BC", "OC", "COL", "ENDO", "PANC", "MELA")
genes <- c("BRCA1", "BRCA2", "MLH1", "MSH2", "MSH6", "CDKN2A")
keyVars <- c(paste0("isAff", cancers), paste0("Age", cancers), genes, "famSize")
keyVars.female <- c("isAffOC", "isAffENDO")
res <- snip(pedigrees = pedigrees, requestID = "RequestID", isProband = "isProband",
            keyVars = keyVars, keyVars.female = keyVars.female,
            priority = list(var = "AgeBC", min = TRUE), seed = 1)
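Before interpreting the results, it can be worth confirming that every key variable is actually a column of the data. This optional check is not required by the package; it uses only base R.

# Optional sanity check: should return character(0) if all key variables are present.
setdiff(c(keyVars, keyVars.female), colnames(pedigrees))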
The output of the algorithm is an object of the "Duplicates" class, which is a list of 9 objects:
head(res$rawData)
res$ID
res$keyVars
head(res$Neighbors)
res$keysUsed
res$method
head(res$dupsList$thresh_5)
head(res$dupsReps$thresh_5)
res$details
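Since we ran the algorithm with all 7 thresholds, the duplicate clusters and their representatives are stored per threshold. The element names can be listed as below; following the thresh_5 naming used above, we expect thresh_1 through thresh_7.

# List the available thresholds in the duplicates output.
names(res$dupsList)
names(res$dupsReps)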
To obtain a deduplicated version of the original data, say using the results from the algorithm with a threshold of 5, we can run the following:
pedigrees.dedup <- pedigrees %>% filter(RequestID %in% res$dupsReps$thresh_5)
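As a rough sanity check (not produced by the package), the number of retained pedigrees after deduplication can be compared to the true number of unique families, i.e., the number of distinct FamIDs in this labeled example.

# Number of pedigrees kept after deduplication vs. the true number of families.
n_distinct(pedigrees.dedup$RequestID)
n_distinct(pedigrees$FamID)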
We can evaluate the performance of the deduplication using three metrics: pairwise F1, cluster F1, and generalized merge distance.
metrics <- summaryMetrics(object = res, requestID = "RequestID", famID = "FamID", thresh.i = 1:7)
metrics
For thresholds 1 and 2, the metrics were not evaluated because the number of detected clusters was less than half the true number of clusters. Computing the metrics in that situation is computationally intensive, and having so few clusters already indicates poor performance, so the calculation is skipped.
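To make the pairwise F1 metric concrete, below is a minimal, generic sketch (not the summaryMetrics implementation): every pair of records assigned to the same cluster is a predicted link, every pair sharing a true label is a true link, and pairwise F1 is the harmonic mean of the resulting precision and recall.

# Generic pairwise F1 for two cluster label vectors of equal length.
pairwiseF1 <- function(true_labels, pred_labels) {
  pairs <- combn(seq_along(true_labels), 2)           # all record pairs
  true_link <- true_labels[pairs[1, ]] == true_labels[pairs[2, ]]
  pred_link <- pred_labels[pairs[1, ]] == pred_labels[pairs[2, ]]
  tp <- sum(true_link & pred_link)                    # correctly predicted links
  precision <- tp / sum(pred_link)
  recall <- tp / sum(true_link)
  2 * precision * recall / (precision + recall)
}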