This vignette is intended to go through a working example of implementing the SNIP algorithm on an example pedigree data set. We first load the snipR package, along with the tidyverse package.

library(snipR)
library(tidyverse)

Dataset

We will be deduplicating the pedigrees data set, which is already included in the package. The data represent pedigrees that are ascertained based on family history, and include information on cancer statuses and ages, genetic testing results, etc. Each family has a FamID; duplicate families will have the same FamID. Each family has a unique RequestID. Hence, the goal of the algorithm is to detect the duplicates, or the families with the same FamID. Within a family, each individual has an ID, and the MotherID and FatherID represent the ID of the mother and father, respectively. A portion of the data set is provided below. The data is mostly in the format for running the Mendelian model PanelPRO (https://projects.iq.harvard.edu/bayesmendel/panelpro).

head(pedigrees)

SNIP

Running the algorithm

We now run SNIP on the data set. We use the cancer statuses and ages, the genetic testing results, and the family size as the key variables. We use the default parameters for the algorithm: 1 sort key iteration, a sliding window size of 10, a key length of 19 (the number of key variables), and all 7 thresholds. We use the breast cancer age (AgeBC) as the priority variable, where for each cluster of duplicates, we choose the family with the proband with the minimum breast cancer age.

cancers <- c("BC", "OC", "COL", "ENDO", "PANC", "MELA")
genes <- c("BRCA1", "BRCA2", "MLH1", "MSH2", "MSH6", "CDKN2A")
keyVars <- c(paste0("isAff", cancers), paste0("Age", cancers),
             genes, "famSize")
keyVars.female <- c("isAffOC", "isAffENDO")
res <- snip(pedigrees = pedigrees, requestID = "RequestID", isProband = "isProband", keyVars = keyVars,
            keyVars.female = keyVars.female, priority = list(var = "AgeBC", min = TRUE), seed = 1)

Results

The output of the algorithm is an object of the "Duplicates" class, which is a list of 9 objects:

  1. rawData: Copy of the raw data (pedigrees)
  2. ID: Name of the requestID variable
  3. keyVars: Character vector of the key variables (keyVars)
  4. Neighbors: Matrix of 3 columns, where each row represents two families that were considered neighbors for at least one of the seven core relative types. The columns are the requestIDs of the two families, as well as the number of relative types (out of 7) where the two famililes were found to be neighbors.
  5. keysUsed List of the blocking variables (blockVar) and key variables (keyVars) used in each iteration of the algorithm. In each iteration, we randomly select from the key variables, where the number of key variables we select is determined by the key length parameter (keyLength).
  6. method Type of match score used. The default is "intersection", but can also be "greedy" or "both".
  7. dupsList List containing the duplicate clusters. Each threshold has its own list, and each a list for each threshold, where each threshold has a list where each element is a vector of requestIDs that are considered duplicates.
  8. dupsReps List containing a vector for each threshold, where each threshold has a vector containing the requestIDs of the representatives of each duplicate cluster.
  9. details A list indicating the thresholds and priority variable
head(res$rawData)
res$ID
res$keyVars
head(res$Neighbors)
res$keysUsed
res$method
head(res$dupsList$thresh_5)
head(res$dupsReps$thresh_5)
res$details

Deduplicated data

To obtain a deduplicated version of the original data, say using the results from the algorithm with a threshold of 5, we can run the following:

pedigrees.dedup <- pedigrees %>% filter(RequestID %in% res$dupsReps$thresh_5)

Evaluating performance

We can evaluate the performance of the deduplication using three metrics: pairwise F1, cluster F1, and generalized merge distance.

metrics <- summaryMetrics(object = res, requestID = "RequestID", famID = "FamID", thresh.i = 1:7)
metrics

For thresholds 1 and 2, metrics were not evaluated because the number of clusters was less than half the true number of clusters. This will cause the calculation of the metrics to be computationally intensive, and hence we skip it since having so few clusters will inherently indicate poor performance.



bayesmendel/snipR documentation built on Jan. 25, 2022, 12:33 a.m.