In mPloenzke/bsnR:

x--- title: "bsnR Workflow" author: "Matthew Ploenzke" date: "r Sys.Date()" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{bsnR Workflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}

This vignette is intended to showcase the typical workflow for creating and manipulating the objects created in the bsnR package.

Dataset

The pedigrees dataset simulated above includes records for simulated family pedigrees.

rm(list=ls(all=TRUE))
library(bsnR)
library(tidyverse)
data('pedigrees')
self <- pedigrees %>%
  filter(ID==1)
mothers <- pedigrees %>% 
  group_by(requestId) %>%
  arrange(ID, .by_group=TRUE) %>%
  filter(ID==MotherID[1]) %>%
  ungroup %>% as.data.frame()
fathers <- pedigrees %>% 
  group_by(requestId) %>%
  arrange(ID, .by_group=TRUE) %>%
  filter(ID==FatherID[1]) %>%
  ungroup %>% as.data.frame()
mothersmothers <- pedigrees %>% 
  group_by(requestId) %>%
  arrange(ID, .by_group=TRUE) %>% 
  mutate(tempID = MotherID[1]) %>% 
  filter(n()>1) %>%
  filter(ID == MotherID[ID==tempID]) %>% 
  ungroup %>% as.data.frame()
mothersfathers <- pedigrees %>% 
  group_by(requestId) %>%
  arrange(ID, .by_group=TRUE) %>% 
  mutate(tempID = MotherID[1]) %>% 
  filter(n()>1) %>%
  filter(ID == FatherID[ID==tempID]) %>% 
  ungroup %>% as.data.frame()
fathersmothers <- pedigrees %>% 
  group_by(requestId) %>%
  arrange(ID, .by_group=TRUE) %>% 
  mutate(tempID = FatherID[1]) %>% 
  filter(n()>1) %>%
  filter(ID == MotherID[ID==tempID]) %>% 
  ungroup %>% as.data.frame()
fathersfathers <- pedigrees %>% 
  group_by(requestId) %>%
  arrange(ID, .by_group=TRUE) %>% 
  mutate(tempID = FatherID[1]) %>% 
  filter(n()>1) %>%
  filter(ID == FatherID[ID==tempID]) %>% 
  ungroup %>% as.data.frame()

Setting key variables

Define the key variables below for the sorted neighbors. These should correspond to columns in the dataset under consideration for which we eventually wish to include in the duplicate pair match scoring.

# variables to be used in generating the sort keys
binary.v <- c("Gender","AffectedBreast","AffectedOvary","AffectedColon","AffectedEndometrium","AffectedPancreas","AffectedSkin","Twins","BRCA1","BRCA2","MLH1","MSH2","MSH6","P16")
cont.v <- c("AgeBreast","AgeOvary","AgeColon","AgeEndometrium","AgePancreas","AgeSkin")
vars <- c(binary.v,cont.v, "senderIP")

Initialize a Neighbors object

bsnR is heavily dependent upon various S3 classes, the first of which is the Neighbors class. Below we initialize an object of this class by providing the raw data, a character input defining which column to be used as the ID variable, and the previously defined key variables.

# generate object of Neighbors class
selfs.neighbsObj <- Neighbors(self, ID="requestId",keyVars=vars)
summary(selfs.neighbsObj)
# repeat for all family members
mothers.neighbsObj <- Neighbors(mothers, ID="requestId",keyVars=vars)
fathers.neighbsObj <- Neighbors(fathers, ID="requestId",keyVars=vars)
mothersmothers.neighbsObj <- Neighbors(mothersmothers, ID="requestId",keyVars=vars)
mothersfathers.neighbsObj <- Neighbors(mothersfathers, ID="requestId",keyVars=vars)
fathersmothers.neighbsObj <- Neighbors(fathersmothers, ID="requestId",keyVars=vars)
fathersfathers.neighbsObj <- Neighbors(fathersfathers, ID="requestId",keyVars=vars)

Perform BSN

We may now perform an iteration of the blocked sorted neighbors algorithm. The input must be a Neighbors object created in the previous step. The algorithm partitions the IDs based on the blocking variable provided and then repeats sorted neighbors iterations, each time collecting those neighbors within the specified sorted neighbors window. This creates a Blocks object, which is essentially a Neighbors object which now contains information about the BSN call.

# perform an iteration of BSN algorithm blocking on IP address
selfs.blocksObj <- blockedSN(selfs.neighbsObj, blockVar='senderIP', repSN=1, windowSN=2)
summary(selfs.blocksObj)
# repeat for all family members
mothers.blocksObj <- blockedSN(mothers.neighbsObj, blockVar='senderIP', repSN=1, windowSN=2)
fathers.blocksObj <- blockedSN(fathers.neighbsObj, blockVar='senderIP', repSN=1, windowSN=2)
mothersmothers.blocksObj <- blockedSN(mothersmothers.neighbsObj, blockVar='senderIP', repSN=1, windowSN=2)
mothersfathers.blocksObj <- blockedSN(mothersfathers.neighbsObj, blockVar='senderIP', repSN=1, windowSN=2)
fathersmothers.blocksObj <- blockedSN(fathersmothers.neighbsObj, blockVar='senderIP', repSN=1, windowSN=2)
fathersfathers.blocksObj <- blockedSN(fathersfathers.neighbsObj, blockVar='senderIP', repSN=1, windowSN=2)

It is simple to perform another call of the BSN algorithm and append the new neighbors.

# and perform another
selfs.blocksObj <- blockedSN(selfs.blocksObj, blockVar='BRCA1', repSN=10, windowSN=1)
summary(selfs.blocksObj)

We may also see the blocking variables and sort keys used on the various calls.

print(selfs.blocksObj$keysUsed)

Generating background

Background neighbors are potential duplicate pairs found by randomly choosing two neighbors from the list of all IDs. In theory, the BSN algorithm should outperform these background pairs due to the nature of the algorithm. Here we find an equal number of background neighbors as we currently have from the previous BSN calls.

# randomly generate potential pairs to compare to
## num>1 -> count, num<=1 -> proportion of current neighbors found via BSN
selfs.blocksObj <- sampleBackground(selfs.blocksObj, num=1)

We may leverage the pedigree information by intersecting the found pairs across defined family member types (father, mother, etc.) and only retaining those pairs which were found a sufficient number of times (here we set the threshold to 2).

self.pairs <- cbind(selfs.blocksObj$Neighbors,member="self")
mothers.pairs <- cbind(mothers.blocksObj$Neighbors,member="mothers")
fathers.pairs <- cbind(fathers.blocksObj$Neighbors,member="fathers")
mothersmothers.pairs <- cbind(mothersmothers.blocksObj$Neighbors,member="mothersmothers")
mothersfathers.pairs <- cbind(mothersfathers.blocksObj$Neighbors,member="mothersfathers")
fathersmothers.pairs <- cbind(fathersmothers.blocksObj$Neighbors,member="fathersmothers")
fathersfathers.pairs <- cbind(fathersfathers.blocksObj$Neighbors,member="fathersfathers")
all.pairs <- rbind(self.pairs,mothers.pairs,fathers.pairs,mothersmothers.pairs,
                   mothersfathers.pairs,fathersmothers.pairs,fathersfathers.pairs)
unique.pairs <- as.tibble(all.pairs) %>% 
  group_by(ID1, ID2) %>%
  mutate(found=n()) %>%
    filter(n()>2) %>% 
  filter(row_number()==1) %>% 
  select(ID1,ID2,background)
  ungroup %>% as.data.frame

blocksObj <- selfs.blocksObj
blocksObj$Neighbors <- as.matrix(unique.pairs)

Scoring neighbors

We now promote the Blocks object to a Scores object by scoring each pair of records. That is to say each pair of neighbors found previously are scored based on the key variables and their associated weights. The scoring is based on simple matching: string variables and binary variables are scored 0/1 for non-match/match and continuous (numeric) variables are scored based on a simple distance with an maximum value of 1 for match and 1/(abs(difference)+1) for non-matches.

# get match score (not incorporating pedigree structure)
scoresObj <- scoreNeighbors(blocksObj)
summary(scoresObj)

Distributions may also be visualized by calling plot on the Scores object.

plot(scoresObj, type="box")
plot(scoresObj, type="density")

Greedy pedigree scoring

Researchers may also wish to leverage the known pedigree structure inherent in the data. To do so, we instead score the Blocks object by calling scoreFamilies and specifying the full dataset and the necessary ID variables. Here ID refers to the column in the datasets for an individuals ID (e.g. all Selfs are ID==1). MotherID and FatherID point to the mothers' and fathers' own ID values, respectively. Note the scoring takes much longer as each member type of the candidate pedigree is greedily matched against member types in the comparison pedigree to attempt to maximize the match score and leverage the known parent-child relationships. In other words, all siblings of an individual (ID==1) in pedigree 1 are compared to all siblings of the individual in pedigree 2.

# or get match score incorporating greedy pedigree matching
famscoresObj <- scoreFamilies(blocksObj, pedigrees, "ID", "MotherID","FatherID")
summary(famscoresObj)
plot(famscoresObj, type="box")
plot(famscoresObj, type="density")

Deduplication

We are now ready to define a threshold and see how the BSN algorithm has performed. The threshold input is a proportion such that only those candidate duplicate pedigrees above the percentile are defined as true duplicate pairs. In the deDuplicate call below pedigrees will be assigned to duplicate entities; these entities are comprised of all families passing the threshold and completing the associative set (e.g. if family A and family B are candidate pairs passing the threshold, as are families A and C, then the duplicate entity consists of families A, B, C). The priority input defines how to sort these families within the entity. The function call returns a Duplicates object. The above scoring procedures are useful when the variables exhibit sufficient variability; in our dataset, however, the pedigrees are relatively homogeneous and thus the scoring procedure does not provide significant gains. For this reason we set a very low threshold (0.01).

# deduplicate on arbitrary threshold
orderon <- list(VAR="AgeBreast", DESC=TRUE)
duplicatesObj <- deDuplicate(scoresObj, thresh=.01, priority=orderon)
summary(duplicatesObj)

We may again plot these to see how the scores above the threshold defined as duplicate pairs compare to those below.

plot(duplicatesObj, type="density")
plot(duplicatesObj, type="box")
plot(duplicatesObj, type="family counts")

Output formatting

The Duplicates object from above is not so useful for data analysis. Thus it is now time to transform it back to a useable list reflecting the analysis and duplicate ordering defined previously. Note if the user wishes to remove any pedigrees from an duplicate entity, they may do so by simply removing that row from the entity held in the Duplicates dupsList slot.

# format back to a list
cleanList <- reformatDuplicates(duplicatesObj)

The big reveal

Should labels exist we may now check our accuracy.

# reveal accuracy if labels exist
selfs.deduped <- revealTruth(duplicatesObj, truelabels="FamilyID", duplabels="requestId")

Simulating new pedigrees

You may also simulate your own duplicated pedigrees. See sample code below for formatting. The BayesMendel package is a dependency for this and you will need to source the script from the R folder.

# create a simulation dataset with duplicates and miscodings
library(BayesMendel)
source('SimulateFamily.R')
pedigrees <- makeSimDataset(nFams=500,pDup=.05,singleP=0,stdP=.95,
                     parents=1,gparents=1,sibs=.9,ants=.8,cuz=.25,errRt=1.5,
                     numUnique=8,birthrt=c(3,2))
row.names(pedigrees) <- NULL
pedigrees <- pedigrees[,1:(ncol(pedigrees)-2)]
pedigrees <- pedigrees[,-ncol(pedigrees)]