snip: Run entire SNIP deduplication algorithm


View source: R/snip.R

Description

Run entire SNIP deduplication algorithm

Usage

snip(
  pedigrees,
  requestID,
  isProband,
  keyVars,
  keyVars.male = NULL,
  keyVars.female = NULL,
  keyWt = NULL,
  blockVar = NULL,
  repSN = 1,
  windowSN = 10,
  keyLength = length(keyVars),
  method = "intersection",
  thresh = 1:7,
  priority,
  dateFormat = NULL,
  printRuntime = TRUE,
  seed = NULL
)

Arguments

pedigrees

Pedigree data to deduplicate.

requestID

Column that has the ID for the family.

isProband

Column that indicates the proband.

keyVars

Character vector of column names for the variables in the sort key

keyVars.male

Optional character vector of column names for the variables in the sort key that are specific to males

keyVars.female

Optional character vector of column names for the variables in the sort key that are specific to females

keyWt

Numeric vector of weights assigned to variables in the sort key, corresponding to keyVars. If NULL, the standard deviations of the variables in the data will be used as weights.
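As a rough illustration of the default described above, per-variable standard deviations can be computed in base R. The data frame `ped` and its column names are made up for this sketch; how snip() computes or applies the weights internally is not shown here.

```r
# Hypothetical pedigree-like data; column names are illustrative only.
ped <- data.frame(
  age        = c(40, 50, 60, 70),
  nRelatives = c(2, 4, 4, 6)
)
keyVars <- c("age", "nRelatives")

# One standard deviation per key variable, in the order of keyVars.
keyWt <- vapply(ped[keyVars], sd, numeric(1), na.rm = TRUE)
```

A vector built this way has one entry per element of keyVars, which is the shape the keyWt argument is described as expecting.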

blockVar

Vector of column names for the blocking variables, where families in different blocks will not be considered when searching for duplicates.
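The idea of blocking can be sketched in base R as follows. This is only an illustration of partitioning families by a blocking column so that comparisons stay within a block; the data frame `ped` and the column `site` are hypothetical, and the package's own blocking logic may differ.

```r
# Hypothetical data: four families, each assigned to one of two sites.
ped <- data.frame(
  requestID = c("A", "B", "C", "D"),
  site      = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)
blockVar <- "site"

# Partition family IDs by block; only IDs within the same element
# would be compared when searching for duplicates.
blocks <- split(ped$requestID, ped[blockVar], drop = TRUE)
```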

repSN

Number of iterations of sorted neighbors, each sorting the data according to a sort key.

windowSN

Integer representing the size of the sliding window to use during sorted neighbors.
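The sliding-window idea behind sorted neighbors can be sketched generically: sort records by a key, then treat any two records that fit inside the same window as a candidate pair. This is a minimal base R sketch of the general technique, not the snipR implementation; the function `sorted_neighbor_pairs` and the example keys are hypothetical.

```r
# Hypothetical records with a single, already-built sort key.
records <- data.frame(
  id  = 1:5,
  key = c("smith1970", "smyth1970", "jones1982", "smith1971", "jonas1982"),
  stringsAsFactors = FALSE
)

# Return all candidate pairs whose positions in the sorted order are
# fewer than `window` apart (i.e., they fit together in one window).
sorted_neighbor_pairs <- function(ids, keys, window) {
  # method = "radix" sorts character keys in the C locale.
  ord <- ids[order(keys, method = "radix")]
  n <- length(ord)
  pairs <- list()
  if (n >= 2) {
    for (i in seq_len(n - 1)) {
      last <- min(i + window - 1, n)
      if (last > i) {
        for (j in (i + 1):last) {
          pairs[[length(pairs) + 1]] <- c(ord[i], ord[j])
        }
      }
    }
  }
  if (length(pairs) == 0) {
    return(matrix(integer(0), ncol = 2))
  }
  do.call(rbind, pairs)
}

candidates <- sorted_neighbor_pairs(records$id, records$key, window = 2)
```

A larger window yields more candidate pairs (and more comparisons); a smaller one is faster but can miss duplicates whose keys do not sort next to each other.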

keyLength

Numeric vector giving the number of key variables (out of the available key variables) to concatenate per sort key. If missing, the key lengths will be randomly generated. The length of the vector should be repSN.

method

If "intersection", we use the intersection score. If "greedy", we use the greedy match score. If "both", we use both.

thresh

Vector of thresholds. If method = "intersection", then a pair is considered neighbors if the intersection match score is greater than the threshold. If method = "greedy", then the threshold is treated as a percentile, and a pair is considered neighbors if the greedy match score is greater than the percentile. If method = "both", then the user should provide a list, such as list(intersection = 1:7, threshold = c(0.8, 0.9)).

priority

A list of the form list(var = 'Varx', min = TRUE), where 'Varx' is a character value naming a column in the pedigree data. This parameter determines how to sort the duplicates. If min = TRUE, then the minimum value of 'Varx' is used for each duplicate entity. Otherwise, the maximum value is used.
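To make the min/max rule concrete, the sketch below orders a small group of duplicates by a priority variable and picks the first record. The data frame `dups` and its values are hypothetical, and using the top-ranked record as the representative is an assumption made for illustration rather than a statement of how snip() works internally.

```r
# Hypothetical duplicate group: three families judged to be the same entity.
dups <- data.frame(
  requestID = c("A", "B", "C"),
  Varx      = c(7, 3, 9),
  stringsAsFactors = FALSE
)
priority <- list(var = "Varx", min = TRUE)

# Order the group so the preferred record comes first:
# ascending when min = TRUE, descending otherwise.
ord <- order(dups[[priority$var]], decreasing = !priority$min)
representative <- dups$requestID[ord[1]]
```

With min = FALSE in the same list, the ordering would be reversed and the record with the largest 'Varx' would come first.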

dateFormat

Character string of the format of the date. This is only used if the priority variable is a date. The format should match the formats of class POSIXlt used in the base::strptime function.
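For reference, a format string of the kind base::strptime accepts looks like the following. The date values and the choice of "%m/%d/%Y" are illustrative assumptions; the format passed as dateFormat should match however dates are actually stored in the priority column.

```r
# Hypothetical date strings as they might appear in a priority column.
dates <- c("03/15/2021", "11/02/2019")
dateFormat <- "%m/%d/%Y"  # month/day/four-digit year

# strptime returns POSIXlt; a mismatched format yields NA.
parsed <- strptime(dates, format = dateFormat, tz = "UTC")

# Once parsed, dates can be compared, e.g. to find the earliest one.
earliest <- dates[which.min(as.numeric(as.POSIXct(parsed)))]
```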

printRuntime

If TRUE, prints the runtime.

seed

Optional seed for the random number generator, for reproducible results.

Value

An object of class Duplicates containing the duplicate entities and representatives for each duplicate entity (including singletons without duplicates).


bayesmendel/snipR documentation built on Jan. 25, 2022, 12:33 a.m.