#vec_pkgs <- c("pryr", "readxl","magrittr", "dplyr")
#if (!all(sapply(vec_pkgs, 
#       function(x) x  %in% installed.packages(), USE.NAMES = FALSE))){
#  for (pgk in vec_pkgs){
#    if (!pgk %in% installed.packages())
#      install.packages(pkgs = pgk)
#  }  
#}
require(pryr)
require(magrittr)
require(dplyr)
require(PedigreeFromTvdData)

Consistency Checks

Besides reading the data, checking for certain consistency properties is also required in our project. So far, we have read our data into a nested list of lists which is a specialized data structure. The goal of this section is to adapt the consistency checks for the resulting data structure of one of the faster reading methods.

Data Input

Based on our comparison of reading fwf-pedigree-data into R, we use package LaF to read the data and directly convert the data read into a tbl_df. This is done in the function laf_open_fwf_tvd_input of package PedigreeFromTvdData.

sDataFileName <- system.file(file.path("extdata","KLDAT_20170524_10000.txt"), 
                             package = "PedigreeFromTvdData")
vecFormat <- c(22,14,3,31,8,14,3,1,14,3)
tbl_ped <- laf_open_fwf_tvd_input(ps_input_file = sDataFileName,
                                  pvec_col_position = vecFormat)

Besides the time it takes to read pedigree data, we are also interested in the amount of memory it takes to store the pedigree data. We are using package pryr to get information about memory usage. For more information about R memory usage, we refer to Advanced R. The size in memory of the complete pedigree is

pryr::object_size(tbl_ped)

The total amount of memory used by all objects is

pryr::mem_used()

Checking format of TVD-Ids

In a first consistency check, we are correcting for wrongly formatted TVD-Ids. In the test dataset that we used so far, all IDs are correctly formatted, hence the original pedigree will be the same as the corrected version after the verification.

lIdCols <- getTvdIdCols()
system.time(tbl_ped_id_checked <- check_tvd_id_tbl(ptblPedigree = tbl_ped,
                                                   plIdCols = lIdCols))

The timing values indicate how long it takes to do the checks. The checks are done by giving the original pedigree as input and the checked and modified pedigree is obtained as a function result. In the modified pedigree any IDs that are recognized as incorrect are set to NA. In order to monitor the amount of memory used during the check, we use the function pryr::mem_change().

pryr::mem_change(tbl_ped_id_checked <- check_tvd_id_tbl(ptblPedigree = tbl_ped,
                                                        plIdCols = lIdCols))

Because the change in memory is very small, we can see that the method of consistency check should not have a negative impact on the memory usage.

The consistency check for the TVD-Ids is successful, if the original pedigree and the verified pedigree after the check are the same.

# check whether ID-columns of original and checked pedigrees are the same
all(tbl_ped_id_checked[[lIdCols$TierIdCol]] == tbl_ped[[lIdCols$TierIdCol]])

Implementation of Pedigree Checks

A more detailed description of a series of consistency checks is given in a separate vignette.

Resources

Session Info

sessionInfo()

Latest Update

r paste(Sys.time(),paste0("(", Sys.info()[["user"]],")" ))



pvrqualitasag/PedigreeFromTvdData documentation built on May 29, 2019, 7:50 a.m.