require(PedigreeFromTvdData)
require(dplyr)
require(magrittr)

Disclaimer

This vignette describes implementation of checks that are described in the companion vignette on strategies entitled Pedigree Checks - Concepts and Strategies.

Checks

The vignette on the strategy defined checks to be routines that search for pedigree records that do not fullfill certain requirement conditions. The different type of checks have different requirements in computing resources. The check whether the pedigree contains any cycles is quite expensive to compute. One possible way to implement this check is to use an algorithm that computes the toplogical sort of the pedigree nodes. This construction does not give a result, whenever the pedigree contains any loops. For that reason this check is currently not implemented.

Checking for unique IDs

As the result returned from checks are lists of primary keys of pedigree records, we must ensure that these primary keys are available. This must be done right after reading the pedigree from the input file.

The pedigree is read using the following statement

sDataFileName <- system.file(file.path("extdata","KLDAT_20170524_10000.txt"), 
                             package = "PedigreeFromTvdData")
tbl_ped_uni_id <- laf_open_fwf_tvd_input(ps_input_file = sDataFileName, pb_out = TRUE)

As mentioned above, we first check for uniqueness of the animal-IDs. This is done with the following function call

tbl_non_uni <- check_unique_animal_id(tbl_ped_uni_id, pb_out = TRUE)
# check whether there are any non-unique ids
if (nrow(tbl_non_uni) > 0){
  cat(" *** Nr of duplicate IDs found: ", nrow(tbl_non_uni), "\n")
} else {
  cat(" *** No duplicate IDs found: ", nrow(tbl_non_uni), "\n")
}

For a more detailed check of the function, we have created a data set that contains duplicate IDs. This constructed dataset is now used for a further test.

s_data_file_dup_id <- system.file(file.path("extdata","KLDAT_20171214_dup_id_100.txt"), 
                             package = "PedigreeFromTvdData")
tbl_ped_dup_id <- laf_open_fwf_tvd_input(ps_input_file = s_data_file_dup_id, pb_out = TRUE)
(tbl_non_uni_dup_id <- check_unique_animal_id(tbl_ped_dup_id, pb_out = TRUE))

If you want to know which records are duplicate then

(tbl_ped_dup_id %>% filter(V12==tbl_non_uni_dup_id$Animal[1]))

Whenever, the check for uniqueness of the IDs is successful the property of the maximum in-degree of every node is also fullfilled. Hence we do not need to write a separate function for a check on the maximum in-degree.

Parents older than offspring

The check whether parents are older than offspring is done with the function check_parent_older_offspring() which has to be called for dams and sires separately. Per default the difference of age between parents and offspring have to be bigger than 1 year pn_date_diff_tol = 10^4.

For the morthers

(l_tvd_id_col_dsch <- getTvdIdColsDsch())
(n_bd_col_idx <- getBirthdateColIdxDsch())
(tbl_fail_mother_age <- check_parent_older_offspring(ptbl_pedigree = tbl_ped_uni_id, 
                                         pn_offspring_col = l_tvd_id_col_dsch$TierIdCol, 
                                         pn_birthday_col = n_bd_col_idx, 
                                         pn_parent_col = l_tvd_id_col_dsch$MutterIdCol))

The same for the fathers

(tbl_fail_father_age <- check_parent_older_offspring(ptbl_pedigree = tbl_ped_uni_id, 
                                         pn_offspring_col = l_tvd_id_col_dsch$TierIdCol, 
                                         pn_birthday_col = n_bd_col_idx, 
                                         pn_parent_col = l_tvd_id_col_dsch$VaterIdCol))

Parents sex consistency

The check if the sex of the parents are consistent is done with the function check_sex_tbl() which has to be called for dams and sires separately.

For mothers and fathers:

(tbl_result_consistencysex <- check_sex_tbl(ptblPedigree = tbl_ped_uni_id))

We have created a data set that contains mother with inconsistent sex. This constructed dataset is now used for a further test

sDataFileSex <- system.file(file.path("extdata","KLDAT_20170524_100_mother_sex.txt"), 
                             package = "PedigreeFromTvdData")
tbl_ped_sex <- laf_open_fwf_tvd_input(ps_input_file = sDataFileSex, pb_out = TRUE)
(tbl_result_sex <- check_sex_tbl(ptblPedigree = tbl_ped_sex))

Correct format of IDs for individuals and parents

The format of TVD-IDs will be checked: position 1 to 2 have to be letters and the rest numbers. This function hast to be call each time to check tvdid of individual, mother and father.

Here is an exemple with the check for individual:

sDataFileName <- system.file(file.path("extdata","KLDAT_20170524_10000.txt"),package = "PedigreeFromTvdData")
tbl_ped_uni_id <- laf_open_fwf_tvd_input(ps_input_file = sDataFileName, pb_out = TRUE)
tbl_ped_uni_id <- tbl_ped_uni_id[c(1:100),]

lIdCols <- getTvdIdColsDsch()
(tbl_result_tvdformat <- correct_tvd_format_tbl(p_tbl_ped = tbl_ped_uni_id,
                                   plFormatBorder = getTVDIdBorder(),
                                   plIdCols = lIdCols,
                                   pnIdCol = lIdCols$TierIdCol))

We have created a data set that contains inconsistent IDs format. This constructed dataset is now used for a further test

sData_NotCorrectId <- system.file(file.path("extdata","KLDAT_20170524_20_notcorrectId.txt"),
                                  package ="PedigreeFromTvdData")
tbl_ped_correctFormat <- laf_open_fwf_tvd_input(ps_input_file = sData_NotCorrectId, pb_out = TRUE)
lIdCols <- getTvdIdColsDsch()
(tbl_result_tvdformat <- correct_tvd_format_tbl(p_tbl_ped = tbl_ped_correctFormat,
                                   plFormatBorder = getTVDIdBorder(),
                                   plIdCols = lIdCols,
                                   pnIdCol = lIdCols$TierIdCol))

TO DO: Check influence of too short Ids or records!

Correct format of bithdates

The format of birthdates will be checked: days (1-31), months (1-12), years (>1950)

(tbl_result_checkbirthdate <- check_birthdate_tbl(ptblPedigree = tbl_ped_uni_id))

We have created a data set that contains inconsistent Birthdates. This constructed dataset is now used for a further test

sData_NotCorrectBirthdate <- system.file(file.path("extdata","KLDAT_20170524_20_notcorrectBirthdate.txt"),
                                  package ="PedigreeFromTvdData")
tbl_ped_NcorrectBirthdate <- laf_open_fwf_tvd_input(ps_input_file = sData_NotCorrectBirthdate, pb_out = TRUE)

(tbl_result_checkbirthdate_nC <- check_birthdate_tbl(ptblPedigree = tbl_ped_NcorrectBirthdate))

Test with a big data set

Checking for unique IDs

(mb.uniqId <- microbenchmark::microbenchmark(
  data.base = PedigreeFromTvdData::check_unique_animal_id(ptbl_pedigree = tbl_ped_uni_id),
  times = 5, unit = "s"))

Parents older than offspring

l_tvd_id_col_dsch <- getTvdIdColsDsch()
n_bd_col_idx <- getBirthdateColIdxDsch()
(mb.parentOld <- microbenchmark::microbenchmark(
  data.base = PedigreeFromTvdData::check_parent_older_offspring(ptbl_pedigree = tbl_ped_uni_id,
                                                                pn_offspring_col = l_tvd_id_col_dsch$TierIdCol, 
                                                                pn_birthday_col = n_bd_col_idx, 
                                                                pn_parent_col = l_tvd_id_col_dsch$MutterIdCol),
  times = 5, unit = "s"))

Parents sex consistency

(mb.sexConsistency <- microbenchmark::microbenchmark(
  data.base = PedigreeFromTvdData::check_sex_tbl(ptblPedigree = tbl_ped_uni_id),
  times = 5, unit = "s"))

Correct format of IDs for individuals

lIdCols <- getTvdIdColsDsch()
(mb.format <- microbenchmark::microbenchmark(
  data.base = PedigreeFromTvdData::correct_tvd_format_tbl(p_tbl_ped = tbl_ped_uni_id,
                                                          pnIdCol = lIdCols$TierIdCol),
  times = 5, unit = "s"))

Correct format of bithdates

(mb.birthdate <- microbenchmark::microbenchmark(
  data.base = PedigreeFromTvdData::check_birthdate_tbl(ptblPedigree = tbl_ped_uni_id),
  times = 5, unit = "s"))

Overview of results The following table shows the median times (column MedianTime) in seconds for all function that were used.

vecMethods <- c("PedigreeFromTvdData::check_unique_animal_id",
                "PedigreeFromTvdData::check_parent_older_offspring",
                "PedigreeFromTvdData::check_sex_tbl",
                "PedigreeFromTvdData::correct_tvd_format_tbl",
                "PedigreeFromTvdData::check_birthdate_tbl")
vecMedTimes <- c(mb.uniqId$time[3],
                 mb.parentOld$time[3],
                 mb.sexConsistency$time[3],
                 mb.format$time[3],
                 mb.birthdate$time[3])
nMinTime <- vecMedTimes[order(vecMedTimes)][1]
dfMedTime <- data.frame(Methode = vecMethods,
                        MedianTime = round(vecMedTimes*10^(-9), digits = 4))
knitr::kable(dfMedTime)

Session Info

sessionInfo()

Latest Update

r paste(Sys.time(),paste0("(", Sys.info()[["user"]],")" ))



pvrqualitasag/PedigreeFromTvdData documentation built on Feb. 16, 2018, 2:20 a.m.