knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(qprppedigree)
Pedigrees that are analysed by the software system PopReport must fullfill certain properties and must be formatted according to certain rules. The proporties and the fullfillment of the rules are checked by this package. The checks that are implemented in this package are described in this vignette.
The checks of the pedigrees are based on the following points
Animal IDs are the primary keys of an individual. Hence these IDs must be unique. This uniqueness is tested with the following statement using the function check_pedig_id()
. This function requires the path to an input pedigree as an argument. This package contains a number of test pedigrees which can be tested by the following statement.
# get path to test pedigrees (vec_ped_path <- list.files(system.file('extdata', package = 'qprppedigree'), full.names = TRUE))
The first pedigree is the test-pedigree that can be obtained from the GenMon-Website. The difference between the first and the second test-pedigree is that the first has an animal with 'F' as a sire. In the second pedigree, this sire is removed. The first pedigree is checked by the following statement.
check_pedig_id(ps_pedig_path = vec_ped_path[1], ps_id_col = '#animal')
The output of the test-pedigree indicates that all animal IDs are unique.
The checks of the remaining pedigrees are run by applying the function check_pedig_id
to all the pedigrees.
# run checks lapply(vec_ped_path[3:length(vec_ped_path)], check_pedig_id)
The output of the second pedigree shows that all animal IDs are unique. The output of the first pedigree shows a list of duplicated IDs. They must be removed.
Interesting quantities related to parents are
The following function call checks the properties of parents for the first test pedigree r vec_ped_path[2]
.
check_pedig_parent(ps_pedig_path = vec_ped_path[2], ps_id_col = '#animal', ps_sire_col = 'sire', ps_dam_col = 'dam', ps_col_bd = 'birth_date', pvec_sire_by = c('sire' = '#animal'), pvec_dam_by = c('dam' = '#animal'), pvec_sire_suffix = c(".animal", ".sire"), pvec_dam_suffix = c(".animal", ".dam"))
The checks for the other test pedigrees are done with the following apply-statement. This is possible, because these pedigrees have the same column headers.
lapply(vec_ped_path[3:length(vec_ped_path)], check_pedig_parent)
The processing of the pedigrees by PopReport and further by GenMon, requires that the columns of the pedigree contain objects of a certain data-type. The requirement of the data-types is checked by the function check_pedigree_datatypes()
. The following statement shows a few tests and experiments.
tbl_ped <- read_prp_pedigree(ps_pedig_path = vec_ped_path[1], ps_delim = '|')
The data-types can be checked using the function class()
.
class(tbl_ped[["#animal"]])
In the pedigree 'tbl_ped', the column 'sire' has an unexpected data-type which is shown by
class(tbl_ped[["sire"]])
The column birthdate can have a special data-type
class(tbl_ped[["birth_date"]])
A similar functionality can be achieved by the function readr::guess_parser()
.
readr::guess_parser(tbl_ped[["sire"]], guess_integer = TRUE)
For the column of dams
readr::guess_parser(tbl_ped[["dam"]], guess_integer = TRUE)
In case of a column with floating point numbers, we get
readr::guess_parser(tbl_ped[["introg"]], guess_integer = TRUE)
For each of the columns to be checked, we give the required data-type. This can be specified in a list
(l_req_dt <- list(col = c("#animal", "sire", "dam", "birth_date", "sex", "plz", "introg"), dtp = c("integer", "integer", "integer", "date", "character", "integer", "double")))
The check can be performed in a simple loop
require(readr) tbl_par_problem <- NULL for (idx in seq_along(l_req_dt$col)) { cat(" * Checking column: ", l_req_dt$col[idx], "\n") s_cur_parser <- readr::guess_parser(tbl_ped[[ l_req_dt$col[idx]]], guess_integer = TRUE) if (s_cur_parser != l_req_dt$dtp[idx]){ cat(" *** ERROR: column ", l_req_dt$col[idx], " wrong datatype\n") parfun <- match.fun(paste('parse_', l_req_dt$dtp[idx], sep = '')) par_result <- parfun(tbl_ped[[l_req_dt$col[idx]]]) tbl_par_problem <- problems(par_result) } } tbl_par_problem
The check is implemented in the function check_pedigree_datatypes()
. This function requires as an input a list with column names and required datatypes. Such a list can be obtained using the function get_pedigree_datatypes()
.
(l_ped_dt_result <- get_pedigree_datatypes(ps_pedig_path = vec_ped_path[1]))
The list required as input for check_pedigree_datatypes()
is obtained as the component l_dtype
of the result of get_pedigree_datatypes()
by
l_ped_dt_result$l_dtype
From this result, we can see that the column of "sire-IDs" has datatype r l_ped_dt_result$l_dtype$dtp[which(l_ped_dt_result$l_dtype$col == "sire")]
. This is not correct, because all ID-columns should have the same datatype. The problematic entry in the sire-ID column can be found by the result of the function check_pedigree_datatypes()
.
The check can now be run by
l_dtype <- l_ped_dt_result$l_dtype l_dtype$dtp[which(l_dtype$col == "sire")] <- "integer" (l_dtp_check_result <- check_pedigree_datatypes(ps_pedig_path = vec_ped_path[1], pl_dtype = l_dtype))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.