knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Disclaimer

This document describes different strategies on how pedigree transformations can be implemented. A general description of strategies related to pedigree checks and transformations is available in a separate vignette on Pedigree Checks - Concepts and Strategies.

Introduction

Pedigree transformations are functions that take a pedigree as input and check the pedigree for a given consistency criterion. The output of a transformation function is a modified pedigree that fullfills the given consistency criterion.

Data structures

The pedigree is read from a text-file and is stored as a dataframe. Currently we are using the tibble- or tbl_df-version of dataframes. More information about tibbles can also be obtained from the chapter on tibbles in R for DataScience.

Operations

When looking at the required operations that are needed to transform any given pedigree into a consistent pedigree, only the following two operations are needed.

  1. Removing a complete record from a pedigree
  2. Setting one field of a certain record in the pedigee to NA

In what follows the two operations are described in more details

Removing complete records

The operation that removes a complete record from a pedigree takes as input a pedigree and a primary key of a record. In our case IDs of animals are used as primary keys. The output of the operation is the pedigree that does no longer contain the specified record. From that description, we can derive the following pre-conditions (in pseudo-code) for the operation

# declarations:
  pedigree : tbl_df dataframe with pedigree information
  record_tbd_pk : primary key of record to be deleted

# pre-conditions:
  pedigree is not null
  record_tbd_pk is in pedigree[pk_column]

A simple example - removing one record

We are reading a simple pedigree into a tbl_df and remove a given record

require(magrittr)
require(PedigreeFromTvdData)
### # define name of pedigree file
sDataFileName <- system.file(file.path("extdata","KLDAT_20170524_10000.txt"), 
                             package = "PedigreeFromTvdData")
### # read pedigree
tbl_pedigree <- laf_open_fwf_tvd_input(ps_input_file = sDataFileName, pb_out = TRUE)
#head(tbl_pedigree)
#tail(tbl_pedigree)

Now we specify a primary key of a record that should be deleted

(rec_tbd_pk <- "CH120006405592")

The corresponding record should be deleted

res_tbl_pedigree <- tbl_pedigree %>% dplyr::filter(V12 != rec_tbd_pk)

The new pedigree must have one row fewer than the original pedigree and it must not contain the deleted record

dim(tbl_pedigree)
dim(res_tbl_pedigree)

Now we search for the deleted record and it should not be found anymore.

res_tbl_pedigree %>% dplyr::filter(V12 == rec_tbd_pk)

Removing a series of records

The above operation of deleting a single record from a pedigree can be done easily using the dplyr::filter() function. Now we want to extend the functionality to be able to remove a series of records. The series of records to be deleted is identified by a vector of primary keys, or alternatively by a dataframe that contains the primary keys of the records to be removed.

(vec_rec_tbd_pk <- c("CH120001976905", "CH120006405592", "CH120001807094", "CH120003434748"))

We want to create a operation that discards all records with the above primary keys. One approach is to try it with set-operations. See the following learn data science blog for a more detailed description

res_tbl_pedigree <- tbl_pedigree %>% dplyr::filter(!V12 %in% vec_rec_tbd_pk)
dim(res_tbl_pedigree)

The search for the primary keys in vec_rec_tbd_pk must not find any results in the transformed pedigree which is shown below.

res_tbl_pedigree %>% dplyr::filter(V12 %in% vec_rec_tbd_pk)

Creating a function

The above shown operations are fine as proof of concepts, but they only become useful when we can put them into a function. So let us give it a try

remove_rec <- function(ptbl_pedigree, 
                       pvec_rec_tbr_pk, 
                       pn_pk_col_idx = getTvdIdColsDsch()$TierIdCol){
  res_tbl_pedigree <- ptbl_pedigree %>% dplyr::filter(!.[[pn_pk_col_idx]] %in% pvec_rec_tbr_pk)
  return(res_tbl_pedigree)
}

Testing the above created function remove_rec()

res_tbl_pedigree_func <- remove_rec(ptbl_pedigree = tbl_pedigree,
                                    pvec_rec_tbr_pk = vec_rec_tbd_pk)

The resulting dataframe res_tbl_pedigree_func must be identical to what we obtained before the function as the dataframe res_tbl_pedigree.

dplyr::setequal(res_tbl_pedigree, res_tbl_pedigree_func)

Searching the deleted records shows

res_tbl_pedigree_func %>% dplyr::filter(V12 %in% vec_rec_tbd_pk)

Setting a field in a record to NA

The second operation requires more input, because we also have to know which field in a given record must be set to NA. The description in pseudo code is similar to the first operation.

# declarations:
  pedigree : tbl_df dataframe with pedigree information
  record_pk : primary key of record from which a field should be set to NA
  field_idx : column index of field to be set to NA

# pre-conditions:
  pedigree is not null
  record_pk is in pedigree[pk_column]
  field_idx less than or equal ncol(pedigree)

A simple example - setting one field of one record to NA

From the already known pedigree tbl_pedigree, we set one field of a given record to NA. The record is identified by its primary key.

(s_rec_pk <- "CH120003746483")

Then we assume that we want to set the column entitled with "V5" to NA.

(tbl_pedigree_sna <- tbl_pedigree %>% dplyr::mutate(V5 = replace(V5, which(V12 == s_rec_pk), NA)))

We check by comparing the original record for this primary key and the one after applying the operation. First, we start with the original record

tbl_pedigree %>% dplyr::filter(V12 == s_rec_pk) %>% dplyr::select(V5,V11,V12,V16)

Then we compare it to the modified record

tbl_pedigree_sna %>% dplyr::filter(V12 == s_rec_pk) %>% dplyr::select(V5,V11,V12,V16)

Extending this to setting a field to NA in several record

The several records are again defined by a vector containing the primary keys of the records for which a given field should be set to NA. The idea that we want to use here is to define an index vector of all rows corresponding to the records in which the fields should be set to NA.

vec_rec_pk <- c("CH120001976905", "CH120006405592", "CH120001807094", "CH120003434748")
vec_row_idx <- sapply(vec_rec_pk, function(x) which(tbl_pedigree$V12 == x), USE.NAMES = FALSE)
tbl_pedigree_vna <- tbl_pedigree %>% dplyr::mutate(V5 = replace(V5, vec_row_idx, NA))

To check whether the field V5 is set to NA in the desired records, we can run the following checks.

tbl_pedigree_vna %>% dplyr::filter(V12 %in% vec_rec_pk) %>% dplyr::select(V5,V11,V12,V16)

Alternatives to mutate - replace

When trying to convert the above statements into a function, we found it difficult to come up with a working solution. The following suggestions are based on a RStudio blog post. We are first looking into the section Vector Functions and later into the section of Predicate Functions to come up with a usable solution for our problem.

Vector Functions

For that reason, we are trying to use the dplyr-verb case_when().

tbl_pedigree_cw <- tbl_pedigree %>% dplyr::mutate(V5 = dplyr::if_else(V12 %in% vec_rec_pk, NA_character_, V5))

Checking again with the same test

tbl_pedigree_cw %>% dplyr::filter(V12 %in% vec_rec_pk) %>% dplyr::select(V5,V11,V12,V16)

Other records should not have the field V5 set to NA

head(tbl_pedigree_cw)

Predicate Functions

Predicate functions such as mutate_if() do changes of columns but based on conditions/predicates that are applied to columns not to rows. Hence those are not really helpful here.

Setting fields to NA using a function

Analogously to the first operation, we also want to define a function that implements this operation

#' Setting one field of a list of records in a pedigree to NA
#' 
#' @param ptbl_pedigree pedigree as tbl_df
#' @param pvec_rec_pk vector of primary keys identifying the records
#' @param ps_field_name column name in ptbl_pedigree where certain fields must be set to NA
#' @param ps_pk_name column name where primary keys can be found
#' @return tbl_pedigree_result pedigree with given fields set to NA
set_field_na <- function(ptbl_pedigree, 
                         pvec_rec_pk, 
                         ps_field_name,
                         ps_pk_name){
  ### # create an expression from the column name of primary keys
  expr <- dplyr::enquo(ps_pk_name)
  ### # create an expression from the column name where the fields 
  ### #  are that should be set to NA
  var_expr <- dplyr::enquo(ps_field_name)
  ### # create the variable name which is used for the lhs of the 
  ### #  assignement
  var_name <- dplyr::quo_name(var_expr)

  tbl_pedigree_result <-
    dplyr::mutate(ptbl_pedigree,
                  !!var_name := dplyr::if_else((!!expr) %in% pvec_rec_pk, NA_character_, !!var_expr))

  return(tbl_pedigree_result)
}

The function call looks as follows

tbl_pedigree_vna_func <- set_field_na(tbl_pedigree, vec_rec_pk, V5, V12)
set_field_na(tbl_pedigree, vec_rec_pk, V11, V12)

Checks

Checking whether fields are set to NA

tbl_pedigree_vna_func %>% dplyr::filter(V12 %in% vec_rec_pk) %>% dplyr::select(V5,V11,V12,V16)

And the following two pedgirees must be the same

dplyr::setequal(tbl_pedigree_vna, tbl_pedigree_vna_func)

Session Info

sessionInfo()

Latest Update

r paste(Sys.time(),paste0("(", Sys.info()[["user"]],")" ))



pvrqualitasag/PedigreeFromTvdData documentation built on May 29, 2019, 7:50 a.m.