This vignette describes concepts, strategies and systematics developed for running data consistency checks for pedigree data.
From a theoretical point of view, pedigrees are Directed Acyclic Graphs (DAG), where a graph $G$ is defined by the set of vertices or nodes $V$ and the set of edges $E$. The set of vertices corresponds to the set of individuals in the pedigree. Edges correspond to directed relationships, and in a pedigree the direction of an edge is always from a parent to its offspring. A pedigree must not contain any directed cycles, i.e. there cannot be any path of linked directed edges that leads from an individual back to itself. In other words, no individual can appear among its own ancestors. These two properties, directed edges and the absence of cycles, together have led to the name DAG.
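The acyclicity property can be tested with a topological sort. The following is a minimal sketch in base R using Kahn's algorithm; it is not taken from this package. The function name `has_cycle()` and the column names `id`, `sire` and `dam` are assumptions made for illustration, and unknown parents are assumed to be coded as `NA`.

```r
# Hypothetical helper: returns TRUE if the pedigree contains a directed cycle.
# `ped` is assumed to be a data frame with columns id, sire and dam,
# where unknown parents are coded as NA.
has_cycle <- function(ped) {
  ids <- as.character(ped$id)
  # Edge list parent -> offspring, dropping unknown parents
  edges <- data.frame(
    from = c(as.character(ped$sire), as.character(ped$dam)),
    to   = rep(ids, 2),
    stringsAsFactors = FALSE
  )
  edges <- edges[!is.na(edges$from), , drop = FALSE]
  nodes <- unique(c(ids, edges$from))
  # In-degree of every node, counted from the edge list
  indeg <- setNames(numeric(length(nodes)), nodes)
  if (nrow(edges) > 0) {
    tab <- table(edges$to)
    indeg[names(tab)] <- as.numeric(tab)
  }
  # Kahn's algorithm: repeatedly remove nodes without incoming edges
  queue <- names(indeg)[indeg == 0]
  n_removed <- 0
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    n_removed <- n_removed + 1
    for (child in edges$to[edges$from == node]) {
      indeg[child] <- indeg[child] - 1
      if (indeg[child] == 0) queue <- c(queue, child)
    }
  }
  # Nodes that could not be removed lie on, or downstream of, a cycle
  n_removed < length(nodes)
}

# Valid pedigree: A and B are founders, C is their offspring, D descends from C
ped_ok <- data.frame(
  id   = c("A", "B", "C", "D"),
  sire = c(NA, NA, "A", "C"),
  dam  = c(NA, NA, "B", "B"),
  stringsAsFactors = FALSE
)
# Invalid pedigree: A is recorded as the offspring of its own offspring C
ped_bad <- data.frame(
  id   = c("A", "B", "C"),
  sire = c("C", NA, "A"),
  dam  = c(NA, NA, "B"),
  stringsAsFactors = FALSE
)
has_cycle(ped_ok)   # FALSE
has_cycle(ped_bad)  # TRUE
```

A topological sort was chosen here because it needs no recursion and touches every edge only once, which matters for large pedigrees.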
In a directed graph, we can distinguish between the number of edges that are coming into a node and the number of edges going out of a node. The former number is called the in-degree and the latter number the out-degree. For a pedigree, the maximum in-degree for every node is $2$, because every individual has at most two recorded parents. The out-degree corresponds to the number of offspring and is not bounded.
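As an illustration, in-degree and out-degree can be computed directly from a tabular pedigree. The sketch below uses a small made-up pedigree; the column names `id`, `sire` and `dam` and the coding of unknown parents as `NA` are assumptions, and each `id` is assumed to occur in exactly one row.

```r
# Hypothetical pedigree; unknown parents are coded as NA (an assumption)
ped <- data.frame(
  id   = c("A", "B", "C", "D"),
  sire = c(NA, NA, "A", "C"),
  dam  = c(NA, NA, "B", "B"),
  stringsAsFactors = FALSE
)

# In-degree: number of known parents recorded for each individual
in_degree <- setNames(rowSums(!is.na(ped[, c("sire", "dam")])), ped$id)

# Out-degree: number of records that name the individual as sire or dam
parents <- c(ped$sire, ped$dam)
out_degree <- vapply(
  ped$id,
  function(x) sum(parents == x, na.rm = TRUE),
  integer(1)
)

# With one sire and one dam column the in-degree cannot exceed 2
stopifnot(all(in_degree <= 2))

in_degree   # A = 0, B = 0, C = 2, D = 2
out_degree  # A = 1, B = 2, C = 1, D = 0
```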
Based on the described properties, we can define a set of consistency requirements that must be fulfilled when a pedigree is constructed.
This section describes consistency requirements that are derived from the properties of a pedigree described in the previous section.
The following list of requirements is derived from the properties of a DAG:
Uniqueness of individuals: Because a pedigree is a DAG and its nodes are defined as a set, nodes must be unique. Hence every individual can appear only once in a pedigree. As a consequence, the IDs associated with the nodes can be used as a unique means of identification. In the database world this corresponds to a primary key, i.e. a value that uniquely identifies a given record.
No cycles: The property of a pedigree being acyclic was already described in the previous section.
In-degree of nodes: The maximum in-degree of every node is $2$.
Parents older than offspring: Parents must be born before their offspring.
Correct sex of parents: An individual recorded as father must be male and an individual recorded as mother must be female.
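Three of these requirements (uniqueness, parental age and parental sex) can be sketched as simple lookups in base R. The function names, the column names `id`, `sire`, `dam`, `sex` and `birthdate`, and the sex codes `"M"` and `"F"` are assumptions made for illustration; they are not the check functions implemented in this package. The example data contain one deliberate error of each kind.

```r
# Hypothetical pedigree with three deliberate errors:
#  - D is born before its sire C
#  - E has the female B recorded as sire
#  - F appears in two records
ped <- data.frame(
  id        = c("A", "B", "C", "D", "E", "F", "F"),
  sire      = c(NA, NA, "A", "C", "B", NA, NA),
  dam       = c(NA, NA, "B", "B", "B", NA, NA),
  sex       = c("M", "F", "M", "F", "F", "M", "M"),
  birthdate = as.Date(c("2010-01-01", "2010-06-01", "2012-03-01",
                        "2011-01-01", "2014-01-01", "2015-01-01",
                        "2015-01-01")),
  stringsAsFactors = FALSE
)

# Uniqueness: IDs that occur in more than one record
duplicated_ids <- function(ped) {
  unique(ped$id[duplicated(ped$id)])
}

# Age: records whose known sire or dam is not born before the offspring
parent_not_older <- function(ped) {
  sire_birth <- ped$birthdate[match(ped$sire, ped$id)]
  dam_birth  <- ped$birthdate[match(ped$dam, ped$id)]
  bad <- (!is.na(sire_birth) & sire_birth >= ped$birthdate) |
    (!is.na(dam_birth) & dam_birth >= ped$birthdate)
  ped$id[which(bad)]
}

# Sex: records whose known sire is not male or whose known dam is not female
wrong_parent_sex <- function(ped, male = "M", female = "F") {
  sire_sex <- ped$sex[match(ped$sire, ped$id)]
  dam_sex  <- ped$sex[match(ped$dam, ped$id)]
  bad <- (!is.na(sire_sex) & sire_sex != male) |
    (!is.na(dam_sex) & dam_sex != female)
  ped$id[which(bad)]
}

duplicated_ids(ped)    # "F"
parent_not_older(ped)  # "D"
wrong_parent_sex(ped)  # "E"
```

Note that `match()` returns the first matching record, so the age and sex lookups are only reliable once duplicated IDs have been resolved. Parents without their own record, or with a missing birthdate or sex, are skipped rather than flagged in this sketch.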
There are additional properties which are more related to data-processing issues. Those issues mostly involve the correctness of certain data formats.
Implementations depend, among other things, on how a given pedigree is represented as data.
One of the most commonly used data representations of a pedigree is the so-called node-list or adjacency-list. This is a tabular list with columns containing IDs for animals, IDs for parents and additional information such as sex and birthdate. One row of the list corresponds to the available information for a given individual and hence must be unique. Such a row is also called a pedigree record.
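A node-list of this kind maps naturally to a data frame with one row per pedigree record. The sketch below shows a made-up example and how the graph view (edges from parents to offspring) can be derived from it. All column names and values are assumptions for illustration only.

```r
# Hypothetical node-list: one row (pedigree record) per individual
node_list <- data.frame(
  id        = c("A", "B", "C", "D"),
  sire      = c(NA, NA, "A", "C"),
  dam       = c(NA, NA, "B", "B"),
  sex       = c("M", "F", "M", "F"),
  birthdate = as.Date(c("2010-01-01", "2010-06-01",
                        "2012-03-01", "2014-05-01")),
  stringsAsFactors = FALSE
)

# Every record must be unique with respect to the animal ID
stopifnot(anyDuplicated(node_list$id) == 0)

# Edge list of the underlying graph: one row per parent -> offspring edge
edges <- data.frame(
  parent    = c(node_list$sire, node_list$dam),
  offspring = rep(node_list$id, 2),
  stringsAsFactors = FALSE
)
edges <- edges[!is.na(edges$parent), ]

# Offspring grouped by parent, i.e. the outgoing edges of every parent node
offspring_of <- split(edges$offspring, edges$parent)
offspring_of$B  # "C" "D"
```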
When it comes to verifying the consistency requirements, two types of implementation routines can be imagined: checks and transformations.
Each check is described in the companion vignette Pedigree Checks - Implementations, which is also available in this package.
Each transformation is described in the companion vignette Pedigree Transformation - Implementations, which is also available in this package.
Reading Fixed Width Formatted Data: comparison of different methods for reading large amounts of data.
Pedigree Checks - Implementations: description of each implemented check-function.
Pedigree Consistency Checks: time and memory requirement for some check-functions.
Strategies for Pedigree Transformations: description of different methods for deleting a record or invalidating a field.
Pedigree Transformation - Implementations: description of each implemented transformation-function.
Main Pedigree Build: description of how each check-function and transformation-function is combined in one function for processing TVD-Data.
CheckList for TODOs: checklist of points to keep in mind when merging code.
User Manual: Manual.
sessionInfo()
r paste(Sys.time(),paste0("(", Sys.info()[["user"]],")" ))